1,367 research outputs found

    FDTD/K-DWM simulation of 3D room acoustics on general purpose graphics hardware using compute unified device architecture (CUDA)

    Get PDF
    The growing demand for reliable prediction of sound fields in rooms have resulted in adaptation of various approaches for physical modeling, including the Finite Difference Time Domain (FDTD) and the Digital Waveguide Mesh (DWM). Whilst considered versatile and attractive methods, they suffer from dispersion errors that increase with frequency and vary with direction of propagation, thus imposing a high frequency calculation limit. Attempts have been made to reduce such errors by considering different mesh topologies, by spatial interpolation, or by simply oversampling the grid. As the latter approach is computationally expensive, its application to three-dimensional problems has often been avoided. In this paper, we propose an implementation of the FDTD on general purpose graphics hardware, allowing for high sampling rates whilst maintaining reasonable calculation times. Dispersion errors are consequently reduced and the high frequency limit is increased. A range of graphics processors are evaluated and compared with traditional CPUs in terms of accuracy, calculation time and memory requirements

    Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels

    Full text link
    Recent NVIDIA Graphics Processing Units (GPUs) can execute multiple kernels concurrently. On these GPUs, the thread block scheduler (TBS) uses the FIFO policy to schedule their thread blocks. We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. To improve performance and fairness, we propose use of the preemptive Shortest Remaining Time First (SRTF) policy instead. Although SRTF requires an estimate of runtime of GPU kernels, we show that such an estimate of the runtime can be easily obtained using online profiling and exploiting a simple observation on GPU kernels' grid structure. Specifically, we propose a novel Structural Runtime Predictor. Using a simple Staircase model of GPU kernel execution, we show that the runtime of a kernel can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench, and find that it can estimate the actual runtime reasonably well after the execution of only a single thread block. Next, we design a thread block scheduler that is both concurrent kernel-aware and uses this predictor. We implement the SRTF policy and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. When compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive which controls resource usage of concurrently executing kernels to maximize fairness. SRTF/Adaptive improves STP by 1.12x, ANTT by 2.23x and Fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves system throughput to within 12.64% of Shortest Job First (SJF, an oracle optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.Comment: 14 pages, full pre-review version of PACT 2014 poste

    Performance analysis and optimization of automotive GPUs

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD) have drastically increased the performance demands of automotive systems. Suitable highperformance platforms building upon Graphic Processing Units (GPUs) have been developed to respond to this demand, being NVIDIA Jetson TX2 a relevant representative. However, whether high-performance GPU configurations are appropriate for automotive setups remains as an open question. This paper aims at providing light on this question by modelling an automotive GPU (Jetson TX2), analyzing its microarchitectural parameters against relevant benchmarks, and identifying specific configurations able to meaningfully increase performance within similar cost envelopes, or to decrease costs preserving original performance levels. Overall, our analysis opens the door to the optimization of automotive GPUs for further system efficiency.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P, the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 772773) and the HiPEAC Network of Excellence. Pedro Benedicte and Jaume Abella have been partially supported by the MINECO under FPU15/01394 grant and Ramon y Cajal postdoctoral fellowship number RYC-2013-14717 respectively and Leonidas Kosmidis under Juan de la Cierva-Formacin postdoctoral fellowship (FJCI-2017-34095).Peer ReviewedPostprint (author's final draft

    PERFORMANCE ANALYSIS AND FITNESS OF GPGPU AND MULTICORE ARCHITECTURES FOR SCIENTIFIC APPLICATIONS

    Get PDF
    Recent trends in computing architecture development have focused on exploiting task- and data-level parallelism from applications. Major hardware vendors are experimenting with novel parallel architectures, such as the Many Integrated Core (MIC) from Intel that integrates 50 or more x86 processors on a single chip, the Accelerated Processing Unit from AMD that integrates a multicore x86 processor with a graphical processing unit (GPU), and many other initiatives from other hardware vendors that are underway. Therefore, various types of architectures are available to developers for accelerating an application. A performance model that predicts the suitability of the architecture for accelerating an application would be very helpful prior to implementation. Thus, in this research, a Fitness model that ranks the potential performance of accelerators for an application is proposed. Then the Fitness model is extended using statistical multiple regression to model both the runtime performance of accelerators and the impact of programming models on accelerator performance with high degree of accuracy. We have validated both performance models for all the case studies. The error rate of these models, calculated using the experimental performance data, is tolerable in the high-performance computing field. In this research, to develop and validate the two performance models we have also analyzed the performance of several multicore CPUs and GPGPU architectures and the corresponding programming models using multiple case studies. The first case study used in this research is a matrix-matrix multiplication algorithm. By varying the size of the matrix from a small size to a very large size, the performance of the multicore and GPGPU architectures are studied. The second case study used in this research is a biological spiking neural network (SNN), implemented with four neuron models that have varying requirements for communication and computation making them useful for performance analysis of the hardware platforms. We report and analyze the performance variation of the four popular accelerators (Intel Xeon, AMD Opteron, Nvidia Fermi, and IBM PS3) and four advanced CPU architectures (Intel 32 core, AMD 32 core, IBM 16 core, and SUN 32 core) with problem size (matrix and network size) scaling, available optimization techniques and execution configuration. This thorough analysis provides insight regarding how the performance of an accelerator is affected by problem size, optimization techniques, and accelerator configuration. We have analyzed the performance impact of four popular multicore parallel programming models, POSIX-threading, Open Multi-Processing (OpenMP), Open Computing Language (OpenCL), and Concurrency Runtime on an Intel i7 multicore architecture; and, two GPGPU programming models, Compute Unified Device Architecture (CUDA) and OpenCL, on a NVIDIA GPGPU. With the broad study conducted using a wide range of application complexity, multiple optimizations, and varying problem size, it was found that according to their achievable performance, the programming models for the x86 processor cannot be ranked across all applications, whereas the programming models for GPGPU can be ranked conclusively. We also have qualitatively and quantitatively ranked all the six programming models in terms of their perceived programming effort. The results and analysis in this research indicate and are supported by the proposed performance models that for a given hardware system, the best performance for an application is obtained with a proper match of programming model and architecture
    • …
    corecore