597 research outputs found

    q-State Potts model metastability study using optimized GPU-based Monte Carlo algorithms

    Get PDF
    We implemented a GPU based parallel code to perform Monte Carlo simulations of the two dimensional q-state Potts model. The algorithm is based on a checkerboard update scheme and assigns independent random numbers generators to each thread. The implementation allows to simulate systems up to ~10^9 spins with an average time per spin flip of 0.147ns on the fastest GPU card tested, representing a speedup up to 155x, compared with an optimized serial code running on a high-end CPU. The possibility of performing high speed simulations at large enough system sizes allowed us to provide a positive numerical evidence about the existence of metastability on very large systems based on Binder's criterion, namely, on the existence or not of specific heat singularities at spinodal temperatures different of the transition one.Comment: 30 pages, 7 figures. Accepted in Computer Physics Communications. code available at: http://www.famaf.unc.edu.ar/grupos/GPGPU/Potts/CUDAPotts.htm

    The ESCAPE project : Energy-efficient Scalable Algorithms for Weather Prediction at Exascale

    Get PDF
    In the simulation of complex multi-scale flows arising in weather and climate modelling, one of the biggest challenges is to satisfy strict service requirements in terms of time to solution and to satisfy budgetary constraints in terms of energy to solution, without compromising the accuracy and stability of the application. These simulations require algorithms that minimise the energy footprint along with the time required to produce a solution, maintain the physically required level of accuracy, are numerically stable, and are resilient in case of hardware failure. The European Centre for Medium-Range Weather Forecasts (ECMWF) led the ESCAPE (Energy-efficient Scalable Algorithms for Weather Prediction at Exascale) project, funded by Horizon 2020 (H2020) under the FET-HPC (Future and Emerging Technologies in High Performance Computing) initiative. The goal of ESCAPE was to develop a sustainable strategy to evolve weather and climate prediction models to next-generation computing technologies. The project partners incorporate the expertise of leading European regional forecasting consortia, university research, experienced high-performance computing centres, and hardware vendors. This paper presents an overview of the ESCAPE strategy: (i) identify domain-specific key algorithmic motifs in weather prediction and climate models (which we term Weather & Climate Dwarfs), (ii) categorise them in terms of computational and communication patterns while (iii) adapting them to different hardware architectures with alternative programming models, (iv) analyse the challenges in optimising, and (v) find alternative algorithms for the same scheme. The participating weather prediction models are the following: IFS (Integrated Forecasting System); ALARO, a combination of AROME (Application de la Recherche a l'Operationnel a Meso-Echelle) and ALADIN (Aire Limitee Adaptation Dynamique Developpement International); and COSMO-EULAG, a combination of COSMO (Consortium for Small-scale Modeling) and EULAG (Eulerian and semi-Lagrangian fluid solver). For many of the weather and climate dwarfs ESCAPE provides prototype implementations on different hardware architectures (mainly Intel Skylake CPUs, NVIDIA GPUs, Intel Xeon Phi, Optalysys optical processor) with different programming models. The spectral transform dwarf represents a detailed example of the co-design cycle of an ESCAPE dwarf. The dwarf concept has proven to be extremely useful for the rapid prototyping of alternative algorithms and their interaction with hardware; e.g. the use of a domain-specific language (DSL). Manual adaptations have led to substantial accelerations of key algorithms in numerical weather prediction (NWP) but are not a general recipe for the performance portability of complex NWP models. Existing DSLs are found to require further evolution but are promising tools for achieving the latter. Measurements of energy and time to solution suggest that a future focus needs to be on exploiting the simultaneous use of all available resources in hybrid CPU-GPU arrangements

    High Performance Computing via High Level Synthesis

    Get PDF
    As more and more powerful integrated circuits are appearing on the market, more and more applications, with very different requirements and workloads, are making use of the available computing power. This thesis is in particular devoted to High Performance Computing applications, where those trends are carried to the extreme. In this domain, the primary aspects to be taken into consideration are (1) performance (by definition) and (2) energy consumption (since operational costs dominate over procurement costs). These requirements can be satisfied more easily by deploying heterogeneous platforms, which include CPUs, GPUs and FPGAs to provide a broad range of performance and energy-per-operation choices. In particular, as we will see, FPGAs clearly dominate both CPUs and GPUs in terms of energy, and can provide comparable performance. An important aspect of this trend is of course design technology, because these applications were traditionally programmed in high-level languages, while FPGAs required low-level RTL design. The OpenCL (Open Computing Language) developed by the Khronos group enables developers to program CPU, GPU and recently FPGAs using functionally portable (but sadly not performance portable) source code which creates new possibilities and challenges both for research and industry. FPGAs have been always used for mid-size designs and ASIC prototyping thanks to their energy efficient and flexible hardware architecture, but their usage requires hardware design knowledge and laborious design cycles. Several approaches are developed and deployed to address this issue and shorten the gap between software and hardware in FPGA design flow, in order to enable FPGAs to capture a larger portion of the hardware acceleration market in data centers. Moreover, FPGAs usage in data centers is growing already, regardless of and in addition to their use as computational accelerators, because they can be used as high performance, low power and secure switches inside data-centers. High-Level Synthesis (HLS) is the methodology that enables designers to map their applications on FPGAs (and ASICs). It synthesizes parallel hardware from a model originally written C-based programming languages .e.g. C/C++, SystemC and OpenCL. Design space exploration of the variety of implementations that can be obtained from this C model is possible through wide range of optimization techniques and directives, e.g. to pipeline loops and partition memories into multiple banks, which guide RTL generation toward application dependent hardware and benefit designers from flexible parallel architecture of FPGAs. Model Based Design (MBD) is a high-level and visual process used to generate implementations that solve mathematical problems through a varied set of IP-blocks. MBD enables developers with different expertise, e.g. control theory, embedded software development, and hardware design to share a common design framework and contribute to a shared design using the same tool. Simulink, developed by MATLAB, is a model based design tool for simulation and development of complex dynamical systems. Moreover, Simulink embedded code generators can produce verified C/C++ and HDL code from the graphical model. This code can be used to program micro-controllers and FPGAs. This PhD thesis work presents a study using automatic code generator of Simulink to target Xilinx FPGAs using both HDL and C/C++ code to demonstrate capabilities and challenges of high-level synthesis process. To do so, firstly, digital signal processing unit of a real-time radar application is developed using Simulink blocks. Secondly, generated C based model was used for high level synthesis process and finally the implementation cost of HLS is compared to traditional HDL synthesis using Xilinx tool chain. Alternative to model based design approach, this work also presents an analysis on FPGA programming via high-level synthesis techniques for computationally intensive algorithms and demonstrates the importance of HLS by comparing performance-per-watt of GPUs(NVIDIA) and FPGAs(Xilinx) manufactured in the same node running standard OpenCL benchmarks. We conclude that generation of high quality RTL from OpenCL model requires stronger hardware background with respect to the MBD approach, however, the availability of a fast and broad design space exploration ability and portability of the OpenCL code, e.g. to CPUs and GPUs, motivates FPGA industry leaders to provide users with OpenCL software development environment which promises FPGA programming in CPU/GPU-like fashion. Our experiments, through extensive design space exploration(DSE), suggest that FPGAs have higher performance-per-watt with respect to two high-end GPUs manufactured in the same technology(28 nm). Moreover, FPGAs with more available resources and using a more modern process (20 nm) can outperform the tested GPUs while consuming much less power at the cost of more expensive devices

    Improving Mobile SOC\u27s Performance as an Energy Efficient DSP Platform with Heterogeneous Computing

    Get PDF
    Mobile system-on-chip (SOC) technology is improving at a staggering rate spurred primarily by the adoption of smartphones and tablets. This rapid innovation has allowed the mobile SOC to be considered in everything from high performance computing to embedded applications. In this work, modern SOC\u27s heterogeneous computing capabilities are evaluated with a focus toward digital signal processing (DSP). Evaluation is conducted on modern consumer devices running Android operating system and leveraging the relatively new RenderScript Compute to utilize CPU resources alongside other compute resources such as graphics processing units (GPUs) and digital signal processors. In order to benchmark these concepts, several implementations of both the discrete Fourier transform (DFT) and the fast Fourier transform (FFT) are tested across devices. The results show both improvement in performance and energy efficiency on many devices compared to traditional Java implementations and indicate that the mobile SOC is a relevant platform for DSP applications

    GPU ์—๋Ÿฌ ์•ˆ์ •์„ฑ ๋ณด์žฅ์„ ์œ„ํ•œ ์ปดํŒŒ์ผ๋Ÿฌ ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2020. 8. ์ด์žฌ์ง„.Due to semiconductor technology scaling and near-threshold voltage computing, soft error resilience has become more important. Nowadays, GPUs are widely used in high performance computing (HPC) because of its efficient parallel processing and modern GPUs designed for HPC use error correction code (ECC) to protect their storage including register files. However, adopting ECC in the register file imposes high area and energy overhead. To replace the expensive hardware cost of ECC, we propose Penny, a lightweight compiler-directed resilience scheme for GPU register file protection. We combine recent advances in idempotent recovery with low-cost error detection code. Our approach focuses on solving two important problems: 1. Can we guarantee correct error recovery using idempotent execution with error detection code? We show that when an error detection code is used with idempotence recovery, certain restrictions required by previous idempotent recovery schemes are no longer needed. We also propose a software-based scheme to prevent the checkpoint value from being overwritten before the end of the region where the value is required for correct recovery. 2. How do we reduce the execution overhead caused by checkpointing? In GPUs additional checkpointing store instructions inflicts considerably higher overhead compared to CPUs, due to its architectural characteristics, such as lack of store buffers. We propose a number of compiler optimizations techniques that significantly reduce the overhead.๋ฐ˜๋„์ฒด ๋ฏธ์„ธ๊ณต์ • ๊ธฐ์ˆ ์ด ๋ฐœ์ „ํ•˜๊ณ  ๋ฌธํ„ฑ์ „์•• ๊ทผ์ฒ˜ ์ปดํ“จํŒ…(near-threashold voltage computing)์ด ๋„์ž…๋จ์— ๋”ฐ๋ผ์„œ ์†Œํ”„ํŠธ ์—๋Ÿฌ๋กœ๋ถ€ํ„ฐ์˜ ๋ณต์›์ด ์ค‘์š”ํ•œ ๊ณผ์ œ๊ฐ€ ๋˜์—ˆ๋‹ค. ๊ฐ•๋ ฅํ•œ ๋ณ‘๋ ฌ ๊ณ„์‚ฐ ์„ฑ๋Šฅ์„ ์ง€๋‹Œ GPU๋Š” ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ…์—์„œ ์ค‘์š”ํ•œ ์œ„์น˜๋ฅผ ์ฐจ์ง€ํ•˜๊ฒŒ ๋˜์—ˆ๊ณ , ์Šˆํผ ์ปดํ“จํ„ฐ์—์„œ ์“ฐ์ด๋Š” GPU๋“ค์€ ์—๋Ÿฌ ๋ณต์› ์ฝ”๋“œ์ธ ECC๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ ๋ฐ ๋ฉ”๋ชจ๋ฆฌ ๋“ฑ์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณดํ˜ธํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ์— ECC๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ํฐ ํ•˜๋“œ์›จ์–ด๋‚˜ ์—๋„ˆ์ง€ ๋น„์šฉ์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ์ด๋Ÿฐ ๊ฐ’๋น„์‹ผ ECC์˜ ํ•˜๋“œ์›จ์–ด ๋น„์šฉ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ๊ธฐ๋ฐ˜์˜ ์ €๋น„์šฉ GPU ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ ๋ณต์› ๊ธฐ๋ฒ•์ธ Penny๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด๋Š” ์ตœ์‹ ์˜ ๋ฉฑ๋“ฑ์„ฑ(idempotency) ๊ธฐ๋ฐ˜ ์—๋Ÿฌ ๋ณต์› ๊ธฐ๋ฒ•์„ ์ €๋น„์šฉ์˜ ์—๋Ÿฌ ๊ฒ€์ถœ ์ฝ”๋“œ(EDC)์™€ ๊ฒฐํ•ฉํ•œ ๊ฒƒ์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋‹ค์Œ ๋‘๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐ์— ์ง‘์ค‘ํ•œ๋‹ค. 1. ์—๋Ÿฌ ๊ฒ€์ถœ ์ฝ”๋“œ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฉฑ๋“ฑ์„ฑ ๊ธฐ๋ฐ˜ ์—๋Ÿฌ ๋ณต์›์„ ์‚ฌ์šฉ์‹œ ์†Œํ”„ํŠธ ์—๋Ÿฌ๋กœ๋ถ€ํ„ฐ์˜ ์•ˆ์ „ํ•œ ๋ณต์›์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?} ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์—๋Ÿฌ ๊ฒ€์ถœ ์ฝ”๋“œ๊ฐ€ ๋ฉฑ๋“ฑ์„ฑ ๊ธฐ๋ฐ˜ ๋ณต์› ๊ธฐ์ˆ ๊ณผ ๊ฐ™์ด ์‚ฌ์šฉ๋˜์—ˆ์„ ๊ฒฝ์šฐ ๊ธฐ์กด์˜ ๋ณต์› ๊ธฐ๋ฒ•์—์„œ ํ•„์š”๋กœ ํ–ˆ๋˜ ์กฐ๊ฑด๋“ค ์—†์ด๋„ ์•ˆ์ „ํ•˜๊ฒŒ ์—๋Ÿฌ๋กœ๋ถ€ํ„ฐ ๋ณต์›ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค. 2. ์ฒดํฌํฌ์ธํŒ…์—๋“œ๋Š” ๋น„์šฉ์„ ์–ด๋–ป๊ฒŒ ์ ˆ๊ฐํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?} GPU๋Š” ์Šคํ† ์–ด ๋ฒ„ํผ๊ฐ€ ์—†๋Š” ๋“ฑ ์•„ํ‚คํ…์ณ์ ์ธ ํŠน์„ฑ์œผ๋กœ ์ธํ•ด์„œ CPU์™€ ๋น„๊ตํ•˜์—ฌ ์ฒดํฌํฌ์ธํŠธ ๊ฐ’์„ ์ €์žฅํ•˜๋Š” ๋ฐ์— ํฐ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋“ ๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ค„์ธ๋‹ค.1 Introduction 1 1.1 Why is Soft Error Resilience Important in GPUs 1 1.2 How can the ECC Overhead be Reduced 3 1.3 What are the Challenges 4 1.4 How do We Solve the Challenges 5 2 Comparison of Error Detection and Correction Coding Schemes for Register File Protection 7 2.1 Error Correction Codes and Error Detection Codes 8 2.2 Cost of Coding Schemes 9 2.3 Soft Error Frequency of GPUs 11 3 Idempotent Recovery and Challenges 13 3.1 Idempotent Execution 13 3.2 Previous Idempotent Schemes 13 3.2.1 De Kruijf's Idempotent Translation 14 3.2.2 Bolts's Idempotent Recovery 15 3.2.3 Comparison between Idempotent Schemes 15 3.3 Idempotent Recovery Process 17 3.4 Idempotent Recovery Challenges for GPUs 18 3.4.1 Checkpoint Overwriting 20 3.4.2 Performance Overhead 20 4 Correctness of Recovery 22 4.1 Proof of Safe Recovery 23 4.1.1 Prevention of Error Propagation 23 4.1.2 Proof of Correct State Recovery 24 4.1.3 Correctness in Multi-Threaded Execution 28 4.2 Preventing Checkpoint Overwriting 30 4.2.1 Register renaming 31 4.2.2 Storage Alternation by Checkpoint Coloring 33 4.2.3 Automatic Algorithm Selection 38 4.2.4 Future Works 38 5 Performance Optimizations 40 5.1 Compilation Phases of Penny 40 5.1.1 Region Formation 41 5.1.2 Bimodal Checkpoint Placement 41 5.1.3 Storage Alternation 42 5.1.4 Checkpoint Pruning 43 5.1.5 Storage Assignment 44 5.1.6 Code Generation and Low-level Optimizations 45 5.2 Cost Estimation Model 45 5.3 Region Formation 46 5.3.1 De Kruijf's Heuristic Region Formation 46 5.3.2 Region splitting and Region Stitching 47 5.3.3 Checkpoint-Cost Aware Optimal Region Formation 48 5.4 Bimodal Checkpoint Placement 52 5.5 Optimal Checkpoint Pruning 55 5.5.1 Bolt's Naive Pruning Algorithm and Overview of Penny's Optimal Pruning Algorithm 55 5.5.2 Phase 1: Collecting Global-Decision Independent Status 56 5.5.3 Phase2: Ordering and Finalizing Renaming Decisions 60 5.5.4 Effectiveness of Eliminating the Checkpoints 63 5.6 Automatic Checkpoint Storage Assignment 69 5.7 Low-Level Optimizations and Code Generation 70 6 Evaluation 74 6.1 Test Environment 74 6.1.1 GPU Architecture and Simulation Setup 74 6.1.2 Tested Applications 75 6.1.3 Register Assignment 76 6.2 Performance Evaluation 77 6.2.1 Overall Performance Overheads 77 6.2.2 Impact of Penny's Optimizations 78 6.2.3 Assigning Checkpoint Storage and Its Integrity 79 6.2.4 Impact of Optimal Checkpoint Pruning 80 6.2.5 Impact of Alias Analysis 81 6.3 Repurposing the Saved ECC Area 82 6.4 Energy Impact on Execution 83 6.5 Performance Overhead on Volta Architecture 85 6.6 Compilation Time 85 7 RelatedWorks 87 8 Conclusion and Future Works 89 8.1 Limitation and Future Work 90Docto

    Characterization and Acceleration of High Performance Compute Workloads

    Get PDF
    • โ€ฆ
    corecore