1,510 research outputs found

    MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications

    Full text link
    Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). To meet such demand, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is the support of mixed-precision enabled GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness but significant performance, area, and memory footprint improvement. While promising, the mixed-precision computation on error resilience remains unexplored. To this end, we develop a fault injection framework that systematically injects fault into the mixed-precision computation results. We investigate how the faults affect the accuracy of machine learning applications. Based on the error resilience characteristics, we offer lightweight error detection and correction solutions that significantly improve the overall model accuracy if the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines

    Extensions of Task-based Runtime for High Performance Dense Linear Algebra Applications

    Get PDF
    On the road to exascale computing, the gap between hardware peak performance and application performance is increasing as system scale, chip density and inherent complexity of modern supercomputers are expanding. Even if we put aside the difficulty to express algorithmic parallelism and to efficiently execute applications at large scale, other open questions remain. The ever-growing scale of modern supercomputers induces a fast decline of the Mean Time To Failure. A generic, low-overhead, resilient extension becomes a desired aptitude for any programming paradigm. This dissertation addresses these two critical issues, designing an efficient unified linear algebra development environment using a task-based runtime, and extending a task-based runtime with fault tolerant capabilities to build a generic framework providing both soft and hard error resilience to task-based programming paradigm. To bridge the gap between hardware peak performance and application perfor- mance, a unified programming model is designed to take advantage of a lightweight task-based runtime to manage the resource-specific workload, and to control the data ow and parallel execution of tasks. Under this unified development, linear algebra tasks are abstracted across different underlying heterogeneous resources, including multicore CPUs, GPUs and Intel Xeon Phi coprocessors. Performance portability is guaranteed and this programming model is adapted to a wide range of accelerators, supporting both shared and distributed-memory environments. To solve the resilient challenges on large scale systems, fault tolerant mechanisms are designed for a task-based runtime to protect applications against both soft and hard errors. For soft errors, three additions to a task-based runtime are explored. The first recovers the application by re-executing minimum number of tasks, the second logs intermediary data between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re- execution. For hard errors, we propose two generic approaches, which augment the data logging mechanism for soft errors. The first utilizes non-volatile storage device to save logged data, while the second saves local logged data on a remote node to protect against node failure. Experimental results have confirmed that our soft and hard error fault tolerant mechanisms exhibit the expected correctness and efficiency

    Next Generation Cloud Computing: New Trends and Research Directions

    Get PDF
    The landscape of cloud computing has significantly changed over the last decade. Not only have more providers and service offerings crowded the space, but also cloud infrastructure that was traditionally limited to single provider data centers is now evolving. In this paper, we firstly discuss the changing cloud infrastructure and consider the use of infrastructure from multiple providers and the benefit of decentralising computing away from data centers. These trends have resulted in the need for a variety of new computing architectures that will be offered by future cloud infrastructure. These architectures are anticipated to impact areas, such as connecting people and devices, data-intensive computing, the service space and self-learning systems. Finally, we lay out a roadmap of challenges that will need to be addressed for realising the potential of next generation cloud systems.Comment: Accepted to Future Generation Computer Systems, 07 September 201

    GPU ์—๋Ÿฌ ์•ˆ์ •์„ฑ ๋ณด์žฅ์„ ์œ„ํ•œ ์ปดํŒŒ์ผ๋Ÿฌ ๊ธฐ๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2020. 8. ์ด์žฌ์ง„.Due to semiconductor technology scaling and near-threshold voltage computing, soft error resilience has become more important. Nowadays, GPUs are widely used in high performance computing (HPC) because of its efficient parallel processing and modern GPUs designed for HPC use error correction code (ECC) to protect their storage including register files. However, adopting ECC in the register file imposes high area and energy overhead. To replace the expensive hardware cost of ECC, we propose Penny, a lightweight compiler-directed resilience scheme for GPU register file protection. We combine recent advances in idempotent recovery with low-cost error detection code. Our approach focuses on solving two important problems: 1. Can we guarantee correct error recovery using idempotent execution with error detection code? We show that when an error detection code is used with idempotence recovery, certain restrictions required by previous idempotent recovery schemes are no longer needed. We also propose a software-based scheme to prevent the checkpoint value from being overwritten before the end of the region where the value is required for correct recovery. 2. How do we reduce the execution overhead caused by checkpointing? In GPUs additional checkpointing store instructions inflicts considerably higher overhead compared to CPUs, due to its architectural characteristics, such as lack of store buffers. We propose a number of compiler optimizations techniques that significantly reduce the overhead.๋ฐ˜๋„์ฒด ๋ฏธ์„ธ๊ณต์ • ๊ธฐ์ˆ ์ด ๋ฐœ์ „ํ•˜๊ณ  ๋ฌธํ„ฑ์ „์•• ๊ทผ์ฒ˜ ์ปดํ“จํŒ…(near-threashold voltage computing)์ด ๋„์ž…๋จ์— ๋”ฐ๋ผ์„œ ์†Œํ”„ํŠธ ์—๋Ÿฌ๋กœ๋ถ€ํ„ฐ์˜ ๋ณต์›์ด ์ค‘์š”ํ•œ ๊ณผ์ œ๊ฐ€ ๋˜์—ˆ๋‹ค. ๊ฐ•๋ ฅํ•œ ๋ณ‘๋ ฌ ๊ณ„์‚ฐ ์„ฑ๋Šฅ์„ ์ง€๋‹Œ GPU๋Š” ๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ…์—์„œ ์ค‘์š”ํ•œ ์œ„์น˜๋ฅผ ์ฐจ์ง€ํ•˜๊ฒŒ ๋˜์—ˆ๊ณ , ์Šˆํผ ์ปดํ“จํ„ฐ์—์„œ ์“ฐ์ด๋Š” GPU๋“ค์€ ์—๋Ÿฌ ๋ณต์› ์ฝ”๋“œ์ธ ECC๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ ๋ฐ ๋ฉ”๋ชจ๋ฆฌ ๋“ฑ์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณดํ˜ธํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ์— ECC๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ํฐ ํ•˜๋“œ์›จ์–ด๋‚˜ ์—๋„ˆ์ง€ ๋น„์šฉ์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ์ด๋Ÿฐ ๊ฐ’๋น„์‹ผ ECC์˜ ํ•˜๋“œ์›จ์–ด ๋น„์šฉ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ๊ธฐ๋ฐ˜์˜ ์ €๋น„์šฉ GPU ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ ๋ณต์› ๊ธฐ๋ฒ•์ธ Penny๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด๋Š” ์ตœ์‹ ์˜ ๋ฉฑ๋“ฑ์„ฑ(idempotency) ๊ธฐ๋ฐ˜ ์—๋Ÿฌ ๋ณต์› ๊ธฐ๋ฒ•์„ ์ €๋น„์šฉ์˜ ์—๋Ÿฌ ๊ฒ€์ถœ ์ฝ”๋“œ(EDC)์™€ ๊ฒฐํ•ฉํ•œ ๊ฒƒ์ด๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋‹ค์Œ ๋‘๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐ์— ์ง‘์ค‘ํ•œ๋‹ค. 1. ์—๋Ÿฌ ๊ฒ€์ถœ ์ฝ”๋“œ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฉฑ๋“ฑ์„ฑ ๊ธฐ๋ฐ˜ ์—๋Ÿฌ ๋ณต์›์„ ์‚ฌ์šฉ์‹œ ์†Œํ”„ํŠธ ์—๋Ÿฌ๋กœ๋ถ€ํ„ฐ์˜ ์•ˆ์ „ํ•œ ๋ณต์›์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?} ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์—๋Ÿฌ ๊ฒ€์ถœ ์ฝ”๋“œ๊ฐ€ ๋ฉฑ๋“ฑ์„ฑ ๊ธฐ๋ฐ˜ ๋ณต์› ๊ธฐ์ˆ ๊ณผ ๊ฐ™์ด ์‚ฌ์šฉ๋˜์—ˆ์„ ๊ฒฝ์šฐ ๊ธฐ์กด์˜ ๋ณต์› ๊ธฐ๋ฒ•์—์„œ ํ•„์š”๋กœ ํ–ˆ๋˜ ์กฐ๊ฑด๋“ค ์—†์ด๋„ ์•ˆ์ „ํ•˜๊ฒŒ ์—๋Ÿฌ๋กœ๋ถ€ํ„ฐ ๋ณต์›ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ธ๋‹ค. 2. ์ฒดํฌํฌ์ธํŒ…์—๋“œ๋Š” ๋น„์šฉ์„ ์–ด๋–ป๊ฒŒ ์ ˆ๊ฐํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?} GPU๋Š” ์Šคํ† ์–ด ๋ฒ„ํผ๊ฐ€ ์—†๋Š” ๋“ฑ ์•„ํ‚คํ…์ณ์ ์ธ ํŠน์„ฑ์œผ๋กœ ์ธํ•ด์„œ CPU์™€ ๋น„๊ตํ•˜์—ฌ ์ฒดํฌํฌ์ธํŠธ ๊ฐ’์„ ์ €์žฅํ•˜๋Š” ๋ฐ์— ํฐ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋“ ๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ํ†ตํ•˜์—ฌ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ค„์ธ๋‹ค.1 Introduction 1 1.1 Why is Soft Error Resilience Important in GPUs 1 1.2 How can the ECC Overhead be Reduced 3 1.3 What are the Challenges 4 1.4 How do We Solve the Challenges 5 2 Comparison of Error Detection and Correction Coding Schemes for Register File Protection 7 2.1 Error Correction Codes and Error Detection Codes 8 2.2 Cost of Coding Schemes 9 2.3 Soft Error Frequency of GPUs 11 3 Idempotent Recovery and Challenges 13 3.1 Idempotent Execution 13 3.2 Previous Idempotent Schemes 13 3.2.1 De Kruijf's Idempotent Translation 14 3.2.2 Bolts's Idempotent Recovery 15 3.2.3 Comparison between Idempotent Schemes 15 3.3 Idempotent Recovery Process 17 3.4 Idempotent Recovery Challenges for GPUs 18 3.4.1 Checkpoint Overwriting 20 3.4.2 Performance Overhead 20 4 Correctness of Recovery 22 4.1 Proof of Safe Recovery 23 4.1.1 Prevention of Error Propagation 23 4.1.2 Proof of Correct State Recovery 24 4.1.3 Correctness in Multi-Threaded Execution 28 4.2 Preventing Checkpoint Overwriting 30 4.2.1 Register renaming 31 4.2.2 Storage Alternation by Checkpoint Coloring 33 4.2.3 Automatic Algorithm Selection 38 4.2.4 Future Works 38 5 Performance Optimizations 40 5.1 Compilation Phases of Penny 40 5.1.1 Region Formation 41 5.1.2 Bimodal Checkpoint Placement 41 5.1.3 Storage Alternation 42 5.1.4 Checkpoint Pruning 43 5.1.5 Storage Assignment 44 5.1.6 Code Generation and Low-level Optimizations 45 5.2 Cost Estimation Model 45 5.3 Region Formation 46 5.3.1 De Kruijf's Heuristic Region Formation 46 5.3.2 Region splitting and Region Stitching 47 5.3.3 Checkpoint-Cost Aware Optimal Region Formation 48 5.4 Bimodal Checkpoint Placement 52 5.5 Optimal Checkpoint Pruning 55 5.5.1 Bolt's Naive Pruning Algorithm and Overview of Penny's Optimal Pruning Algorithm 55 5.5.2 Phase 1: Collecting Global-Decision Independent Status 56 5.5.3 Phase2: Ordering and Finalizing Renaming Decisions 60 5.5.4 Effectiveness of Eliminating the Checkpoints 63 5.6 Automatic Checkpoint Storage Assignment 69 5.7 Low-Level Optimizations and Code Generation 70 6 Evaluation 74 6.1 Test Environment 74 6.1.1 GPU Architecture and Simulation Setup 74 6.1.2 Tested Applications 75 6.1.3 Register Assignment 76 6.2 Performance Evaluation 77 6.2.1 Overall Performance Overheads 77 6.2.2 Impact of Penny's Optimizations 78 6.2.3 Assigning Checkpoint Storage and Its Integrity 79 6.2.4 Impact of Optimal Checkpoint Pruning 80 6.2.5 Impact of Alias Analysis 81 6.3 Repurposing the Saved ECC Area 82 6.4 Energy Impact on Execution 83 6.5 Performance Overhead on Volta Architecture 85 6.6 Compilation Time 85 7 RelatedWorks 87 8 Conclusion and Future Works 89 8.1 Limitation and Future Work 90Docto

    High-Integrity GPU Designs for Critical Real-Time Automotive Systems

    Get PDF
    Autonomous Driving (AD) imposes the use of highperformance hardware, such as GPUs, to perform object recognition and tracking in real-time. However, differently to the consumer electronics market, critical real-time AD functionalities require a high degree of resilience against faults, in line with the automotive ISO26262 functional safety standard requirements. ISO26262 imposes the use of some source of independent redundancy for the most critical functionalities so that a single fault cannot lead to a failure, being dual core lockstep (DCLS) with diversity the preferred choice for computing devices. Unfortunately, GPUs do not support diverse DCLS by construction, thus failing to meet ISO26262 requirements efficiently. In this paper we propose lightweight modifications to GPUs to enable diverse DCLS for critical real-time applications without diminishing their performance for non-critical applications. In particular, we show how enabling specific mechanisms for software-controlled kernel scheduling in the GPU, allows guaranteeing that redundant kernels can be executed in different resources so that a single fault cannot lead to a failure, as imposed by ISO26262. Our results on a GPU simulator and an NVIDIA GPU prove the viability of the approach and its effectiveness on high-performance GPU designs needed for AD systems.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Carles Hernandez is jointly funded by the MINECO and FEDER funds through grant TIN2014-60404-JIN.SรญPostprint (author's final draft

    Baybayin Character Instance Detection

    Full text link
    The Philippine Government recently passed the "National Writing System Act," which promotes using Baybayin in Philippine texts. In support of this effort to promote the use of Baybayin, we present a computer vision system which can aid individuals who cannot easily read Baybayin script. In this paper, we survey the existing methods of identifying Baybayin scripts using computer vision and machine learning techniques and discuss their capabilities and limitations. Further, we propose a Baybayin Optical Character Instance Segmentation and Classification model using state-of-the-art Convolutional Neural Networks (CNNs) that detect Baybayin character instances in an image then outputs the Latin alphabet counterparts of each character instance in the image. Most existing systems are limited to character-level image classification and often misclassify or not natively support characters with diacritics. In addition, these existing models often have specific input requirements that limit it to classifying Baybayin text in a controlled setting, such as limitations in clarity and contrast, among others. To our knowledge, our proposed method is the first end-to-end character instance detection model for Baybayin, achieving a mAP50 score of 93.30%, mAP50-95 score of 80.50%, and F1-Score of 84.84%
    • โ€ฆ
    corecore