318 research outputs found

    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.Comment: To appear at the DSN 2020 conferenc

    An experimental study of reduced-voltage operation in modern FPGAs for neural network acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect ofenvironmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W ) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.The work done for this paper was partially supported by a HiPEAC Collaboration Grant funded by the H2020 HiPEAC Project under grant agreement No. 779656. The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the LEGaTO Project (www.legato-project.eu), grant agreement No. 780681.Peer ReviewedPostprint (author's final draft

    Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs

    Get PDF
    The most recent generations of graphics processing units (GPUs) boost the execution of convolutional operations required by machine learning applications by resorting to specialized and efficient in-chip accelerators (Tensor Core Units or TCUs) that operate on matrix multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologies are increasingly prone to hardware defects, and the trend to highly stress TCUs during the execution of safety-critical and high-performance computing (HPC) applications increases the likelihood of TCUs producing different kinds of failures. In fact, the intrinsic resiliency to hardware faults of arithmetic units plays a crucial role in safety-critical applications using GPUs (e.g., in automotive, space, and autonomous robotics). Recently, new arithmetic formats have been proposed, particularly those suited to neural network execution. However, the reliability characterization of TCUs supporting different arithmetic formats was still lacking. In this work, we quantitatively assessed the impact of hardware faults in TCU structures while employing two distinct formats (floating-point and posit) and using two different configurations (16 and 32 bits) to represent real numbers. For the experimental evaluation, we resorted to an architectural description of a TCU core (PyOpenTCU) and performed 120 fault simulation campaigns, injecting around 200,000 faults per campaign and requiring around 32 days of computation. Our results demonstrate that the posit format of TCUs is less affected by faults than the floating-point one (by up to three orders of magnitude for 16 bits and up to twenty orders for 32 bits). We also identified the most sensible fault locations (i.e., those that produce the largest errors), thus paving the way to adopting smart hardening solutions

    Next Generation Cloud Computing: New Trends and Research Directions

    Get PDF
    The landscape of cloud computing has significantly changed over the last decade. Not only have more providers and service offerings crowded the space, but also cloud infrastructure that was traditionally limited to single provider data centers is now evolving. In this paper, we firstly discuss the changing cloud infrastructure and consider the use of infrastructure from multiple providers and the benefit of decentralising computing away from data centers. These trends have resulted in the need for a variety of new computing architectures that will be offered by future cloud infrastructure. These architectures are anticipated to impact areas, such as connecting people and devices, data-intensive computing, the service space and self-learning systems. Finally, we lay out a roadmap of challenges that will need to be addressed for realising the potential of next generation cloud systems.Comment: Accepted to Future Generation Computer Systems, 07 September 201

    A Modern Primer on Processing in Memory

    Full text link
    Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend. This chapter discusses recent research that aims to practically enable computation close to data, an approach we call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated.Comment: arXiv admin note: substantial text overlap with arXiv:1903.0398

    A performance, energy consumption and reliability evaluation of workload distribution on heterogeneous devices

    Get PDF
    The constant need of higher performances and reduced power consumption has lead vendors to design heterogeneous devices that embed traditional Central Process Unit (CPU) and an accelerator, like a Graphics Processing Unit (GPU) or Field-programmable Gate Array (FPGA). When the CPU and the accelerator are used collaboratively the device computational performances reach their peak. However, the higher amount of resources employed for computation has, potentially, the side effect of increasing soft error rate. This thesis evaluates the reliability behaviour of AMD Kaveri Accelerated Processing Units (APU) executing four heterogeneous applications, each one representing an algorithm class. The workload is gradually distributed from the CPU to the GPU and both the energy consumption and execution time are measured. Then, an accelerated neutron beam was used to measure the realistic error rates of the different workload distributions. Finally, we evaluate which configuration provides the lowest error rate or allows the computation of the highest amount of data before experiencing a failure. As is shown in this thesis, energy consumption and execution time are mold by the same trend while error rates highly depend on algorithm class and workload distribution. Additionally, we show that, in most cases, the most reliable workload distribution is the one that delivers the highest performances. As experimentally proven, by choosing the correct workload distribution the device reliability can increase of up to 90x.A constante necessidade de maior desempenho e menor consumo de energia levou aos fabricantes a projetar dispositivos heterogêneos que incorporam uma Unidade Central de Processameno (CPU) tradicional e um acelerador, como uma Unidade de Processamento Gráfico (GPU) ou um Arranjo de Portas Programáveis em Campo (FPGA). Quando a CPU e o acelerador são usados de forma colaborativa, o desempenho computacional do dispositivo atinge seu pico. No entanto, a maior quantidade de recursos empregados para o cálculo tem, potencialmente, o efeito colateral de aumentar a taxa de erros. Esta tese avalia a confiabilidade das AMD Kaveri "Accelerated Processing Units"(APUs) executando quatro aplicações heterogêneas, cada uma representando uma classe de algoritmos. A carga de trabalho é gradualmente distribuída da CPU para a GPU e o consumo de energia e o tempo de execução são medidos. Em seguida, um feixe de neutrões é utilizado para medir as taxas de erro reais das diferentes distribuições de carga de trabalho. Por fim, avalia-se qual configuração fornece a menor taxa de erro ou permite o cálculo da maior quantidade de dados antes de ocorrer uma falha. Como é mostrado nesta tese, o consumo de energia e o tempo de execução são moldados pela mesma tendência, enquanto as taxas de erro dependem da classe de algoritmos e da distribuição da carga de trabalho. Além disso, é mostrado que, na maioria dos casos, a distribuição de carga de trabalho mais confiável é a que fornece o maior desempenho. Como comprovado experimentalmente, ao escolher a distribuição de carga de trabalho correta, a confiabilidade do dispositivo pode aumentar até 9 vezes

    Challenges and Opportunities in Near-Threshold DNN Accelerators around Timing Errors

    Get PDF
    AI evolution is accelerating and Deep Neural Network (DNN) inference accelerators are at the forefront of ad hoc architectures that are evolving to support the immense throughput required for AI computation. However, much more energy efficient design paradigms are inevitable to realize the complete potential of AI evolution and curtail energy consumption. The Near-Threshold Computing (NTC) design paradigm can serve as the best candidate for providing the required energy efficiency. However, NTC operation is plagued with ample performance and reliability concerns arising from the timing errors. In this paper, we dive deep into DNN architecture to uncover some unique challenges and opportunities for operation in the NTC paradigm. By performing rigorous simulations in TPU systolic array, we reveal the severity of timing errors and its impact on inference accuracy at NTC. We analyze various attributes—such as data–delay relationship, delay disparity within arithmetic units, utilization pattern, hardware homogeneity, workload characteristics—and uncover unique localized and global techniques to deal with the timing errors in NTC
    • …
    corecore