
    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, over the last 15 years the semiconductor field has established power as a first-class design concern. As a result, the computing systems community is forced to find alternative design approaches that facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at the software, hardware, and architectural levels. Over the last decade, a plethora of approximation techniques has emerged in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing: it reviews its motivation, terminology, and principles, and it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.
    Comment: Under review at ACM Computing Surveys.
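    To make the software side of such surveys concrete, below is a minimal C sketch of loop perforation, one classic software approximation technique of the kind they classify; the function names and numbers are illustrative and not taken from the survey itself.

        #include <stdio.h>

        /* Exact reference: sum every element. */
        double sum_exact(const double *x, int n) {
            double s = 0.0;
            for (int i = 0; i < n; i++)
                s += x[i];
            return s;
        }

        /* Loop perforation: visit only every `stride`-th element and
         * extrapolate, trading accuracy for ~stride-fold fewer iterations. */
        double sum_perforated(const double *x, int n, int stride) {
            double s = 0.0;
            int visited = 0;
            for (int i = 0; i < n; i += stride) {
                s += x[i];
                visited++;
            }
            return s * (double)n / (double)visited;
        }

        int main(void) {
            double x[1000];
            for (int i = 0; i < 1000; i++)
                x[i] = 1.0 + 0.1 * (i % 7);
            printf("exact: %.2f  perforated: %.2f\n",
                   sum_exact(x, 1000), sum_perforated(x, 1000, 4));
            return 0;
        }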

    A Low-Cost, Low-Power Scientific Computing Cluster

    This paper presents the construction of a cluster based on low-end FPGAs, capable of executing very data-intensive programs in the same or less time than a workstation with 56 logical cores that has a much higher cost and power consumption. A custom image based on Debian 8 was generated, and the software necessary to execute codes written in OpenCL and compiled with the Intel Software Development Kit for FPGAs was installed on it. In addition, execution times, cost, and energy consumption were compared, showing that the cluster is almost 6 times cheaper and achieves energy reductions of 83%.
    Universidad de Granada: Departamento de Arquitectura y Tecnología de Computadores
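    As an illustration of the kind of OpenCL code such a cluster executes, here is a minimal vector-addition kernel in OpenCL C; it is a generic sketch, not code from the paper, and with Intel's FPGA toolchain a kernel like this would be compiled offline to a bitstream rather than at runtime.

        /* vadd.cl: one work-item per element; get_global_id() indexes
         * the NDRange. */
        __kernel void vadd(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           const unsigned int n) {
            size_t i = get_global_id(0);
            if (i < n)                  /* guard the tail of the range */
                c[i] = a[i] + b[i];
        }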

    Neural Network Accelerator Design for System on Chip

    ML is vastly utilized in a variety of applications such as voice recognition, computer vision, image classification, object detection, and plenty of other use cases. Protecting data privacy, keeping latency low, and saving network bandwidth all motivate processing data locally without the need to transfer it to the cloud; this approach is called edge computing. It is challenging to design a deep learning accelerator suitable for edge devices. Two main factors dominate the chip design. On-chip memory is the first and the most power- and area-consuming unit. The second is the multipliers. In this thesis, we focus on the latter. Most machine learning algorithms use convolution, which is calculated by multiplying and accumulating input feature maps and weights. Most deep learning accelerators use a precision-scalable Multiply and Accumulate (MAC) architecture and an array of MAC units. Most of the chip's area is taken up by the array of MAC units, especially the multipliers, which also consume a lot of power. This master's thesis consists of two parts. First, a new deep learning accelerator architecture is proposed. Second, different multiplier algorithms are explored. These algorithms were implemented in the SystemVerilog language and synthesized via Cadence tools. The aim was to find a multiplier with a smaller area, lower power consumption, and higher performance. In this work, the Braun multiplier, the Booth multiplier, the Baugh-Wooley array multiplier, the Wallace multiplier, the parallel-prefix Vedic multiplier, and the Modified-Booth multiplier are implemented. The power consumption, chip area, and performance of the multipliers at different clock frequencies are measured and compared to select the optimal multiplier. A precision-flexibility feature is then added to the selected multiplier to perform one 8-bit × 8-bit, two 4-bit × 4-bit, or four 2-bit × 2-bit multiplications. It is worth mentioning that both the data (multiplicand) and the weight (multiplier) can have different bit widths, such as 1, 2, 4, or 8. In the proposed deep learning accelerator, the area and power of the systolic array are measured and reported. Among all the candidates, the signed flexible Modified-Booth multiplier, which can handle 2-, 4-, and 8-bit operands, is selected. It occupies 866.461 µm² and consumes 0.653 mW at 1 GHz. The area and power of the bit-precision-flexible systolic array are 283,564 µm² and 223.156 mW at 1 GHz, respectively.
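    The precision-flexible mode described above can be pictured with a small behavioural model: one 8-bit operand pair is split into independent sub-word lanes, yielding one 8×8, two 4×4, or four 2×2 products. This C sketch mirrors only the functional split, not the thesis's gate-level Modified-Booth datapath, and all names are illustrative.

        #include <stdint.h>
        #include <stdio.h>

        /* One 8x8, two 4x4, or four 2x2 unsigned products from the same
         * operand pair. Real hardware shares one partial-product array;
         * this model only reproduces the lane-wise arithmetic. */
        void scalable_mul(uint8_t a, uint8_t b, int subwidth, uint16_t out[4]) {
            int lanes = 8 / subwidth;                    /* 1, 2, or 4 */
            uint8_t mask = (uint8_t)((1u << subwidth) - 1u);
            for (int l = 0; l < lanes; l++) {
                uint8_t ai = (uint8_t)(a >> (l * subwidth)) & mask;
                uint8_t bi = (uint8_t)(b >> (l * subwidth)) & mask;
                out[l] = (uint16_t)(ai * bi);            /* per-lane product */
            }
        }

        int main(void) {
            uint16_t p[4];
            scalable_mul(0xB7, 0x5D, 2, p);              /* four 2x2 products */
            for (int l = 0; l < 4; l++)
                printf("lane %d: %u\n", l, p[l]);
            return 0;
        }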

    Rethinking FPGA Architectures for Deep Neural Network applications

    The prominence of machine learning-powered solutions has instituted an unprecedented trend of integration into virtually all applications, with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the advantages of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where latency and energy consumption are strictly limited, their architectures, optimised before the machine learning era, remain a barrier to overall efficiency and performance. Recognizing this shortcoming, this thesis presents an architectural study aimed at solutions that unlock hidden potential in FPGA technology, primarily for machine learning algorithms. In particular, it shows how slight alterations to state-of-the-art architectures could make FPGAs significantly more machine learning-friendly while maintaining close to the promised performance for the rest of the applications. Finally, it presents a novel systematic approach to deriving new block architectures, guided by design limitations and machine learning algorithm characteristics through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to the FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Finally, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested, together with a family of new embedded blocks, called MLBlocks.
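    The "nested loops over MAC operations" at the heart of that problem formulation are the familiar convolution loop nest; a minimal C version is shown below, with illustrative shapes that are not taken from the thesis.

        /* 2-D convolution as a MAC loop nest: every innermost iteration
         * is one multiply-accumulate. Shapes are illustrative. */
        #define OC 8   /* output channels */
        #define IC 4   /* input channels  */
        #define OH 16  /* output height   */
        #define OW 16  /* output width    */
        #define K  3   /* kernel size     */

        void conv2d(const float in[IC][OH + K - 1][OW + K - 1],
                    const float w[OC][IC][K][K],
                    float out[OC][OH][OW]) {
            for (int oc = 0; oc < OC; oc++)
                for (int oh = 0; oh < OH; oh++)
                    for (int ow = 0; ow < OW; ow++) {
                        float acc = 0.0f;
                        for (int ic = 0; ic < IC; ic++)
                            for (int kh = 0; kh < K; kh++)
                                for (int kw = 0; kw < K; kw++)
                                    acc += in[ic][oh + kh][ow + kw]
                                         * w[oc][ic][kh][kw];   /* one MAC */
                        out[oc][oh][ow] = acc;
                    }
        }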

    Proceedings of the 21st Conference on Formal Methods in Computer-Aided Design – FMCAD 2021

    The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum for researchers in academia and industry to present and discuss groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design including verification, specification, synthesis, and testing.

    Energy efficient enabling technologies for semantic video processing on mobile devices

    Semantic object-based processing will play an increasingly important role in future multimedia systems due to the ubiquity of digital multimedia capture/playback technologies and increasing storage capacity. Although the object-based paradigm has many undeniable benefits, numerous technical challenges remain before such applications become pervasive, particularly on computationally constrained mobile devices. A fundamental issue is the ill-posed problem of semantic object segmentation. Furthermore, on battery-powered mobile computing devices, the additional algorithmic complexity of semantic object-based processing compared to conventional video processing is highly undesirable from both a real-time operation and a battery life perspective. This thesis tackles these issues by firstly constraining the solution space and focusing on the human face as a primary semantic concept of use to users of mobile devices. A novel face detection algorithm is proposed, which from the outset was designed to be amenable to offloading from the host microprocessor to dedicated hardware, thereby providing real-time performance and reducing power consumption. The algorithm uses an Artificial Neural Network (ANN), whose topology and weights are evolved via a genetic algorithm (GA). The computational burden of the ANN evaluation is offloaded to a dedicated hardware accelerator, which is capable of processing any evolved network topology. Efficient arithmetic circuitry, which leverages modified Booth recoding, column compressors, and carry-save adders, is adopted throughout the design. To tackle the increased computational cost associated with object tracking and object-based shape encoding, a novel energy-efficient binary motion estimation architecture is proposed. Energy is reduced in the proposed motion estimation architecture by minimising the redundant operations inherent in the binary data. Both architectures are shown to compare favourably with the relevant prior art.
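    The energy saving available in binary motion estimation comes from the fact that, with 1-bit pixels, block matching reduces to XOR plus a population count. The following C sketch shows that reduction for a 16×16 binary block; it is a behavioural illustration under that assumption, not the thesis's architecture.

        #include <stdint.h>

        /* Count set bits by repeatedly clearing the lowest one. */
        static inline int popcount32(uint32_t v) {
            int c = 0;
            while (v) { v &= v - 1; c++; }
            return c;
        }

        /* SAD between two 16x16 binary blocks, one packed row per word;
         * each row occupies the low 16 bits. With 1-bit pixels,
         * |cur - ref| per pixel is simply cur XOR ref. */
        int binary_sad_16x16(const uint32_t cur[16], const uint32_t ref[16]) {
            int sad = 0;
            for (int r = 0; r < 16; r++)
                sad += popcount32((cur[r] ^ ref[r]) & 0xFFFFu);
            return sad;
        }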

    Tuning the Computational Effort: An Adaptive Accuracy-aware Approach Across System Layers

    This thesis introduces a novel methodology for realizing accuracy-aware systems, which will help designers integrate accuracy awareness into their systems. It proposes an adaptive accuracy-aware approach that addresses current challenges in the domain by combining and tuning accuracy-aware methods across system layers. To widen the scope of accuracy-aware computing, including approximate computing, to other domains, this thesis presents innovative accuracy-aware methods and techniques for different system layers. The required tuning of these methods is integrated into a configuration layer that adjusts the available knobs of the accuracy-aware methods integrated into a system.
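    As a toy illustration of such a configuration layer, the C sketch below searches the knob of a single accuracy-aware method for the most aggressive setting that still meets a quality target; quality_of() is a hypothetical application-supplied hook, and the thesis's actual tuning machinery is more elaborate.

        /* Hypothetical hook: run the application at a given approximation
         * level and return a quality score in [0, 1]. */
        typedef double (*quality_fn)(int knob);

        /* Assuming quality decreases monotonically as the knob rises,
         * descend from the most aggressive setting and return the first
         * knob whose quality meets the target; 0 means exact execution. */
        int tune_knob(int knob_max, double target, quality_fn quality_of) {
            for (int knob = knob_max; knob > 0; knob--) {
                if (quality_of(knob) >= target)
                    return knob;
            }
            return 0; /* fall back to exact execution */
        }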

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010, Karlsruhe, Germany. (KIT Scientific Reports ; 7551)

    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state-of-the-art research around SoC-related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion-transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference
