31 research outputs found

    Domain-aware Genetic Algorithms for Hardware and Mapping Optimization for Efficient DNN Acceleration

    Get PDF
    The proliferation of AI across a variety of domains (vision, language, speech, recommendations, games) has led to the rise of domain-specific accelerators for deep learning. At design-time, these accelerators carefully architect the on-chip dataflow to maximize data reuse (over space and time) and size the hardware resources (PEs and buffers) to maximize performance and energy-efficiency, while meeting the chip’s area and power targets. At compile-time, the target Deep Neural Network (DNN) model is mapped over the accelerator. The mapping refers to tiling the computation and data (i.e., tensors) and scheduling them over the PEs and scratchpad buffers respectively, while honoring the microarchitectural constraints (number of PEs, buffer sizes, and dataflow). The design-space of valid hardware resource assignments for a given dataflow and the valid mappings for a given hardware is extremely large (~O(10^24)) per layer for state-of-the-art DNN models today. This makes exhaustive searches infeasible. Unfortunately, there can be orders of magnitude performance and energy-efficiency differences between an optimal and sub-optimal choice, making these decisions a crucial part of the entire design process. Moreover, manual tuning by domain experts become unprecedentedly challenged due to increased irregularity (due to neural architecture search) and sparsity of DNN models. This necessitate the existence of Map Space Exploration (MSE). In this thesis, our goal is to deliver a deep analysis of the MSE for DNN accelerators, propose different techniques to improve MSE, and generalize the MSE framework to a wider landscape (from mapping to HW-mapping co-exploration, from single-accelerator to multi-accelerator scheduling). As part of it, we discuss the correlation between hardware flexibility and the formed map space, formalized the map space representation by four mapping axes: tile, order, parallelism, and shape. Next, we develop dedicated exploration operators for these axes and use genetic algorithm framework to converge the solution. Next, we develop "sparsity-aware" technique to enable sparsity consideration in MSE and a "warm-start" technique to solve the search speed challenge commonly seen across learning-based search algorithms. Finally, we extend out MSE to support hardware and map space co-exploration and multi-accelerator scheduling.Ph.D

    Self-timed field programmmable gate array architectures

    Get PDF

    Assessing Approximate Arithmetic Designs in the presence of Process Variations and Voltage Scaling

    Get PDF
    As environmental concerns and portability of electronic devices move to the forefront of priorities, innovative approaches which reduce processor energy consumption are sought. Approximate arithmetic units are one of the avenues whereby significant energy savings can be achieved. Approximation of fundamental arithmetic units is achieved by judiciously reducing the number of transistors in the circuit. A satisfactory tradeoff of energy vs. accuracy of the circuit can be determined by trial-and-error methods of each functional approximation. Although the accuracy of the output is compromised, it is only decreased to an acceptable extent that can still fulfill processing requirements. A number of scenarios are evaluated with approximate arithmetic units to thoroughly cross-check them with their accurate counterparts. Some of the attributes evaluated are energy consumption, delay and process variation. Additionally, novel methods to create such approximate units are developed. One such method developed uses a Genetic Algorithm (GA), which mimics the biologically-inspired evolutionary techniques to obtain an optimal solution. A GA employs genetic operators such as crossover and mutation to mix and match several different types of approximate adders to find the best possible combination of such units for a given input set. As the GA usually consumes a significant amount of time as the size of the input set increases, we tackled this problem by using various methods to parallelize the fitness computation process of the GA, which is the most compute intensive task. The parallelization improved the computation time from 2,250 seconds to 1,370 seconds for up to 8 threads, using both OpenMP and Intel TBB. Apart from using the GA with seeded multiple approximate units, other seeds such as basic logic gates with limited logic space were used to develop completely new multi-bit approximate adders with good fitness levels. iii The effect of process variation was also calculated. As the number of transistors is reduced, the distribution of the transistor widths and gate oxide may shift away from a Gaussian Curve. This result was demonstrated in different types of single-bit adders with the delay sigma increasing from 6psec to 12psec, and when the voltage is scaled to Near-Threshold-Voltage (NTV) levels sigma increases by up to 5psec. Approximate Arithmetic Units were not affected greatly by the change in distribution of the thickness of the gate oxide. Even when considering the 3-sigma value, the delay of an approximate adder remains below that of a precise adder with additional transistors. Additionally, it is demonstrated that the GA obtains innovative solutions to the appropriate combination of approximate arithmetic units, to achieve a good balance between energy savings and accuracy

    Dynamic reconfiguration frameworks for high-performance reliable real-time reconfigurable computing

    Get PDF
    The sheer hardware-based computational performance and programming flexibility offered by reconfigurable hardware like Field-Programmable Gate Arrays (FPGAs) make them attractive for computing in applications that require high performance, availability, reliability, real-time processing, and high efficiency. Fueled by fabrication process scaling, modern reconfigurable devices come with ever greater quantities of on-chip resources, allowing a more complex variety of applications to be developed. Thus, the trend is that technology giants like Microsoft, Amazon, and Baidu now embrace reconfigurable computing devices likes FPGAs to meet their critical computing needs. In addition, the capability to autonomously reprogramme these devices in the field is being exploited for reliability in application domains like aerospace, defence, military, and nuclear power stations. In such applications, real-time computing is important and is often a necessity for reliability. As such, applications and algorithms resident on these devices must be implemented with sufficient considerations for real-time processing and reliability. Often, to manage a reconfigurable hardware device as a computing platform for a multiplicity of homogenous and heterogeneous tasks, reconfigurable operating systems (ROSes) have been proposed to give a software look to hardware-based computation. The key requirements of a ROS include partitioning, task scheduling and allocation, task configuration or loading, and inter-task communication and synchronization. Existing ROSes have met these requirements to varied extents. However, they are limited in reliability, especially regarding the flexibility of placing the hardware circuits of tasks on device’s chip area, the problem arising more from the partitioning approaches used. Indeed, this problem is deeply rooted in the static nature of the on-chip inter-communication among tasks, hampering the flexibility of runtime task relocation for reliability. This thesis proposes the enabling frameworks for reliable, available, real-time, efficient, secure, and high-performance reconfigurable computing by providing techniques and mechanisms for reliable runtime reconfiguration, and dynamic inter-circuit communication and synchronization for circuits on reconfigurable hardware. This work provides task configuration infrastructures for reliable reconfigurable computing. Key features, especially reliability-enabling functionalities, which have been given little or no attention in state-of-the-art are implemented. These features include internal register read and write for device diagnosis; configuration operation abort mechanism, and tightly integrated selective-area scanning, which aims to optimize access to the device’s reconfiguration port for both task loading and error mitigation. In addition, this thesis proposes a novel reliability-aware inter-task communication framework that exploits the availability of dedicated clocking infrastructures in a typical FPGA to provide inter-task communication and synchronization. The clock buffers and networks of an FPGA use dedicated routing resources, which are distinct from the general routing resources. As such, deploying these dedicated resources for communication sidesteps the restriction of static routes and allows a better relocation of circuits for reliability purposes. For evaluation, a case study that uses a NASA/JPL spectrometer data processing application is employed to demonstrate the improved reliability brought about by the implemented configuration controller and the reliability-aware dynamic communication infrastructure. It is observed that up to 74% time saving can be achieved for selective-area error mitigation when compared to state-of-the-art vendor implementations. Moreover, an improvement in overall system reliability is observed when the proposed dynamic communication scheme is deployed in the data processing application. Finally, one area of reconfigurable computing that has received insufficient attention is security. Meanwhile, considering the nature of applications which now turn to reconfigurable computing for accelerating compute-intensive processes, a high premium is now placed on security, not only of the device but also of the applications, from loading to runtime execution. To address security concerns, a novel secure and efficient task configuration technique for task relocation is also investigated, providing configuration time savings of up to 32% or 83%, depending on the device; and resource usage savings in excess of 90% compared to state-of-the-art

    Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms

    Get PDF
    Nowadays many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However the adoption of these platforms further complicates application development, whereas it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim to guarantee code quality and reduce software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives overcoming the current state-of-the-art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim to show how to achieve best performance by exploiting the characteristics of a specific application domain. To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level
    corecore