74 research outputs found

    Design of Efficient DNN Accelerator Architectures

    Get PDF
    Deep Neural Networks (DNNs) are the fundamental processing unit behind modern Artificial Intelligence (AI). Accordingly, expecting a future with smart devices that are able to monitor, decide, and take action seems reasonable. However, DNNs are computation and power-hungry, which makes deployment of them into edge devices challenging. The focus of this dissertation is on designing architectures to perform the inference of DNNs efficiently. The contents of this dissertation can be divided into four specific areas: (1) early detection of the ineffectual computations inside the computation engine; (2) enhancing the utilization of Processing Elements (PEs) inside the computation engine; (3) skipping identical effectual computations through binary Multiply and Accumulation (MAC) operations; (4) the design of approximate DNN accelerators. In most DNNs, an activation function follows a convolutional or a fully connected layer. Several popular activation functions involve setting all negative inputs to zero. In this dissertation, firstly, the characteristics of activation layers that are considered for adding non-linearity to DNNs are studied. Then, a novel architecture in which the activation function is merged with the prior computational layer is proposed. To add more detail, the proposed architecture coordinates early sign detection of output features. When compared to the original design, our method achieves a speedup of ×2.19 and reduces energy consumption by ×1.94. The average reduction in the number of multiply-accumulate~(MAC) operations is 10.64% and the average reduction in the number of load operations is 3.86%. These improvements are achieved while maintaining classification accuracy in two popular benchmark networks. One of the main challenges that DNN accelerator developers face is keeping all the PEs busy with performing effectual computations while running DNNs. In this dissertation, a Twin-PE for spatial DNN accelerators is introduced that increases the utilization of the PEs and the performance of the whole computation engine. In more detail, the proposed architecture which comes with a negligible area overhead is implemented based on sharing the scratchpads between the PEs to use the available slack time caused by applying computation-pruning techniques. When compared to the reference design, our proposed method achieves a speedup of ×1.24 and an energy efficiency of ×1.18 per inference. Decomposing the MAC operations down to bit-level provides the chance of skipping bit-wise and word-wise sparsity. However, there is still room for pruning the effectual computations without reducing the accuracy of DNNs. In this dissertation, a novel real-time architecture by decomposing multiplications down to bit level and pruning identical computations while running benchmark networks. Our proposed design achieves an average per layer speedup of ×1.4 and energy efficiency of ×1.21 per inference while maintaining the accuracy of benchmark networks. Applying approximate computing techniques reduces the cost of the underlying circuits so that DNN inference would be performed more efficiently. However, applying approximation to DNNs is somehow different from other applications. In this dissertation, a step-wise approach for implementing a re-configurable Booth multiplier suitable for inference of DNNs is proposed. In addition, the tolerance of different layers of DNNs to approximation is evaluated and the effect of applying various degrees of approximation on inference accuracy is explored. The proposed design achieves an area efficiency of ×1.19 and energy efficiency of ×1.28 compared to the exact design while running benchmark DNNs

    Rethinking FPGA Architectures for Deep Neural Network applications

    Get PDF
    The prominence of machine learning-powered solutions instituted an unprecedented trend of integration into virtually all applications with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the edges of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where the latency and energy consumption are strictly limited, their pre-machine learning optimised architectures remain a barrier to the overall efficiency and performance. Realizing this shortcoming, this thesis demonstrates an architectural study aiming at solutions that enable hidden potentials in the FPGA technology, primarily for machine learning algorithms. Particularly, it shows how slight alterations to the state-of-the-art architectures could significantly enhance the FPGAs toward becoming more machine learning-friendly while maintaining the near-promised performance for the rest of the applications. Eventually, it presents a novel systematic approach to deriving new block architectures guided by designing limitations and machine learning algorithm characteristics through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Eventually, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested together with a family of new embedded blocks, called MLBlocks

    Energy-precision tradeoffs in the graphics pipeline

    Get PDF
    The energy consumption of a graphics processing unit (GPU) is an important factor in its design, whether for a server, desktop, or mobile device. Mobile products, such as smart phones, tablets, and laptop computers, rely on batteries to function; the less the demand for power is on these batteries, the longer they will last before needing to be recharged. GPUs used in servers and desktops, while not dependent on a battery for operation, are still limited by the efficiency of power supplies and heat dissipation techniques. In this dissertation, I propose to lower the energy consumption of GPUs by reducing the precision of floating-point arithmetic in the graphics pipeline and the data sent and stored on- and off-chip. The key idea behind this work is twofold: energy can be saved through a systematic and targeted reduction in the number of bits 1) computed and 2) communicated. Reducing the number of bits computed will necessarily reduce either the precision or range of a floating point number. I focus on saving energy by way of reducing precision, which can exploit the over-provisioning of bits in many stages of the graphics pipeline. Reducing the number of bits communicated takes several forms. First, I propose enhancements to existing compression schemes for off-chip buffers to save bandwidth. I also suggest a simple extension that exploits unused bits in reduced-precision data undergoing compression. Finally, I present techniques for saving energy in on-chip communication of reduced-precision data. By designing and simulating variable-precision arithmetic circuits with promising energy versus precision characteristics and tradeoffs, I have developed an energy model for GPUs. Using this model and my techniques, I have shown that significant savings (up to 70% in computation in the vertex and pixel shader stages) are possible by reducing the precision of the arithmetic. Further, my compression approaches have enabled improvements of 1.26x over past work, and a general-purpose compressor design has achieved bandwidth savings of 34%, 87%, and 65% for color, depth, and geometry data, respectively, which is competitive with past work. Lastly, an initial exploration in signal gating unused lines in on-chip buses has suggested savings of 13-48% for the tested applications' traffic from a multiprocessor's register file to its L1 cache

    Efficient compression of motion compensated residuals

    Get PDF
    EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Advanced space communications architecture study. Volume 2: Technical report

    Get PDF
    The technical feasibility and economic viability of satellite system architectures that are suitable for customer premise service (CPS) communications are investigated. System evaluation is performed at 30/20 GHz (Ka-band); however, the system architectures examined are equally applicable to 14/11 GHz (Ku-band). Emphasis is placed on systems that permit low-cost user terminals. Frequency division multiple access (FDMA) is used on the uplink, with typically 10,000 simultaneous accesses per satellite, each of 64 kbps. Bulk demodulators onboard the satellite, in combination with a baseband multiplexer, convert the many narrowband uplink signals into a small number of wideband data streams for downlink transmission. Single-hop network interconnectivity is accomplished via downlink scanning beams. Each satellite is estimated to weigh 5600 lb and consume 6850W of power; the corresponding payload totals are 1000 lb and 5000 W. Nonrecurring satellite cost is estimated at 110million,withthefirstunitcostat110 million, with the first-unit cost at 113 million. In large quantities, the user terminal cost estimate is $25,000. For an assumed traffic profile, the required system revenue has been computed as a function of the internal rate of return (IRR) on invested capital. The equivalent user charge per-minute of 64-kbps channel service has also been determined

    The 4th Conference of PhD Students in Computer Science

    Get PDF

    Acta Cybernetica : Volume 21. Number 1.

    Get PDF

    Numerical solutions of differential equations on FPGA-enhanced computers

    Get PDF
    Conventionally, to speed up scientific or engineering (S&E) computation programs on general-purpose computers, one may elect to use faster CPUs, more memory, systems with more efficient (though complicated) architecture, better software compilers, or even coding with assembly languages. With the emergence of Field Programmable Gate Array (FPGA) based Reconfigurable Computing (RC) technology, numerical scientists and engineers now have another option using FPGA devices as core components to address their computational problems. The hardware-programmable, low-cost, but powerful “FPGA-enhanced computer” has now become an attractive approach for many S&E applications. A new computer architecture model for FPGA-enhanced computer systems and its detailed hardware implementation are proposed for accelerating the solutions of computationally demanding and data intensive numerical PDE problems. New FPGAoptimized algorithms/methods for rapid executions of representative numerical methods such as Finite Difference Methods (FDM) and Finite Element Methods (FEM) are designed, analyzed, and implemented on it. Linear wave equations based on seismic data processing applications are adopted as the targeting PDE problems to demonstrate the effectiveness of this new computer model. Their sustained computational performances are compared with pure software programs operating on commodity CPUbased general-purpose computers. Quantitative analysis is performed from a hierarchical set of aspects as customized/extraordinary computer arithmetic or function units, compact but flexible system architecture and memory hierarchy, and hardwareoptimized numerical algorithms or methods that may be inappropriate for conventional general-purpose computers. The preferable property of in-system hardware reconfigurability of the new system is emphasized aiming at effectively accelerating the execution of complex multi-stage numerical applications. Methodologies for accelerating the targeting PDE problems as well as other numerical PDE problems, such as heat equations and Laplace equations utilizing programmable hardware resources are concluded, which imply the broad usage of the proposed FPGA-enhanced computers

    Energy efficient hardware acceleration of multimedia processing tools

    Get PDF
    The world of mobile devices is experiencing an ongoing trend of feature enhancement and generalpurpose multimedia platform convergence. This trend poses many grand challenges, the most pressing being their limited battery life as a consequence of delivering computationally demanding features. The envisaged mobile application features can be considered to be accelerated by a set of underpinning hardware blocks Based on the survey that this thesis presents on modem video compression standards and their associated enabling technologies, it is concluded that tight energy and throughput constraints can still be effectively tackled at algorithmic level in order to design re-usable optimised hardware acceleration cores. To prove these conclusions, the work m this thesis is focused on two of the basic enabling technologies that support mobile video applications, namely the Shape Adaptive Discrete Cosine Transform (SA-DCT) and its inverse, the SA-IDCT. The hardware architectures presented in this work have been designed with energy efficiency in mind. This goal is achieved by employing high level techniques such as redundant computation elimination, parallelism and low switching computation structures. Both architectures compare favourably against the relevant pnor art in the literature. The SA-DCT/IDCT technologies are instances of a more general computation - namely, both are Constant Matrix Multiplication (CMM) operations. Thus, this thesis also proposes an algorithm for the efficient hardware design of any general CMM-based enabling technology. The proposed algorithm leverages the effective solution search capability of genetic programming. A bonus feature of the proposed modelling approach is that it is further amenable to hardware acceleration. Another bonus feature is an early exit mechanism that achieves large search space reductions .Results show an improvement on state of the art algorithms with future potential for even greater savings
    corecore