Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey
In the modern era of technology, a paradigm shift has been witnessed in areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). In particular, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications, such as computer vision, image and video processing, and robotics. Given mature digital technologies and the availability of authentic data and data-handling infrastructure, DNNs have become a credible choice for solving complex real-life problems. In certain situations, the performance and accuracy of a DNN far exceed human intelligence. However, DNNs are computationally demanding in terms of both the resources and the time required to handle these computations, and general-purpose architectures such as CPUs struggle with such computationally intensive algorithms. Therefore, the research community has invested considerable interest and effort in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse-Grained Reconfigurable Array (CGRA) for the effective implementation of computationally intensive algorithms. This paper brings forward the various research works carried out on the development and deployment of DNNs using these specialized hardware architectures and embedded AI accelerators. The review gives a detailed description of the specialized hardware-based accelerators used in the training and/or inference of DNNs, and compares the accelerators discussed on factors such as power, area, and throughput. Finally, future research and development directions are discussed, such as trends in DNN implementation on specialized hardware accelerators.
This review article is intended to serve as a guide to hardware architectures for accelerating and improving the effectiveness of deep learning research.
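Comparisons on power, area, and throughput, as in the survey above, typically reduce to derived figures of merit such as throughput per watt and throughput per unit area. The sketch below shows one way such a comparison can be computed; the accelerator names and numbers are purely hypothetical placeholders, not data from the survey:

```python
# Hypothetical figures of merit for comparing DNN accelerators.
# The numbers below are illustrative placeholders, not measured data.

accelerators = {
    # name: (throughput in GOPS, power in W, area in mm^2)
    "GPU-like":  (5000.0, 250.0, 600.0),
    "FPGA-like": (800.0,   25.0, 200.0),
    "ASIC-like": (2000.0,   5.0,  50.0),
}

def figures_of_merit(specs):
    """Return {name: (GOPS/W, GOPS/mm^2)} for each accelerator."""
    return {
        name: (gops / watts, gops / area)
        for name, (gops, watts, area) in specs.items()
    }

for name, (gops_per_w, gops_per_mm2) in figures_of_merit(accelerators).items():
    print(f"{name:10s} {gops_per_w:8.1f} GOPS/W {gops_per_mm2:7.2f} GOPS/mm^2")
```

Such normalized metrics are what allow architectures as different as GPUs, FPGAs, ASICs, and CGRAs to be placed on a common scale.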
Methodology for Structured Data-Path Implementation in VLSI Physical Design: A Case Study
State-of-the-art modern microprocessor and domain-specific accelerator designs are dominated by data-paths composed of regular structures, also known as bit-slices. Random logic placement and routing techniques may not result in an optimal layout for these data-path-dominated designs. As a result, implementation tools such as Cadence's Innovus include a Structured Data-Path (SDP) feature that allows data-path placement to be fully customized by constraining the placement engine. These constraints are provided to the tool through a relative placement file. However, the tool neither extracts nor automatically places the regular data-path structures; in other words, the relative placement file is not generated automatically. In this paper, we propose a semi-automated method for extracting bit-slices for the Innovus SDP flow. The proposed method is demonstrated to yield 17% lower density (utilization) for a pixel buffer design, while the other performance metrics remain unchanged compared with the traditional place-and-route flow.
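The relative placement constraints described above essentially arrange bit-slice cells on a grid. As a simplified illustration (not the paper's actual extraction method, and not Innovus's file syntax), the following sketch groups hypothetical data-path instances into slices by the bit index parsed from their names:

```python
import re

# Hypothetical data-path instance names; in a real flow these would come
# from the design database, not a hard-coded list.
instances = [
    "dp/reg_a[0]", "dp/reg_a[1]", "dp/reg_a[2]",
    "dp/add_s[0]", "dp/add_s[1]", "dp/add_s[2]",
    "ctrl/fsm_state",  # random logic: no bit index, left unconstrained
]

def group_bit_slices(names):
    """Map bit index -> list of cells in that slice.

    Assumes regular instance names ending in '[<bit>]'; cells sharing a
    bit index form one row of the structured data-path placement.
    """
    slices = {}
    for name in names:
        m = re.search(r"\[(\d+)\]$", name)
        if m:  # skip random logic without a bit index
            slices.setdefault(int(m.group(1)), []).append(name)
    return slices

for bit, row in sorted(group_bit_slices(instances).items()):
    print(f"slice {bit}: {row}")
```

The real problem is considerably harder (names are not always this regular, and column ordering matters for routing), which is why the paper's method is semi-automated rather than fully automatic.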
Codegenerierung für eng gekoppelte Prozessorfelder (Code Generation for Tightly Coupled Processor Arrays)
In this dissertation, we consider techniques for automatic code generation and code optimization of loop programs for programmable tightly coupled processor array targets. These consist of interconnected small, light-weight very long instruction word cores, which can exploit both loop-level parallelism and instruction-level parallelism. Such arrays are well suited for executing compute-intensive nested loop applications, often providing higher power and area efficiency than commercial off-the-shelf processors. They are ideal candidates for accelerating the computation of nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important design goals for overall system-on-chip design. In order to harness the full compute potential of such an array, we need efficient compiler techniques that can automatically map nested loop programs onto them. Such a compiler framework is essential for increasing the productivity of designers as well as for shortening development cycles.

In this context, this dissertation proposes a novel code generation and compaction approach which generates the assembly-level code for all the processing elements in an array from a scheduled loop nest. The code generation approach itself is independent of both the array size and the problem size, and preserves the given schedule. As part of this compiler framework, we also present a scalable interconnect generation approach in which the connections among the different processing elements are automatically generated from the same scheduled loop program. Furthermore, we consider the integration of a tightly coupled processor array into a multi-processor system-on-chip: here, we propose the design of new hardware components such as a global controller, which generates control signals to orchestrate (synchronize) the programs running on the different processing elements, and address generators, which are required to generate the address and enable signals for a set of reconfigurable I/O buffers surrounding the processor array. We propose a fully programmable design of these required hardware components and add the required compiler support to generate their configuration data from the same scheduled loop program. In summary, the major contributions of this dissertation enable and ease the fully automated mapping of nested loop programs onto tightly coupled processor arrays.
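The mapping task described above starts from a nested loop whose iteration space must be distributed over the processor array. As a simplified, hypothetical illustration of that partitioning step (not the dissertation's actual scheduling algorithm), the following sketch block-partitions a 2D iteration space across a grid of processing elements:

```python
# Simplified sketch: tile a 2D loop nest's iteration space across a
# processor array. Each processing element (PE) receives one rectangular
# block of iterations; all names here are illustrative assumptions.

def partition_iteration_space(n_i, n_j, pe_rows, pe_cols):
    """Return {(pe_r, pe_c): list of (i, j) iterations assigned to that PE}.

    Uses a block (locally sequential) partitioning: iteration (i, j)
    goes to the PE whose rectangular block contains it.
    """
    tile_i = -(-n_i // pe_rows)  # ceiling division: block height
    tile_j = -(-n_j // pe_cols)  # ceiling division: block width
    mapping = {}
    for i in range(n_i):
        for j in range(n_j):
            pe = (i // tile_i, j // tile_j)
            mapping.setdefault(pe, []).append((i, j))
    return mapping

# A 4x6 loop nest on a 2x3 PE array: each PE executes a 2x2 block.
mapping = partition_iteration_space(4, 6, 2, 3)
for pe in sorted(mapping):
    print(pe, mapping[pe])
```

In the dissertation's setting the input is already a *scheduled* loop program, so in addition to this spatial assignment, each PE's block must be compacted into VLIW instruction words that respect the given schedule; the sketch above only shows the spatial half of the problem.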