89 research outputs found
Effective network grid synthesis and optimization for high performance very large scale integration system design
制度:新 ; 文部省報告番号:甲2642号 ; 学位の種類:博士(工学) ; 授与年月日:2008/3/15 ; 早大学位記番号:新480
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
A recent trend in DNN development is to extend the reach of deep learning
applications to platforms that are more resource and energy constrained, e.g.,
mobile devices. These endeavors aim to reduce the DNN model size and improve
the hardware processing efficiency, and have resulted in DNNs that are much
more compact in their structures and/or have high data sparsity. These compact
or sparse models are different from the traditional large ones in that there is
much more variation in their layer shapes and sizes, and often require
specialized hardware to exploit sparsity for performance improvement. Thus,
many DNN accelerators designed for large DNNs do not perform well on these
models. In this work, we present Eyeriss v2, a DNN accelerator architecture
designed for running compact and sparse DNNs. To deal with the widely varying
layer shapes and sizes, it introduces a highly flexible on-chip network, called
hierarchical mesh, that can adapt to the different amounts of data reuse and
bandwidth requirements of different data types, which improves the utilization
of the computation resources. Furthermore, Eyeriss v2 can process sparse data
directly in the compressed domain for both weights and activations, and
therefore is able to improve both processing speed and energy efficiency with
sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65nm CMOS
process achieves a throughput of 1470.6 inferences/sec and 2560.3 inferences/J
at a batch size of 1, which is 12.6x faster and 2.5x more energy efficient than
the original Eyeriss running MobileNet. We also present an analysis methodology
called Eyexam that provides a systematic way of understanding the performance
limits for DNN processors as a function of specific characteristics of the DNN
model and accelerator design; it applies these characteristics as sequential
steps to increasingly tighten the bound on the performance limits.Comment: accepted for publication in IEEE Journal on Emerging and Selected
Topics in Circuits and Systems. This extended version on arXiv also includes
Eyexam in the appendi
Baseband Processing for 5G and Beyond: Algorithms, VLSI Architectures, and Co-design
In recent years the number of connected devices and the demand for high data-rates have been significantly increased. This enormous growth is more pronounced by the introduction of the Internet of things (IoT) in which several devices are interconnected to exchange data for various applications like smart homes and smart cities. Moreover, new applications such as eHealth, autonomous vehicles, and connected ambulances set new demands on the reliability, latency, and data-rate of wireless communication systems, pushing forward technology developments. Massive multiple-input multiple-output (MIMO) is a technology, which is employed in the 5G standard, offering the benefits to fulfill these requirements. In massive MIMO systems, base station (BS) is equipped with a very large number of antennas, serving several users equipments (UEs) simultaneously in the same time and frequency resource. The high spatial multiplexing in massive MIMO systems, improves the data rate, energy and spectral efficiencies as well as the link reliability of wireless communication systems. The link reliability can be further improved by employing channel coding technique. Spatially coupled serially concatenated codes (SC-SCCs) are promising channel coding schemes, which can meet the high-reliability demands of wireless communication systems beyond 5G (B5G). Given the close-to-capacity error correction performance and the potential to implement a high-throughput decoder, this class of code can be a good candidate for wireless systems B5G. In order to achieve the above-mentioned advantages, sophisticated algorithms are required, which impose challenges on the baseband signal processing. In case of massive MIMO systems, the processing is much more computationally intensive and the size of required memory to store channel data is increased significantly compared to conventional MIMO systems, which are due to the large size of the channel state information (CSI) matrix. In addition to the high computational complexity, meeting latency requirements is also crucial. Similarly, the decoding-performance gain of SC-SCCs also do come at the expense of increased implementation complexity. Moreover, selecting the proper choice of design parameters, decoding algorithm, and architecture will be challenging, since spatial coupling provides new degrees of freedom in code design, and therefore the design space becomes huge. The focus of this thesis is to perform co-optimization in different design levels to address the aforementioned challenges/requirements. To this end, we employ system-level characteristics to develop efficient algorithms and architectures for the following functional blocks of digital baseband processing. First, we present a fast Fourier transform (FFT), an inverse FFT (IFFT), and corresponding reordering scheme, which can significantly reduce the latency of orthogonal frequency-division multiplexing (OFDM) demodulation and modulation as well as the size of reordering memory. The corresponding VLSI architectures along with the application specific integrated circuit (ASIC) implementation results in a 28 nm CMOS technology are introduced. In case of a 2048-point FFT/IFFT, the proposed design leads to 42% reduction in the latency and size of reordering memory. Second, we propose a low-complexity massive MIMO detection scheme. The key idea is to exploit channel sparsity to reduce the size of CSI matrix and eventually perform linear detection followed by a non-linear post-processing in angular domain using the compressed CSI matrix. The VLSI architecture for a massive MIMO with 128 BS antennas and 16 UEs along with the synthesis results in a 28 nm technology are presented. As a result, the proposed scheme reduces the complexity and required memory by 35%–73% compared to traditional detectors while it has better detection performance. Finally, we perform a comprehensive design space exploration for the SC-SCCs to investigate the effect of different design parameters on decoding performance, latency, complexity, and hardware cost. Then, we develop different decoding algorithms for the SC-SCCs and discuss the associated decoding performance and complexity. Also, several high-level VLSI architectures along with the corresponding synthesis results in a 12 nm process are presented, and various design tradeoffs are provided for these decoding schemes
Experimental Benchmarks and Initial Evaluation of the Performance of the PASM System Prototype
The work reported here represents experiences with the PASM parallel processing system prototype during its first operational year. Most of the experiments were performed by students in the Fall semester of 1987. The first programming, and the first timing measurements, were made during the summer of 1987 by Sam Fineberg. The goal of the collection of experiments presented here was to undertake an Application-driven Architecture Study of the PASM system as a paradigm for parallel architecture evaluation in general. PASM was an excellent vehicle for experimenting with this evaluation technique due to its unique architectural features. Among these are: 1. A reconfigurable, partitionable multistage circuit-switched network. 2. Support for both SIMD and MIMD programs. 3. Ability to execute hybrid SIMD/MIMD programs. 4. An instruction queue which allows overlap of control-flow and data manipulation between micro-control (MC) units and processing elements (PE). It had been hypothesized that superlinear speed-up over the number of PEs could be attained with this feature, and experimental results verified this. 5. Support for barrier synchronization of MIMD tasks. This feature was exploited in some non-standard ways to show the ability to decouple variant length SIMD instructions into multiple MIMD streams for an overall performance benefit. This type of study is expected to continue in the future on PASM and other parallel machines at Purdue. This report should serve as a guide for this future work as well
Recommended from our members
Efficient FPGA implementation and power modelling of image and signal processing IP cores
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage
and signal processing application areas such as consumer electronics, instrumentation,
medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA
devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the
work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of
cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area.
A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM
is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed
Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs
Due to the huge success and rapid development of convolutional neural networks (CNNs), there is a growing demand for hardware accelerators that accommodate a variety of CNNs to improve their inference latency and energy efficiency, in order to enable their deployment in real-time applications. Among popular platforms, field-programmable gate arrays (FPGAs) have been widely adopted for CNN acceleration because of their capability to provide superior energy efficiency and low-latency processing, while supporting high reconfigurability, making them favorable for accelerating rapidly evolving CNN algorithms. This article introduces a highly customized streaming hardware architecture that focuses on improving the compute efficiency for streaming applications by providing full-stack acceleration of CNNs on FPGAs. The proposed accelerator maps most computational functions, that is, convolutional and deconvolutional layers into a singular unified module, and implements the residual and concatenative connections between the functions with high efficiency, to support the inference of mainstream CNNs with different topologies. This architecture is further optimized through exploiting different levels of parallelism, layer fusion, and fully leveraging digital signal processing blocks (DSPs). The proposed accelerator has been implemented on Intel's Arria 10 GX1150 hardware and evaluated with a wide range of benchmark models. The results demonstrate a high performance of over 1.3 TOP/s of throughput, up to 97% of compute [multiply-accumulate (MAC)] efficiency, which outperforms the state-of-the-art FPGA accelerators
Efficient reconfigurable architectures for 3D medical image compression
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Recently, the more widespread use of three-dimensional (3-D) imaging modalities,
such as magnetic resonance imaging (MRI), computed tomography (CT), positron
emission tomography (PET), and ultrasound (US) have generated a massive amount
of volumetric data. These have provided an impetus to the development of other
applications, in particular telemedicine and teleradiology. In these fields, medical
image compression is important since both efficient storage and transmission of data
through high-bandwidth digital communication lines are of crucial importance.
Despite their advantages, most 3-D medical imaging algorithms are computationally intensive with matrix transformation as the most fundamental operation involved in the transform-based methods. Therefore, there is a real need for high-performance systems, whilst keeping architectures exible to allow
for quick upgradeability with real-time applications. Moreover, in order to obtain
efficient solutions for large medical volumes data, an efficient implementation of
these operations is of significant importance. Reconfigurable hardware, in the form of field programmable gate arrays (FPGAs) has been proposed as viable system
building block in the construction of high-performance systems at an economical price.
Consequently, FPGAs seem an ideal candidate to harness and exploit their inherent
advantages such as massive parallelism capabilities, multimillion gate counts, and
special low-power packages. The key achievements of the work presented in this thesis are summarised as follows. Two architectures for 3-D Haar wavelet transform (HWT) have been proposed based on transpose-based computation and partial reconfiguration suitable for 3-D medical imaging applications. These applications require continuous hardware servicing, and as a result dynamic partial reconfiguration (DPR) has been introduced. Comparative study for both non-partial and partial reconfiguration implementation has shown that DPR offers many advantages and leads to a compelling solution for implementing computationally intensive applications such as 3-D medical image compression. Using DPR, several large systems are mapped to small hardware resources, and the area, power consumption as well as maximum frequency are
optimised and improved. Moreover, an FPGA-based architecture of the finite Radon transform (FRAT)with three design strategies has been proposed: direct implementation of pseudo-code with a sequential or pipelined description, and block random access memory (BRAM)- based method. An analysis with various medical imaging modalities has been carried out. Results obtained for image de-noising implementation using FRAT exhibits
promising results in reducing Gaussian white noise in medical images. In terms of
hardware implementation, promising trade-offs on maximum frequency, throughput
and area are also achieved. Furthermore, a novel hardware implementation of 3-D medical image compression system with context-based adaptive variable length coding (CAVLC)
has been proposed. An evaluation of the 3-D integer transform (IT) and the discrete
wavelet transform (DWT) with lifting scheme (LS) for transform blocks reveal that
3-D IT demonstrates better computational complexity than the 3-D DWT, whilst
the 3-D DWT with LS exhibits a lossless compression that is significantly useful for
medical image compression. Additionally, an architecture of CAVLC that is capable
of compressing high-definition (HD) images in real-time without any buffer between
the quantiser and the entropy coder is proposed. Through a judicious parallelisation, promising results have been obtained with limited resources. In summary, this research is tackling the issues of massive 3-D medical volumes data that requires compression as well as hardware implementation to accelerate the
slowest operations in the system. Results obtained also reveal a significant achievement in terms of the architecture efficiency and applications performance.Ministry of Higher Education Malaysia (MOHE),
Universiti Tun Hussein Onn Malaysia (UTHM) and the British Counci
Efficient parallel processing with optical interconnections
With the advances in VLSI technology, it is now possible to build chips which can each contain thousands of processors. The efficiency of such chips in executing parallel algorithms heavily depends on the interconnection topology of the processors. It is not possible to build a fully interconnected network of processors with constant fan-in/fan-out using electrical interconnections. Free space optics is a remedy to this limitation. Qualities exclusive to the optical medium are its ability to be directed for propagation in free space and the property that optical channels can cross in space without any interference. In this thesis, we present an electro-optical interconnected architecture named Optical Reconfigurable Mesh (ORM). It is based on an existing optical model of computation. There are two layers in the architecture. The processing layer is a reconfigurable mesh and the deflecting layer contains optical devices to deflect light beams. ORM provides three types of communication mechanisms. The first is for arbitrary planar connections among sets of locally connected processors using the reconfigurable mesh. The second is for arbitrary connections among N of the processors using the electrical buses on the processing layer and N2 fixed passive deflecting units on the deflection layer. The third is for arbitrary connections among any of the N2 processors using the N2 mechanically reconfigurable deflectors in the deflection layer. The third type of communication mechanisms is significantly slower than the other two. Therefore, it is desirable to avoid reconfiguring this type of communication during the execution of the algorithms. Instead, the optical reconfiguration can be done before the execution of each algorithm begins. Determining a right configuration that would be suitable for the entire configuration of a task execution is studied in this thesis. The basic data movements for each of the mechanisms are studied. Finally, to show the power of ORM, we use all three types of communication mechanisms in the first O(logN) time algorithm for finding the convex hulls of all figures in an N x N binary image presented in this thesis
Distribution of Low Latency Machine Learning Algorithm
Mobile networks are evolving towards centralization and cloudification while bringing computing power to the edge, opening its scope to a new range of applications. Ultra-low latency is one of the requirements of such applications in the next generation of mobile networks (5G), where deep learning is expected to play a big role. Hence, to enable the usage of deep learning solutions on the edge cloud, ultra-low latency inference must be investigated.
The study presented here relies on the usage of an in-house framework (CRUN) that enables the distribution of acceleration on data center environment. The objective of this thesis is to leverage the best solution for the inference of a machine learning algorithm for an anomaly detection application using neural networks in the edge cloud context. To evaluate the obtained results with CRUN a comparison work is also carried out. Five inference solutions were compared using CPU, GPU and FPGA.
The results show a superior performance in terms of latency for all CRUN experiments, that basically comprehends three cases. The first one utilizing the RTL anomaly detection neural network as a baseline solution, the second using the same baseline code but unrolling the biggest layer for obtaining reduced latency and the third by distributing the neural network in two FPGAs. The requirements for this solution were to obtain latency between 20 μs to 40 μs for inference time and at least 20000 inferences per second. These goals were categorically fulfilled for all CRUN experiments, providing 30 μs latency in average, while the second best solution provided 272 μs
A symbolic emulator for shuffle synthesis on the NVIDIA PTX code
Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU's parallel architecture. Directive-based programming models, such as OpenACC, are one such method that easily enables parallel computing by just adhering code annotations to code loops. Such abstract models, however, often prevent programmers from making additional low-level optimizations to take advantage of the advanced architectural features of GPUs because the actual generated computation is hidden from the application developer. This paper describes and implements a novel flexible optimization technique that operates by inserting a code emulator phase to the tail-end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis by substituting dynamic information and thus allowing for further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thus enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating warp-level shuffle instructions that are difficult to use by even advanced GPU programmers. Lastly, evaluating our tool with a benchmark suite and complex application code, we provide a detailed study to assess the benefits of shuffle instructions across four generations of GPU architectures.We are funded by the EPEEC project from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 801051 and the Ministerio de Ciencia e Innovación-Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033). This work has been partially carried out on the ACME cluster owned by CIEMAT and funded by the Spanish Ministry of Economy and Competitiveness project CODEC-OSE (RTI2018-096006-B-I00).Peer ReviewedPostprint (published version
- …