2,490 research outputs found

    An efficient multiple precision floating-point Multiply-Add Fused unit

    Multiply-Add Fused (MAF) units play a key role in processor performance for a variety of applications. The objective of this paper is to present a multi-functional, multiple-precision floating-point Multiply-Add Fused (MAF) unit. The proposed MAF is reconfigurable and able to execute one quadruple-precision MAF instruction, two double-precision instructions, or four single-precision instructions in parallel. The MAF architecture features a dual-path organization that reduces the latency of the floating-point add (FADD) instruction and uses the minimum number of operating components to keep the area low. The proposed MAF design was implemented in a 65 nm silicon process, achieving a maximum operating frequency of 293.5 MHz at 381 mW of power.
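    For context, the operation such a MAF unit performs at each precision is a fused multiply-add, a*b + c evaluated with a single rounding at the end. The short C sketch below only illustrates that single-rounding behaviour using the standard C99 fmaf/fma library calls and deliberately chosen operands; it does not model the paper's reconfigurable dual-path datapath.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Operands chosen so the exact product a*a needs one bit more than
     * the format keeps; a separately rounded multiply loses that bit. */
    float  af = 1.0f + 0x1.0p-12f, cf = -(1.0f + 0x1.0p-11f);
    double ad = 1.0  + 0x1.0p-27,  cd = -(1.0  + 0x1.0p-26);

    /* Round the product to the working precision first (unfused)... */
    float  pf = af * af;
    double pd = ad * ad;
    printf("single unfused: %.10e\n", pf + cf);           /* 0.0 */
    printf("double unfused: %.17e\n", pd + cd);           /* 0.0 */

    /* ...versus one fused multiply-add with a single final rounding. */
    printf("single fused  : %.10e\n", fmaf(af, af, cf));  /* 2^-24 */
    printf("double fused  : %.17e\n", fma(ad, ad, cd));   /* 2^-54 */
    return 0;
}

    With a C99 compiler (linking -lm), the unfused lines print 0 while the fused lines retain the low-order term, which is exactly the accuracy benefit a hardware fused unit provides at every supported precision.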

    Flexible Multiple-Precision Fused Arithmetic Units for Efficient Deep Learning Computation

    Deep learning has achieved great success in recent years. In many fields of application, such as computer vision, biomedical analysis, and natural language processing, deep learning can achieve performance that even exceeds human level. However, behind this superior performance is the expensive hardware cost required to implement deep learning operations, which are both computation intensive and memory intensive. Many research works in the literature have focused on improving the efficiency of deep learning operations. In this thesis, special focus is put on improving deep learning computation, and several efficient arithmetic unit architectures are proposed and optimized for deep learning computation. The contents of this thesis can be divided into three parts: (1) the optimization of general-purpose arithmetic units for deep learning computation; (2) the design of deep-learning-specific arithmetic units; and (3) the optimization of deep learning computation using 3D memory architecture.
    Deep learning models are usually trained on graphics processing units (GPUs), and the computations are done with single-precision floating-point numbers. However, recent works proved that deep learning computation can be accomplished with low-precision numbers. Half-precision numbers are becoming more and more popular in deep learning computation due to their lower hardware cost compared to single-precision numbers. Conventional floating-point arithmetic units support single precision and beyond to achieve better precision; for deep learning computation, however, the computations are intensive and low-precision computation is desired to achieve better throughput. As the popularity of half precision rises, half-precision operations also need to be supported. Moreover, deep learning computation contains many dot-product operations, and therefore support for mixed-precision dot-product operations can be explored in a multiple-precision architecture. In this thesis, a multiple-precision fused multiply-add (FMA) architecture is proposed. It supports half-, single-, double-, and quadruple-precision FMA operations. In addition, it supports 2-term mixed-precision dot-product operations. Compared to the conventional multiple-precision FMA architecture, the newly added half-precision support and mixed-precision dot product bring only minor resource overhead. The proposed FMA can be used as a general-purpose arithmetic unit. Due to its support of parallel half-precision computations and mixed-precision dot-product computations, it is especially suitable for deep learning computation.
    For the design of deep-learning-specific computation units, more optimizations can be performed. First, a merged fixed-point and floating-point multiply-accumulate (MAC) unit is proposed. As deep learning computation can be accomplished with low-precision number formats, support for high-precision floating-point operations can be eliminated. In this design, the half-precision floating-point format is supported to provide a large dynamic range that handles the small gradients of deep learning training, while 8-bit fixed-point 2-term dot-product computation is supported for deep learning inference. Second, a flexible multiple-precision MAC unit architecture is proposed. The proposed MAC unit supports both fixed-point and floating-point operations. For the floating-point format, the proposed unit supports one 16-bit MAC operation or the sum of two 8-bit multiplications plus a 16-bit addend. To make the proposed MAC unit more versatile, the bit-widths of the exponent and mantissa can be flexibly exchanged, and by setting the exponent bit-width to zero, the unit also supports fixed-point operations. For the fixed-point format, the proposed unit supports one 16-bit MAC or the sum of two 8-bit multiplications plus a 16-bit addend, and it can be further divided to support the sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the proposed MAC unit supports accumulation of eight 1-bit logic AND operations to enable support of binary neural networks. Finally, a MAC architecture based on the posit format, a promising numerical format for deep learning computation, is proposed to facilitate the use of posits in deep learning.
    In addition to the above-mentioned arithmetic units, an improved hybrid memory cube (HMC) architecture is proposed for weight-sharing deep neural network processing. By modifying the HMC instruction set and logic layer, the major part of the deep learning computation can be accomplished inside memory. The proposed design reduces the memory bandwidth requirement and thus the energy consumed by memory data transfer.
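    As a concrete reference for one of the fixed-point modes listed above (the sum of two 8-bit multiplications plus a 16-bit addend), the C model below computes the same arithmetic in software. It is only a behavioral sketch: the function name and the 32-bit accumulator width are choices made here for illustration, not details of the proposed MAC unit.

#include <stdint.h>
#include <stdio.h>

/* Behavioral sketch of a 2-term 8-bit dot product with a 16-bit addend.
 * A 32-bit accumulator is used so this reference cannot overflow; the
 * internal widths of the actual MAC unit are not specified here. */
static int32_t dot2_mac_s8(int8_t a0, int8_t b0,
                           int8_t a1, int8_t b1,
                           int16_t addend)
{
    int32_t p0 = (int32_t)a0 * b0;   /* each 8b x 8b product fits in 16 signed bits */
    int32_t p1 = (int32_t)a1 * b1;
    return p0 + p1 + addend;         /* worst-case sum still fits in 18 signed bits */
}

int main(void)
{
    printf("%d\n", dot2_mac_s8(100, 50, -20, 30, 1000));          /* 5400 */
    printf("%d\n", dot2_mac_s8(-128, -128, -128, -128, -32768));  /*    0 */
    return 0;
}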

    ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

    Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry-standard precisions embody accuracy and performance compromises for general-purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple-precision ALUs focused on predefined, static precisions, and little prior work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic-precision ALU architectures for both fixed-point and floating-point arithmetic to enable better performance, energy efficiency, and numeric accuracy. The new architectures enable dynamically defined precision, including support for vectorization, and prevent the performance and energy loss caused by applying unnecessarily high precision to computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks; this dissertation includes demonstration implementations of floating-point adder, multiplier, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic-precision ALU speed is nearly the same as that of traditional ALU approaches, although the dynamic-precision ALU is nearly twice as large.
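    To make the sub-block idea concrete, the toy C model below composes an n-by-4-bit unsigned addition from chained 4-bit block additions, so the active precision is chosen at run time by the number of participating sub-blocks. It is a software illustration of the composition principle under assumptions made here (unsigned addition, a simple carry chain), not the dissertation's adder, multiplier, or FMA circuits.

#include <stdint.h>
#include <stdio.h>

/* Toy model of precision built from 4-bit sub-blocks: add two unsigned
 * operands using nblocks chained 4-bit additions. Precision is selected
 * at run time by choosing how many sub-blocks participate. */
static uint64_t add_blocked(uint64_t a, uint64_t b, int nblocks)
{
    uint64_t sum = 0;
    unsigned carry = 0;
    for (int i = 0; i < nblocks; i++) {
        unsigned an = (a >> (4 * i)) & 0xF;           /* i-th nibble of a    */
        unsigned bn = (b >> (4 * i)) & 0xF;           /* i-th nibble of b    */
        unsigned s  = an + bn + carry;                /* 4-bit block adder   */
        carry = s >> 4;                               /* carry to next block */
        sum |= (uint64_t)(s & 0xF) << (4 * i);
    }
    return sum;                                       /* final carry dropped */
}

int main(void)
{
    /* 12-bit precision = 3 sub-blocks: 0xFFF + 0x001 wraps to 0x000. */
    printf("0x%llx\n", (unsigned long long)add_blocked(0xFFF, 0x001, 3));
    /* 16-bit precision = 4 sub-blocks: the same operands no longer wrap. */
    printf("0x%llx\n", (unsigned long long)add_blocked(0xFFF, 0x001, 4));
    return 0;
}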

    Design and implementation of an out of order execution engine of floating point arithmetic operations

    In this thesis, work is undertaken towards the design, in hardware description languages, and the implementation, on an FPGA, of an out-of-order execution engine for floating-point arithmetic operations. This thesis work is part of a project called Lagarto.

    Power and Thermal Management of System-on-Chip


    HIGH-SPEED CO-PROCESSORS BASED ON REDUNDANT NUMBER SYSTEMS

    There is a growing demand for high-speed arithmetic co-processors for use in applications with computationally intensive tasks. For instance, Fast Fourier Transform (FFT) co-processors are used in real-time multimedia services, and financial applications use decimal co-processors to perform large amounts of decimal computations. Using redundant number systems to eliminate word-wide carry propagation within interim operations is a well-known technique to increase the speed of arithmetic hardware units. Redundant number systems are most useful in applications where many consecutive arithmetic operations are performed before the final result, which makes them advantageous for arithmetic co-processors. This thesis discusses the implementation of two popular arithmetic co-processors based on redundant number systems: the binary FFT co-processor and the decimal arithmetic co-processor.
    FFT co-processors consist of several consecutive multipliers and adders over complex numbers. FFT architectures are implemented based on fixed-point or floating-point arithmetic. The main advantage of floating-point over fixed-point arithmetic is the wide dynamic range it introduces; moreover, it avoids numerical issues such as scaling and overflow/underflow concerns, at the expense of higher cost. Furthermore, a floating-point implementation allows an FFT co-processor to collaborate with general-purpose processors, offloading computationally intensive tasks from the primary processor. The first part of this thesis, which is devoted to FFT co-processors, proposes a new FFT architecture that uses a new Binary-Signed-Digit (BSD) carry-limited adder, a new floating-point BSD multiplier, and a new floating-point BSD three-operand adder. Finally, a new unit labeled Fused-Dot-Product-Add (FDPA) is designed to compute AB+CD+E over floating-point BSD operands.
    The second part of the thesis discusses decimal arithmetic operations implemented in hardware using redundant number systems; these operations are popular in decimal floating-point co-processors. A new signed-digit decimal adder is proposed along with a sequential decimal multiplier that uses redundant number systems to increase the operating frequency of the multiplier. New redundant decimal division and square-root units are also proposed. The architectures proposed in this thesis were all implemented in a hardware description language (Verilog) and synthesized using Synopsys Design Compiler. The evaluation results demonstrate the speed improvement of the new arithmetic units over previous pertinent works; consequently, the FFT and decimal co-processors designed in this thesis operate at least 10% faster than previous works. These architectures are meant to fulfill the demand for high-speed co-processors in applications such as multimedia services and financial computations.
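    As a minimal illustration of the redundancy that such co-processors exploit, the C sketch below evaluates the value of a binary signed-digit (BSD) vector with digits in {-1, 0, +1} and shows two different encodings of the same integer. It only demonstrates the representation itself, under the digit ordering assumed here (least-significant digit first); the carry-limited adder, multiplier, and FDPA circuits proposed in the thesis are not modeled.

#include <stdio.h>

/* Value of a binary signed-digit (BSD) number, digits in {-1, 0, +1},
 * least-significant digit first: sum of d[i] * 2^i (Horner evaluation). */
static long bsd_value(const int *d, int n)
{
    long v = 0;
    for (int i = n - 1; i >= 0; i--)
        v = 2 * v + d[i];
    return v;
}

int main(void)
{
    /* Redundancy: the same value has several BSD encodings, which is what
     * lets signed-digit adders limit carry propagation to a few positions. */
    int a[] = { 1, 1, 0, 1 };        /*  1 + 2 + 8   = 11 */
    int b[] = { -1, 0, -1, 0, 1 };   /* -1 - 4 + 16  = 11 */
    printf("%ld %ld\n", bsd_value(a, 4), bsd_value(b, 5));
    return 0;
}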

    Configurable Architectures For Multi-mode Floating Point Adders


    ์–‘์žํ™”๋œ ํ•™์Šต์„ ํ†ตํ•œ ์ €์ „๋ ฅ ๋”ฅ๋Ÿฌ๋‹ ํ›ˆ๋ จ ๊ฐ€์†๊ธฐ ์„ค๊ณ„

    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems), February 2022. Advisor: Dongsuk Jeon.
    With the advent of the deep learning era, the computational need for processing deep neural networks (DNNs) has increased dramatically, both in terms of training neural networks on various tasks and in performing inference with trained networks for specific use cases.
To address those needs, many custom hardware designs were proposed in industry and academia, ranging from systems based on field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs) for deployment inside data centers to acceleration blocks in systems-on-chip (SoCs) for low-power processing in mobile devices. In this dissertation, custom integrated-circuit hardware for energy-efficient processing of neural network training is designed, fabricated, and measured to evaluate different methodologies that can be utilized for more energy-efficient processing under the same training-performance constraints. In particular, these methodologies are grouped into three categories for evaluation: (1) Training algorithm. While standard deep neural network training is performed with the back-propagation (BP) algorithm, we investigate alternative training algorithms, such as neuromorphic learning algorithms with spiking neurons or bio-plausible algorithms with asymmetric feedback, to exploit their computational properties for more efficient hardware implementation. (2) Low-precision arithmetic. One of the most powerful methods for increasing efficiency in DNN accelerators is scaling numerical precision. While utilizing low-precision numerics for the inference phase of DNNs is well studied, training DNNs without performance degradation is relatively more challenging; a novel numerical scheme for training DNNs across various models and scenarios is proposed in this dissertation. (3) System implementation techniques. In the actual realization of a custom training system in integrated circuits, a nearly infinite design space leads to vastly different quality of results depending on the dataflow inside the chip, system load balancing, acceleration and gating blocks, et cetera. Design techniques that lead to better performance and efficiency are introduced in this dissertation.
    First, a neuromorphic learning system for classifying handwritten digits (MNIST) is introduced. This learning system aims to deliver low training overhead while maintaining the training performance of classical machine learning. To achieve this goal, a neuromorphic learning algorithm is modified for a lower operation count and lower memory buffer requirements while maintaining, or even improving, machine learning performance. Moreover, implementation techniques such as an update-skipping mechanism and lock-free parameter updates allow even lower training overhead, dynamically reducing training energy overhead from 25.6% to 7.5%. With these proposed methodologies, this system greatly improves the accuracy-energy trade-off of on-chip learning systems while showing learning performance close to that of classical DNN training through back-propagation.
    Second, a programmable DNN training processor with a custom numerical format is introduced. While prior DNN inference accelerators have utilized 8-bit integers, implementing 8-bit numerics for a training accelerator remained a challenge due to the higher precision requirements of the backward step of DNN training. To overcome this limitation, a custom 8-bit floating-point format dubbed 8-bit floating point with shared exponent bias (FP8-SEB) is introduced in this dissertation.
Moreover, a processing architecture based on a 24-way fused-multiply-adder (FMA) tree greatly increases processing energy efficiency per MAC and is complemented by a novel 2-dimensional routing data-path that exploits spatiality to increase data reuse in the forward, backward, and weight-gradient steps of convolutional neural networks. This DNN training processor is implemented with a custom vector processing unit, acceleration instructions, and direct memory access to external DRAM for end-to-end DNN training on various models and datasets. Compared against a prior low-precision training processor on ResNet-18 training, this work achieves 2.48× higher energy efficiency, 43% fewer DRAM accesses, and 0.8%p higher training accuracy. Both of the introduced designs are fabricated in silicon and verified in simulations and in physical measurements. The design methodologies are evaluated using simulations of the fabricated chips and measurements of monitored data and power consumption under varying conditions that expose the design techniques in effect. Based on the obtained measurements, the efficiency of the bio-plausible training algorithms, novel numerical formats, and system implementation techniques is analyzed and discussed in this dissertation.
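    Since the abstract names the FP8-SEB format but not its exact field layout, the C sketch below makes the idea concrete under stated assumptions: a 1-4-3 sign/exponent/mantissa split, one shared exponent bias per group derived from the group's largest magnitude, round-to-nearest mantissas, and crude flush-to-zero/clamping at the range limits. It round-trips a few values through this assumed grid for inspection and is not the numerics of the fabricated processor.

#include <math.h>
#include <stdio.h>

/* Sketch of quantizing a group of values to an assumed FP8 layout
 * (1 sign, 4 exponent, 3 mantissa bits) with one shared exponent bias
 * per group, then dequantizing back to double for inspection. Field
 * widths, bias selection, and flush/clamp behaviour are assumptions. */

#define EXP_MAX  15   /* 4-bit exponent field */
#define MAN_BITS 3    /* stored mantissa bits */

/* Choose the shared bias so the largest magnitude in the group lands
 * on the top exponent code of the 4-bit exponent range. */
static int shared_bias(const double *x, int n)
{
    double amax = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(x[i]) > amax) amax = fabs(x[i]);
    int e;
    frexp(amax, &e);              /* amax = f * 2^e, f in [0.5, 1)  */
    return EXP_MAX - (e - 1);     /* unbiased exponent of amax: e-1 */
}

/* Quantize one value onto the assumed FP8-with-shared-bias grid and
 * return its dequantized double value. */
static double fp8_seb_roundtrip(double x, int bias)
{
    if (x == 0.0) return 0.0;
    int e;
    double f = frexp(fabs(x), &e);               /* |x| = f * 2^e        */
    double sig = 2.0 * f;                        /* 1.xxx form, in [1,2) */
    double m = round((sig - 1.0) * (1 << MAN_BITS));
    if (m == (1 << MAN_BITS)) { m = 0.0; e++; }  /* mantissa overflow    */
    int efield = (e - 1) + bias;
    if (efield < 0) return 0.0;                  /* below range: flush   */
    if (efield > EXP_MAX) efield = EXP_MAX;      /* above range: clamp   */
    double y = (1.0 + m / (1 << MAN_BITS)) * ldexp(1.0, efield - bias);
    return copysign(y, x);
}

int main(void)
{
    double x[] = { 0.37, -1.5, 0.002, 96.0 };
    int n = 4, bias = shared_bias(x, n);
    for (int i = 0; i < n; i++)
        printf("%10.5f -> %10.5f\n", x[i], fp8_seb_roundtrip(x[i], bias));
    return 0;
}

    For example, with the group {0.37, -1.5, 0.002, 96.0} the shared bias is derived from 96.0, so the smaller values are quantized onto correspondingly coarser steps of the shared grid.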