774 research outputs found

    Parallel Architectures and Parallel Algorithms for Integrated Vision Systems

    Get PDF
    Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems

    GRAPE-6: The massively-parallel special-purpose computer for astrophysical particle simulation

    Full text link
    In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively-parallel special-purpose computer for astrophysical NN-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional systems, though it can be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are (a) The processor chip of GRAPE-6 integrates 6 force-calculation pipelines, compared to one pipeline of GRAPE-4 (which needed 3 clock cycles to calculate one interaction), (b) the clock speed is increased from 32 to 90 MHz, and (c) the total number of processor chips is increased from 1728 to 2048. These improvements resulted in the peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.Comment: Accepted for publication in PASJ, scheduled to appear in Vol. 55, No.

    The S2 VLBI Correlator: A Correlator for Space VLBI and Geodetic Signal Processing

    Get PDF
    We describe the design of a correlator system for ground and space-based VLBI. The correlator contains unique signal processing functions: flexible LO frequency switching for bandwidth synthesis; 1 ms dump intervals, multi-rate digital signal-processing techniques to allow correlation of signals at different sample rates; and a digital filter for very high resolution cross-power spectra. It also includes autocorrelation, tone extraction, pulsar gating, signal-statistics accumulation.Comment: 44 pages, 13 figure

    Implementation of JPEG compression and motion estimation on FPGA hardware

    Full text link
    A hardware implementation of JPEG allows for real-time compression in data intensivve applications, such as high speed scanning, medical imaging and satellite image transmission. Implementation options include dedicated DSP or media processors, FPGA boards, and ASICs. Factors that affect the choice of platform selection involve cost, speed, memory, size, power consumption, and case of reconfiguration. The proposed hardware solution is based on a Very high speed integrated circuit Hardware Description Language (VHDL) implememtation of the codec with prefered realization using an FPGA board due to speed, cost and flexibility factors; The VHDL language is commonly used to model hardware impletations from a top down perspective. The VHDL code may be simulated to correct mistakes and subsequently synthesized into hardware using a synthesis tool, such as the xilinx ise suite. The same VHDL code may be synthesized into a number of sifferent hardware architetcures based on constraints given. For example speed was the major constraint when synthesizing the pipeline of jpeg encoding and decoding, while chip area and power consumption were primary constraints when synthesizing the on-die memory because of large area. Thus, there is a trade off between area and speed in logic synthesis

    ์–‘์žํ™”๋œ ํ•™์Šต์„ ํ†ตํ•œ ์ €์ „๋ ฅ ๋”ฅ๋Ÿฌ๋‹ ํ›ˆ๋ จ ๊ฐ€์†๊ธฐ ์„ค๊ณ„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ์œตํ•ฉ๊ณผํ•™๊ธฐ์ˆ ๋Œ€ํ•™์› ์œตํ•ฉ๊ณผํ•™๋ถ€(์ง€๋Šฅํ˜•์œตํ•ฉ์‹œ์Šคํ…œ์ „๊ณต), 2022.2. ์ „๋™์„.๋”ฅ๋Ÿฌ๋‹์˜ ์‹œ๋Œ€๊ฐ€ ๋„๋ž˜ํ•จ์— ๋”ฐ๋ผ, ์‹ฌ์ธต ์ธ๊ณต ์‹ ๊ฒฝ๋ง (DNN)์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์š”๊ตฌ๋˜๋Š” ํ•™์Šต ๋ฐ ์ถ”๋ก  ์—ฐ์‚ฐ๋Ÿ‰ ๋˜ํ•œ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜์˜€๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ์‹œ๋Œ€์˜ ๋„๋ž˜์™€ ํ•จ๊ป˜ ๋‹ค์–‘ํ•œ ์ž‘์—…์— ๋Œ€ํ•œ ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ ๋ฐ ํŠน์ • ์šฉ๋„์— ๋Œ€ํ•ด ํ›ˆ๋ จ๋œ ์‹ ๊ฒฝ๋ง ์ถ”๋ก  ์ˆ˜ํ–‰ ์ธก๋ฉด์—์„œ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง (DNN) ์ฒ˜๋ฆฌ์— ๋Œ€ํ•œ ์ปดํ“จํŒ… ์š”๊ตฌ๊ฐ€ ๊ทน์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜์˜€์œผ๋ฉฐ, ์ด๋Ÿฌํ•œ ์ถ”์„ธ๋Š” ์ธ๊ณต์ง€๋Šฅ์˜ ์‚ฌ์šฉ์ด ๋”์šฑ ๋ฒ”์šฉ์ ์œผ๋กœ ์ง„ํ™”ํ•จ์— ๋”ฐ๋ผ ๋”์šฑ ๊ฐ€์†ํ™” ๋  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋œ๋‹ค. ์ด๋Ÿฌํ•œ ์—ฐ์‚ฐ ์š”๊ตฌ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์„ผํ„ฐ ๋‚ด๋ถ€์— ๋ฐฐ์น˜ํ•˜๊ธฐ ์œ„ํ•œ FPGA (Field-Programmable Gate Array) ๋˜๋Š” ASIC (Application-Specific Integrated Circuit) ๊ธฐ๋ฐ˜ ์‹œ์Šคํ…œ์—์„œ ์ €์ „๋ ฅ์„ ์œ„ํ•œ SoC (System-on-Chip)์˜ ๊ฐ€์† ๋ธ”๋ก์— ์ด๋ฅด๊ธฐ๊นŒ์ง€ ๋‹ค์–‘ํ•œ ๋งž์ถคํ˜• ํ•˜๋“œ์›จ์–ด๊ฐ€ ์‚ฐ์—… ๋ฐ ํ•™๊ณ„์—์„œ ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š”, ์ธ๊ณต ์‹ ๊ฒฝ๋ง์˜ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ํ›ˆ๋ จ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋งž์ถคํ˜• ์ง‘์  ํšŒ๋กœ ํ•˜๋“œ์›จ์–ด๋ฅผ ๋ณด๋‹ค ์—๋„ˆ์ง€ ํšจ์œจ์ ์œผ๋กœ ์„ค๊ณ„ํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์•ˆํ•˜๊ณ  ์‹ค์ œ ์ €์ „๋ ฅ ์ธ๊ณต ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•˜๊ณ  ์ œ์ž‘ํ•˜์—ฌ, ๊ทธ ํšจ์œจ์„ ํ‰๊ฐ€ํ•˜๊ณ ์ž ํ•œ๋‹ค. ํŠนํžˆ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์ €์ „๋ ฅ ๊ณ ์„ฑ๋Šฅ ์„ค๊ณ„ ๋ฐฉ๋ฒ•๋ก ์„ ํฌ๊ฒŒ ์„ธ ๊ฐ€์ง€๋กœ ๋ถ„๋ฅ˜ํ•˜์—ฌ ๋ถ„์„์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ์ด๋Ÿฌํ•œ ๋ถ„๋ฅ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. (1) ํ›ˆ๋ จ ์•Œ๊ณ ๋ฆฌ์ฆ˜. ํ‘œ์ค€์ ์œผ๋กœ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง ํ›ˆ๋ จ์€ ์—ญ์ „ํŒŒ (Back-Propagation) ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์ˆ˜ํ–‰๋˜์ง€๋งŒ, ๋” ํšจ์œจ์ ์ธ ํ•˜๋“œ์›จ์–ด ๊ตฌํ˜„์„ ์œ„ํ•ด ์ŠคํŒŒ์ดํฌ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ†ต์‹ ํ•˜๋Š” ๋‰ด๋Ÿฐ์ด ์žˆ๋Š” ๋‰ด๋กœ๋ชจํ”ฝ ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋˜๋Š” ๋น„๋Œ€์นญ ํ”ผ๋“œ๋ฐฑ ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ์ƒ๋ฌผํ•™์  ๋ชจ์‚ฌ๋„๊ฐ€ ๋†’์€ (Bio-Plausible) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ™œ์šฉํ•˜์—ฌ ๋” ํšจ์œจ์ ์ธ ํ›ˆ๋ จ ์‹œ์Šคํ…œ์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์กฐ์‚ฌ ๋ฐ ์ œ์‹œํ•˜๊ณ , ๊ทธ ํ•˜๋“œ์›จ์–ด ํšจ์œจ์„ฑ์„ ๋ถ„์„ํ•˜์˜€๋‹ค. (2) ์ €์ •๋ฐ€๋„ ์ˆ˜ ์ฒด๊ณ„ ํ™œ์šฉ. ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” DNN ๊ฐ€์†๊ธฐ์—์„œ ํšจ์œจ์„ฑ์„ ๋†’์ด๋Š” ๊ฐ€์žฅ ๊ฐ•๋ ฅํ•œ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋Š” ์ˆ˜์น˜ ์ •๋ฐ€๋„๋ฅผ ์กฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. DNN์˜ ์ถ”๋ก  ๋‹จ๊ณ„์— ๋‚ฎ์€ ์ •๋ฐ€๋„ ์ˆซ์ž๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ์ž˜ ์—ฐ๊ตฌ๋˜์—ˆ์ง€๋งŒ, ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด DNN์„ ํ›ˆ๋ จํ•˜๋Š” ๊ฒƒ์€ ์ƒ๋Œ€์ ์œผ ๊ธฐ์ˆ ์  ์–ด๋ ค์›€์ด ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ๋ชจ๋ธ๊ณผ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ DNN์„ ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด ํ›ˆ๋ จํ•˜๊ธฐ ์œ„ํ•œ ์ƒˆ๋กœ์šด ์ˆ˜ ์ฒด๊ณ„๋ฅผ ์ œ์•ˆํ•˜์˜€๋‹ค. (3) ์‹œ์Šคํ…œ ๊ตฌํ˜„ ๊ธฐ๋ฒ•. ์ง‘์  ํšŒ๋กœ์—์„œ ๋งž์ถคํ˜• ํ›ˆ๋ จ ์‹œ์Šคํ…œ์„ ์‹ค์ œ๋กœ ์‹คํ˜„ํ•  ๋•Œ, ๊ฑฐ์˜ ๋ฌดํ•œํ•œ ์„ค๊ณ„ ๊ณต๊ฐ„์€ ์นฉ ๋‚ด๋ถ€์˜ ๋ฐ์ดํ„ฐ ํ๋ฆ„, ์‹œ์Šคํ…œ ๋ถ€ํ•˜ ๋ถ„์‚ฐ, ๊ฐ€์†/๊ฒŒ์ดํŒ… ๋ธ”๋ก ๋“ฑ ๋‹ค์–‘ํ•œ ์š”์†Œ์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ์˜ ํ’ˆ์งˆ์ด ํฌ๊ฒŒ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋” ๋‚˜์€ ์„ฑ๋Šฅ๊ณผ ํšจ์œจ์„ฑ์œผ๋กœ ์ด์–ด์ง€๋Š” ๋‹ค์–‘ํ•œ ์„ค๊ณ„ ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•˜๊ณ  ๋ถ„์„ํ•˜๊ณ ์ž ํ•œ๋‹ค. ์ฒซ์งธ๋กœ, ์†๊ธ€์”จ ๋ถ„๋ฅ˜ ํ•™์Šต์„ ์œ„ํ•œ ๋‰ด๋กœ๋ชจํ”ฝ ํ•™์Šต ์‹œ์Šคํ…œ์„ ์ œ์ž‘ํ•˜์—ฌ ํ‰๊ฐ€ํ•˜์˜€๋‹ค. ์ด ํ•™์Šต ์‹œ์Šคํ…œ์€ ์ „ํ†ต์ ์ธ ๊ธฐ๊ณ„ ํ•™์Šต์˜ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋‚ฎ์€ ํ›ˆ๋ จ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ณตํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•˜์—ฌ ์„ค๊ณ„๋˜์—ˆ๋‹ค. ์ด ๋ชฉ์ ์„ ๋‹ฌ์„ฑํ•˜๊ธฐ ์œ„ํ•ด, ๋” ์ ์€ ์—ฐ์‚ฐ ์š”๊ตฌ๋Ÿ‰๊ณผ ๋ฒ„ํผ ๋ฉ”๋ชจ๋ฆฌ ํ•„์š”์น˜๋ฅผ ์œ„ํ•ด ๊ธฐ์กด์˜ ๋‰ด๋กœ๋ชจํ”ฝ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ˆ˜์ •ํ•˜์˜€์œผ๋ฉฐ, ์ด ๊ณผ์ •์—์„œ ํ›ˆ๋ จ ์„ฑ๋Šฅ ์†์‹ค ์—†์ด ๊ธฐ์กด ์—ญ์ „ํŒŒ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๊ทผ์ ‘ํ•œ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์—…๋ฐ์ดํŠธ๋ฅผ ๊ฑด๋„ˆ๋›ฐ๋Š” ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๊ณ  Lock-Free ๋งค๊ฐœ๋ณ€์ˆ˜ ์—…๋ฐ์ดํŠธ ๋ฐฉ์‹์„ ์ฑ„ํƒํ•˜์—ฌ ํ›ˆ๋ จ์— ์†Œ๋ชจ๋˜๋Š” ์—๋„ˆ์ง€๋ฅผ ํ›ˆ๋ จ์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ ๋™์ ์œผ๋กœ ๊ฐ์†Œ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์‹œ์Šคํ…œ ๊ตฌํ˜„ ๊ธฐ๋ฒ• ๋˜ํ•œ ์†Œ๊ฐœํ•˜๊ณ  ๊ทธ ์„ฑ๋Šฅ์„ ๋ถ„์„ํ•˜์˜€๋‹ค. ์ด๋Ÿฐ ๊ธฐ๋ฒ•์„ ํ†ตํ•ด, ์ด ํ•™์Šต ์‹œ์Šคํ…œ์€ ๊ธฐ์กด์˜ ํ›ˆ๋ จ ์‹œ์Šคํ…œ ๋Œ€๋น„ ๋›ฐ์–ด๋‚œ ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ-์—๋„ˆ์ง€ ์†Œ๋ชจ๋Ÿ‰ ๊ด€๊ณ„๋ฅผ ๋ณด์ด๋ฉด์„œ๋„ ๊ธฐ์กด์˜ ์—ญ์ „ํŒŒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ธฐ๋ฐ˜์˜ ์ธ๊ณต ์‹ ๊ฒฝ๋ง์˜ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜์˜€๋‹ค. ๋‘˜์งธ๋กœ, ํŠน์ˆ˜ ๋ช…๋ น์–ด ์ฒด๊ณ„ ๋ฐ ๋งž์ถคํ˜• ์ˆ˜ ์ฒด๊ณ„๋ฅผ ํ™œ์šฉํ•œ ํ”„๋กœ๊ทธ๋žจ ๊ฐ€๋Šฅํ•œ DNN ํ›ˆ๋ จ์šฉ ํ”„๋กœ์„ธ์„œ๊ฐ€ ์„ค๊ณ„๋˜๊ณ  ์ œ์ž‘๋˜์—ˆ๋‹ค. ๊ธฐ์กด DNN ์ถ”๋ก ์šฉ ๊ฐ€์†๊ธฐ๋Š” 8๋น„ํŠธ ์ •์ˆ˜ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•˜์ง€๋งŒ, DNN ํ•™์Šต ์„ค๊ณ„์‹œ 8๋น„ํŠธ ์ˆ˜ ์ฒด๊ณ„๋ฅผ ์ด์šฉํ•˜๋ฉฐ ํ›ˆ๋ จ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๋ณด์ด์ง€ ์•Š๋Š” ๊ฒƒ์€ ์ƒ๋‹นํ•œ ๊ธฐ์ˆ ์  ๋‚œ์ด๋„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์—ˆ๋‹ค. ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ณต์œ ํ˜• ๋ฉฑ์ง€์ˆ˜ ํŽธํ–ฅ๊ฐ’์„ ํ™œ์šฉํ•˜๋Š” 8๋น„ํŠธ ๋ถ€๋™ ์†Œ์ˆ˜์  ์ˆ˜ ์ฒด๊ณ„๋ฅผ ์ƒˆ๋กœ์ด ์ œ์•ˆํ•˜์˜€์œผ๋ฉฐ, ์ด ์ˆ˜ ์ฒด๊ณ„์˜ ํšจ์šฉ์„ฑ์„ ๋ณด์ด๊ธฐ ์œ„ํ•ด ์ด DNN ํ›ˆ๋ จ ํ”„๋กœ์„ธ์„œ๊ฐ€ ์„ค๊ณ„๋˜์—ˆ๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ด ํ”„๋กœ์„ธ์„œ๋Š” ๋‹จ์ˆœํ•œ MAC ๊ธฐ๋ฐ˜ Matrix-Multiplication ๊ฐ€์†๊ธฐ๊ฐ€ ์•„๋‹Œ, Fused-Multiply-Add ํŠธ๋ฆฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ๊ฐ€์†๊ธฐ ๊ตฌ์กฐ๋ฅผ ์ฑ„ํƒํ•˜๋ฉด์„œ๋„, ์นฉ ๋‚ด๋ถ€์—์„œ์˜ ๋ฐ์ดํ„ฐ ์ด๋™๋Ÿ‰ ์ตœ์ ํ™” ๋ฐ ์ปจ๋ณผ๋ฃจ์…˜์˜ ๊ณต๊ฐ„์„ฑ์„ ๊ทน๋Œ€ํ™”ํ•  ์ˆ˜ ์žˆ๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์ „๋‹ฌ ์œ ๋‹›์„ ์ž…์ถœ๋ ฅ๋ถ€์— 2D๋กœ ์ œ์ž‘ํ•˜์—ฌ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜์—์„œ์˜ ์ปจ๋ณผ๋ฃจ์…˜ ์ถ”๋ก  ๋ฐ ํ›ˆ๋ จ ๋‹จ๊ณ„์—์„œ์˜ ๊ณต๊ฐ„์„ฑ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•˜์˜€๋‹ค. ๋ณธ DNN ํ›ˆ๋ จ ํ”„๋กœ์„ธ์„œ๋Š” ๋งž์ถคํ˜• ๋ฒกํ„ฐ ์—ฐ์‚ฐ๊ธฐ, ๊ฐ€์† ๋ช…๋ น์–ด ์ฒด๊ณ„, ์™ธ๋ถ€ DRAM์œผ๋กœ์˜ ์ง์ ‘์ ์ธ ์ ‘๊ทผ ์ œ์–ด ๋ฐฉ์‹ ๋“ฑ์„ ํ†ตํ•ด ํ•œ ํ”„๋กœ์„ธ์„œ ๋‚ด์—์„œ DNN ํ›ˆ๋ จ์˜ ๋ชจ๋“  ๋‹จ๊ณ„๋ฅผ ๋‹ค์–‘ํ•œ ๋ชจ๋ธ ๋ฐ ํ™˜๊ฒฝ์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์—ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ณธ ํ”„๋กœ์„ธ์„œ๋Š” ๊ธฐ์กด์˜ ์—ฐ๊ตฌ์—์„œ ์ œ์‹œ๋˜์—ˆ๋˜ ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์„œ์— ๋น„ํ•ด ๋™์ผ ๋ชจ๋ธ์„ ์ฒ˜๋ฆฌํ•˜๋ฉด์„œ 2.48๋ฐฐ ๊ฐ€๋Ÿ‰ ๋” ๋†’์€ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ, 43% ์ ์€ DRAM ์ ‘๊ทผ ์š”๊ตฌ๋Ÿ‰, 0.8%p ๋†’์€ ํ›ˆ๋ จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค. ์ด๋ ‡๊ฒŒ ์†Œ๊ฐœ๋œ ๋‘ ๊ฐ€์ง€ ์„ค๊ณ„๋Š” ๋ชจ๋‘ ์‹ค์ œ ์นฉ์œผ๋กœ ์ œ์ž‘๋˜์–ด ๊ฒ€์ฆ๋˜์—ˆ๋‹ค. ์ธก์ • ๋ฐ์ดํ„ฐ ๋ฐ ์ „๋ ฅ ์†Œ๋ชจ๋Ÿ‰์„ ํ†ตํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆ๋œ ์ €์ „๋ ฅ ๋”ฅ๋Ÿฌ๋‹ ํ›ˆ๋ จ ์‹œ์Šคํ…œ ์„ค๊ณ„ ๊ธฐ๋ฒ•์˜ ํšจ์œจ์„ ๊ฒ€์ฆํ•˜์˜€์œผ๋ฉฐ, ํŠนํžˆ ์ƒ๋ฌผํ•™์  ๋ชจ์‚ฌ๋„๊ฐ€ ๋†’์€ ํ›ˆ๋ จ ์•Œ๊ณ ๋ฆฌ์ฆ˜, ๋”ฅ๋Ÿฌ๋‹ ํ›ˆ๋ จ์— ์ตœ์ ํ™”๋œ ์ˆ˜ ์ฒด๊ณ„, ๊ทธ๋ฆฌ๊ณ  ํšจ์œจ์ ์ธ ์‹œ์Šคํ…œ ๊ตฌํ˜„ ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ์‹œ์Šคํ…œ์˜ ์—๋„ˆ์ง€ ํšจ์œจ์„ฑ์„ ๊ฐœ์„ ํ•˜๋Š” ๋ชฉํ‘œ๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€๋Š”์ง€ ์ •๋Ÿ‰์ ์œผ๋กœ ๋ถ„์„ํ•˜์˜€๋‹ค.With the advent of the deep learning era, the computational need for processing deep neural networks (DNN) have increased dramatically, both in terms of performing training the neural networks on various tasks as well as in performing inference on the trained neural networks for specific use cases. To address those needs, many custom hardware ranging from systems based on field-programmable gate arrays (FPGA) or application-specific integrated circuits (ASIC) for deployment inside data centers to acceleration blocks in system-on-chip (SoC) for low-power processing in mobile devices were proposed. In this dissertation, custom integrated circuits hardware for energy efficient processing of training neural networks are designed, fabricated, and measured for evaluation of different methodologies that could be utilized for more energy efficient processing under same training performance constraints. In particular, these methodologies are categorized to three different categories for evaluation: (1) Training algorithm. While standard deep neural network training is performed with the back-propagation (BP) algorithm, we investigate various training algorithms, such as neuromorphic learning algorithms with spiking neurons or bio-plausible algorithms with asymmetric feedback for exploiting computational properties for more efficient hardware implementation. (2) Low-precision arithmetic. One of the most powerful methods for increased efficiency in DNN accelerators is through scaling numerical precision. While utilizing low precision numerics for inference phase of DNNs is well studied, training DNNs without performance degradation is relatively more challenging. A novel numerical scheme for training DNNs in various models and scenarios is proposed in this dissertation. (3) System implementation techniques. In actual realization of a custom training system in integrated circuits, nearly infinite design space leads to vastly different quality of results depending on dataflow inside the chip, system load balancing, acceleration and gating blocks, et cetera. Different design techniques which leads to better performance and efficiency are introduced in this dissertation. First, a neuromorphic learning system for classifying handwritten digits (MNIST) is introduced. This learning system aims to deliver low training overhead while maintaining the training performance of classical machine learning. In order to achieve this goal, a neuromorphic learning algorithm is modified for lower operation count and memory buffer requirement while maintaining or even obtaining higher machine learning performance. Moreover, implementation techniques such as update skipping mechanism and lock-free parameter updates allow even lower training overhead, dynamically reducing training energy overhead from 25.6% to 7.5%. With these proposed methodologies, this system greatly improves the accuracy-energy trade-off in on-chip learning system as well as showing close learning performance to classical DNN training through back propagation. Second, a programmable DNN training processor with a custom numerical format is introduced. While prior DNN inference accelerators have utilized 8-bit integers, implementing 8-bit numerics for a training accelerator remained to be a challenge due to higher precision requirements in the backward step of DNN training. To overcome this limitation, a custom 8-bit floating point format dubbed 8-bit floating point with shared exponent bias (FP8-SEB) is introduced in this dissertation. Moreover, a processing architecture of 24-way fused-multiply-adder (FMA) tree greatly increases processing energy efficiency per MAC, while complemented with a novel 2-dimensional routing data-path for making use of spatiality to increase data reuse in both forward, backward, and weight gradient step of convolutional neural networks. This DNN training processor is implemented with a custom vector processing unit, acceleration instructions, and DMA in external DRAMs for end-to-end DNN training in various models and datasets. Compared against prior low-precision training processor in ResNet-18 training, this work achieves 2.48ร— higher energy efficiency, 43% less DRAM accesses, and 0.8\p higher training accuracy. Both of the designs introduced are fabricated in real silicon and verified both in simulations and in physical measurements. Design methodologies are carefully evaluated using simulations of the fabricated chip and measurements with monitored data and power consumption under varying conditions that expose the design techniques in effect. The efficiency of various biologically plausible algorithms, novel numerical formats, and system implementation techniques are analyzed in discussed in this dissertations based on the obtained measurements.Abstract i Contents iv List of Tables vii List of Figures viii 1 Introduction 1 1.1 Study Background 1 1.2 Purpose of Research 6 1.3 Contents 8 2 Hardware-Friendly Learning Algorithms 9 2.1 Modified Learning Rule for Neuromorphic System 9 2.1.1 The Segregated Dendrites Algorithm 9 2.1.2 Modification of the Segregated Dendrites Algorithm 13 2.2 Non-BP Learning Rules on DNN Training Processor 18 2.2.1 Feedback Alignment and Direct Feedback Alignment 18 2.2.2 Reduced Memory Access in Non-BP Learning Rules 23 3 Optimal Numerical Format for DNN Training 27 3.1 Related Works 27 3.2 Proposed FP8 with Shared Exponent Bias 30 3.3 Training Results with FP8-SEB 33 3.4 Fused Multiply Adder Tree for FP8-SEB 37 4 System Implementations 41 4.1 Neuromorphic Learning System 41 4.1.1 Bio-Plausibility 41 4.1.2 Top Level Architecture 43 4.1.3 Lock-Free Weight Updates 47 4.1.4 Update Skipping Mechanism 48 4.2 Low-Precision DNN Training System 51 4.2.1 Top Level Architecture 52 4.2.2 Optimized Auxiliary Instructions in the Vector Processing Unit 55 4.2.3 Buffer Organization 57 4.2.4 Input-Output 2D Spatial Routing for FMA Trees 60 5 Measurement Results 70 5.1 Measurement Results on the Neuromorphic Learning System 70 5.1.1 Measurement Results and Test Setup . 70 5.1.2 Comparison against other works 73 5.1.3 Scalability of the Learning Algorithm 77 5.2 Measurements Results on the Low-Precision DNN Training Processor 79 5.2.1 Measurement Results in Benchmarked Tests 79 5.2.2 Comparison Against Other DNN Training Processors 89 6 Conclusion 93 6.1 Discussion for Future Works 93 6.1.1 Scaling to CNNs in the Neuromorphic System 93 6.1.2 Discussions for Improvements on DNN Training Processor 96 6.2 Conclusion 99 Abstract (In Korean) 108๋ฐ•

    Second year technical report on-board processing for future satellite communications systems

    Get PDF
    Advanced baseband and microwave switching techniques for large domestic communications satellites operating in the 30/20 GHz frequency bands are discussed. The nominal baseband processor throughput is one million packets per second (1.6 Gb/s) from one thousand T1 carrier rate customer premises terminals. A frequency reuse factor of sixteen is assumed by using 16 spot antenna beams with the same 100 MHz bandwidth per beam and a modulation with a one b/s per Hz bandwidth efficiency. Eight of the beams are fixed on major metropolitan areas and eight are scanning beams which periodically cover the remainder of the U.S. under dynamic control. User signals are regenerated (demodulated/remodulated) and message packages are reformatted on board. Frequency division multiple access and time division multiplex are employed on the uplinks and downlinks, respectively, for terminals within the coverage area and dwell interval of a scanning beam. Link establishment and packet routing protocols are defined. Also described is a detailed design of a separate 100 x 100 microwave switch capable of handling nonregenerated signals occupying the remaining 2.4 GHz bandwidth with 60 dB of isolation, at an estimated weight and power consumption of approximately 400 kg and 100 W, respectively

    Dynamically reconfigurable architecture for embedded computer vision systems

    Get PDF
    The objective of this research work is to design, develop and implement a new architecture which integrates on the same chip all the processing levels of a complete Computer Vision system, so that the execution is efficient without compromising the power consumption while keeping a reduced cost. For this purpose, an analysis and classification of different mathematical operations and algorithms commonly used in Computer Vision are carried out, as well as a in-depth review of the image processing capabilities of current-generation hardware devices. This permits to determine the requirements and the key aspects for an efficient architecture. A representative set of algorithms is employed as benchmark to evaluate the proposed architecture, which is implemented on an FPGA-based system-on-chip. Finally, the prototype is compared to other related approaches in order to determine its advantages and weaknesses
    • โ€ฆ
    corecore