1,291 research outputs found

    Test-Delivery Optimization in Manycore SOCs

    We present two test-data delivery optimization algorithms for system-on-chip (SOC) designs with hundreds of cores, where a network-on-chip (NOC) is used as the interconnection fabric. We first present an effective algorithm based on a subset-sum formulation to solve the test-delivery problem in NOCs with arbitrary topology that use dedicated routing. We further propose an algorithm for the important class of NOCs with grid topology and XY routing. The proposed algorithm is the first to co-optimize the number of access points, access-point locations, pin distribution to access points, and assignment of cores to access points for optimal test resource utilization of such NOCs. Test-time minimization is modeled as an NOC partitioning problem and solved with dynamic programming in polynomial time. Both the proposed methods yield high-quality results and are scalable to large SOCs with many cores. We present results on synthetic grid-topology NOC-based SOCs constructed using cores from the ITC'02 benchmark, and demonstrate the scalability of our approach for two SOCs of the future, one with nearly 1,000 cores and the other with 1,600 cores. Test scheduling under power constraints is also incorporated in the optimization framework.
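
The abstract above names a subset-sum formulation but does not spell it out. As a generic illustration of that kind of dynamic programming, and not the paper's algorithm, the sketch below splits per-core test times between two access points so that the larger load is as small as possible. The function name, the two-access-point setting, and the sample data are all assumptions made for this example.

```python
def balance_two_access_points(test_times):
    """Split core test times into two groups with the smallest possible maximum load.

    Subset-sum dynamic programming: `reachable` maps each achievable sum
    to one list of core indices whose test times add up to that sum.
    """
    total = sum(test_times)
    reachable = {0: []}
    for core, t in enumerate(test_times):
        # Iterate over a snapshot so each core is used at most once per subset.
        for s, members in list(reachable.items()):
            if s + t not in reachable:
                reachable[s + t] = members + [core]
    # The best split puts the sum closest to total / 2 on the first access point.
    best = min(reachable, key=lambda s: max(s, total - s))
    group_a = reachable[best]
    group_b = [i for i in range(len(test_times)) if i not in group_a]
    return max(best, total - best), group_a, group_b


if __name__ == "__main__":
    # Hypothetical test times (in arbitrary cycles) for five cores.
    makespan, a, b = balance_two_access_points([7, 5, 4, 3, 1])
    print(makespan, a, b)
```

The running time is proportional to the number of cores times the number of distinct reachable sums, so this is pseudo-polynomial; it is only meant to show the shape of a subset-sum search.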

    Exploiting partial reconfiguration through PCIe for a microphone array network emulator

    Current Microelectromechanical Systems (MEMS) technology enables the deployment of relatively low-cost wireless sensor networks composed of MEMS microphone arrays for accurate sound source localization. However, evaluating and selecting the most accurate and power-efficient network topology is not trivial when considering dynamic MEMS microphone arrays. Although software simulators are usually considered, they involve computationally intensive tasks that take hours to days to complete. In this paper, we present an FPGA-based platform to emulate a network of microphone arrays. Our platform provides a controlled simulated acoustic environment that can evaluate the impact of different network configurations, such as the number of microphones per array, the network's topology, or the detection method used. Data fusion techniques, combining the data collected by each node, are used in this platform. The platform is designed to exploit the FPGA's partial reconfiguration feature to increase the flexibility of the network emulator, and to increase performance through the PCI Express high-bandwidth interface. On the one hand, the network emulator gains flexibility by partially reconfiguring the nodes' architecture at runtime. On the other hand, a set of strategies and heuristics for using partial reconfiguration properly accelerates the emulation by exploiting execution parallelism. Several experiments are presented to demonstrate some of the capabilities of our platform and the benefits of using partial reconfiguration.

    SoC Test Applications Using ACO metaheuristic

    Hardware-aware Software Optimization Techniques for Multiple Convolutional Neural Networks on Embedded Systems

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2021. Soonhoi Ha.
Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on computational power, memory size, and energy consumption, while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. To cope with increasing computation complexity, it is common to use an energy-efficient accelerator, such as a mobile GPU or digital signal processor (DSP) array, or to develop a customized neural processor chip called a neural processing unit (NPU). In the application domain, many optimization techniques have been proposed that change the application algorithm to reduce computation and memory usage, either by developing new deep learning networks or through software optimization techniques that take advantage of the statistical nature of deep learning algorithms. Another approach is hardware-aware software optimization, which finds the performance bottleneck first and then distributes the workload evenly by scheduling the workloads. This dissertation covers hardware-aware software optimization, which is based on a hardware processor or platform. First, we devise a systematic optimization methodology through the experience of participating in the Low Power Image Recognition Challenge (LPIRC) and build a deep learning framework called C-GOOD (C-code Generation Framework for Optimized On-device Deep Learning) based on the devised methodology. For hardware independence, C-GOOD generates C code that can be compiled for and run on any embedded device. C-GOOD also provides various options for application domain optimization that can be applied according to the devised methodology.
By applying the devised methodology to three hardware platforms, NVIDIA Jetson TX2, Odroid XU4, and the Samsung Reconfigurable Processor (SRP), we demonstrate that the methodology is independent of the hardware platform and that application domain optimizations can be performed easily with C-GOOD. Embedded devices are now commonly equipped with heterogeneous processing elements (PEs), and at the same time the need to run multiple deep learning applications concurrently on embedded systems such as self-driving cars and smartphones is increasing. For those systems, we devise an end-to-end methodology to schedule deep learning applications onto heterogeneous PEs and implement a scheduling framework according to the methodology. It covers every step from profiling on real embedded devices to verifying the schedule results on the devices. In this methodology, we use a genetic algorithm (GA)-based scheduling technique for scheduling deep learning applications onto heterogeneous PEs and consider several practical issues in the profiling step. Furthermore, we schedule multiple applications with different throughput constraints, considering the schedulability of the mapped tasks on each processor. After implementing a deep learning inference engine that can utilize heterogeneous PEs using a low-level library of the ARM Compute Library (ACL), we verify the devised methodology by running two widely used convolutional neural networks (CNNs) on a Galaxy S9 smartphone and a Hikey970 board. While the previous optimization methods focus on the computation and processing elements, the performance bottleneck of deep learning accelerators is the communication between off-chip and on-chip memory. Moreover, the off-chip DRAM access volume has a significant effect on the energy consumption of an NPU.
To reduce the impact of off-chip DRAM access on the performance and energy of an NPU, we devise compiler techniques for an NPU to manage multi-bank on-chip memory with two different objectives: one is to minimize the off-chip memory access volume, and the other is to minimize the processing delay caused by unhidden DRAM accesses. The main idea is that, by organizing on-chip memory into multiple banks, we can hide the off-chip DRAM access delay by prefetching data into unused banks during computation, and reduce the off-chip DRAM access volume by storing the output feature map data of each layer in on-chip memory. By running CNN benchmarks on a cycle-level NPU simulator, we demonstrate the trade-off between the two objectives. The devised multi-bank on-chip memory management (MOMM) techniques are extended to consider layer fusion, which aims to maximize the reuse of feature maps between layers. Since the pure layer fusion technique incurs extra computation overhead and increases DRAM access for filter weights, we present a hybrid fusion technique that lies between the per-layer processing technique and the pure layer fusion technique, based on the devised MOMM techniques with two different objectives.
Experiment results confirm the superiority of the hybrid fusion technique to the per-layer processing technique and the pure layer fusion technique.

Front matter: Abstract; Contents; List of Figures; List of Tables; List of Algorithms.
Chapter 1 Introduction: 1.1 Motivation; 1.2 Contribution; 1.3 Dissertation Organization.
Chapter 2 Background: 2.1 Target Hardware (2.1.1 Commodity Hardware Platform; 2.1.2 Application-specific Hardware Accelerator); 2.2 Convolutional Neural Network (2.2.1 Convolution; 2.2.2 Optimization Methods for Convolutional Neural Network).
Chapter 3 Optimization for a Commodity Hardware Platform: 3.1 Joint Optimization Method of Multiple Objectives (3.1.1 Hardware Platform; 3.1.2 Deep Neural Network and Software Framework; 3.1.3 Software Optimization Techniques); 3.2 C-code Generation Framework for Optimized On-device Deep Learning (3.2.1 C-GOOD Framework; 3.2.2 Experiments); 3.3 Scheduling Deep Learning Applications Onto Heterogeneous Processors (3.3.1 Search Space Size; 3.3.2 Hardware Platform and System Model; 3.3.3 Proposed Scheduling Framework and Profiling; 3.3.4 Scheduling a Single Deep Learning Application; 3.3.5 Scheduling Multiple Deep Learning Applications; 3.3.6 Verification with Real Hardware Platforms); 3.4 Related Work (3.4.1 Deep Learning Framework; 3.4.2 Deep Learning Compiler; 3.4.3 Scheduling Deep Learning Application; 3.4.4 Scheduling Multiple Applications on Heterogeneous Processors).
Chapter 4 Optimization for an Application-specific Hardware Accelerator: 4.1 Multi-Bank On-chip Memory Management Problem (4.1.1 Main Idea; 4.1.2 Assumed Dataflow; 4.1.3 Multi-bank On-chip Memory Management Problem); 4.2 Proposed Multi-bank On-chip Memory Management Techniques (4.2.1 DRAM-first Storing Policy; 4.2.2 DRAM Access Minimization Policy (MIN policy); 4.2.3 DRAM Access Hiding Policy (HIDE policy); 4.2.4 Multiple Path Consideration); 4.3 Layer Fusion Technique (4.3.1 Layer Fusion Technique; 4.3.2 Hybrid Fusion Technique); 4.4 Experiments (4.4.1 Setup; 4.4.2 Performance Comparison of MOMM Techniques; 4.4.3 Multiple Path; 4.4.4 Design Space Exploration of NPU Architecture; 4.4.5 Hybrid Fusion Technique); 4.5 Related Work.
Chapter 5 Conclusion.
Bibliography.
Appendix A Proposed Multi-bank On-chip Memory Management Algorithm: A.1 Multi-bank On-chip Memory (MOM) Manager; A.2 MIN policy; A.3 HIDE policy.
Abstract in Korean.
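
The dissertation abstract describes hiding off-chip DRAM latency by prefetching the next input into an unused on-chip bank while the current layer computes. The following is a minimal timing sketch of that idea under strong assumptions: a straight chain of layers, exactly one spare bank, and known per-layer load and compute cycle counts. It is not the dissertation's MOMM algorithm; the function name and the numbers are hypothetical.

```python
def total_latency(load, compute, prefetch):
    """Return total cycles for a chain of layers.

    load[i]    -- cycles to fetch layer i's input from off-chip DRAM
    compute[i] -- cycles to compute layer i once its input is on chip
    prefetch   -- if True, layer i+1's input is loaded into a spare bank
                  while layer i is computing, so only the longer of the
                  two activities contributes to the total
    """
    if len(load) != len(compute):
        raise ValueError("load and compute must have the same length")
    if not load:
        return 0
    if not prefetch:
        # Every DRAM access is exposed: loads and computation are serialized.
        return sum(load) + sum(compute)
    total = load[0]  # the first load cannot be hidden behind any computation
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(compute[i], next_load)
    return total


if __name__ == "__main__":
    load = [4, 6, 2]      # hypothetical DRAM load cycles per layer
    compute = [5, 3, 7]   # hypothetical compute cycles per layer
    print(total_latency(load, compute, prefetch=False))  # 27
    print(total_latency(load, compute, prefetch=True))   # 20
```

In this toy model the only delay that remains unhidden is the first load plus any load that takes longer than the computation it overlaps with, which is the quantity the abstract's second objective aims to minimize.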

    Power constrained test scheduling in system-on-chip design

    With the development of VLSI technologies, and especially with the arrival of deep sub-micron semiconductor process technologies, power dissipation has become a critical factor that cannot be ignored in either the normal operation or the test mode of digital systems. Test scheduling has to take into consideration both test concurrency and power dissipation constraints. To satisfy high fault coverage goals with minimum test application time under given power dissipation constraints, the components of the system should be tested in parallel as much as possible. The main objective of this thesis is to address the test-scheduling problem faced by SOC designers at the system level. By analyzing several existing scheduling approaches, we broaden the basis on which current approaches minimize test application time, and we propose an efficient and integrated technique for the test scheduling of SOCs under power constraints. The proposed merging approach is based on a tree-growing technique and can be used to overlay the block-test sessions in order to further reduce test application time. A number of experiments, based on academic benchmarks and industrial designs, have been carried out to demonstrate the usefulness and efficiency of the proposed approaches.
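
To make the power-constrained scheduling problem in the abstract above concrete, here is a minimal sketch that packs tests into concurrent sessions without exceeding a power budget. It uses a plain first-fit greedy heuristic, which is a deliberately simpler technique than the thesis's tree-growing merging approach. The function name, the `(name, time, power)` tuple layout, and the sample values are assumptions for illustration only.

```python
def schedule_sessions(tests, power_budget):
    """Group tests into sessions; tests within a session run in parallel.

    tests        -- list of (name, time, power) tuples
    power_budget -- maximum summed power allowed in any one session
    Returns (sessions, total_time), where each session is a list of test
    names and total_time is the sum of the session lengths.
    """
    sessions = []
    # Longest tests first, so the first test placed in a session fixes its length.
    for name, time, power in sorted(tests, key=lambda t: t[1], reverse=True):
        if power > power_budget:
            raise ValueError(f"test {name} alone exceeds the power budget")
        for session in sessions:
            if session["power"] + power <= power_budget:
                session["names"].append(name)
                session["power"] += power
                break
        else:
            # No existing session has enough power headroom: open a new one.
            sessions.append({"names": [name], "power": power, "time": time})
    total_time = sum(session["time"] for session in sessions)
    return [session["names"] for session in sessions], total_time


if __name__ == "__main__":
    tests = [("A", 10, 4), ("B", 8, 3), ("C", 6, 5), ("D", 4, 2)]
    print(schedule_sessions(tests, power_budget=8))
```

With these sample values the four tests fit into two sessions with a total time of 16, compared with 28 if they were run one after another.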

    Exploring Processor and Memory Architectures for Multimedia

    Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. Projecting where we are heading, we see these issues becoming ever more challenging with increased mobility as well as advancements in multimedia content, such as the introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution, going from QVGA (320x240) to Full HD (1920x1080), a 27x increase in less than half a decade, but also from codec evolution (MPEG-2 to H.264 AVC), which adds to the computational load. To meet these performance challenges, there have been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves a negative net result in terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, an overall approach to how best to utilize these systems is needed. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system.
The first goal of this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. The second is to propose an automated methodology for optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming to solve a multidimensional optimization problem. Results from this work indicate that using today's most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis, lower total power consumption can be achieved while performance for multimedia applications is improved through enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. The automatic method shows high accuracy, up to 90%, in predicting multimedia memory accesses for a given architecture.

    Cognition-Based Networks: A New Perspective on Network Optimization Using Learning and Distributed Intelligence

    IEEE Access, Volume 3, 2015, Article number 7217798, Pages 1512-1530. Open Access. Zorzi, M. (a), Zanella, A. (a), Testolin, A. (b), De Filippo De Grazia, M. (b), Zorzi, M. (b, c). (a) Department of Information Engineering, University of Padua, Padua, Italy; (b) Department of General Psychology, University of Padua, Padua, Italy; (c) IRCCS San Camillo Foundation, Venice-Lido, Italy.
    Abstract: In response to the new challenges in the design and operation of communication networks, and taking inspiration from how living beings deal with complexity and scalability, in this paper we introduce an innovative system concept called COgnition-BAsed NETworkS (COBANETS). The proposed approach develops around the systematic application of advanced machine learning techniques and, in particular, unsupervised deep learning and probabilistic generative models for system-wide learning, modeling, optimization, and data representation. Moreover, in COBANETS, we propose to combine this learning architecture with the emerging network virtualization paradigms, which make it possible to actuate automatic optimization and reconfiguration strategies at the system level, thus fully unleashing the potential of the learning approach. Compared with the past and current research efforts in this area, the technical approach outlined in this paper is deeply interdisciplinary and more comprehensive, calling for the synergic combination of expertise of computer scientists, communications and networking engineers, and cognitive scientists, with the ultimate aim of breaking new ground through a profound rethinking of how the modern understanding of cognition can be used in the management and optimization of telecommunication networks.

    ReSP: A Nonintrusive Transaction-Level Reflective MPSoC Simulation Platform for Design Space Exploration
