5,092 research outputs found

    Full-System Simulation of Mobile CPU/GPU Platforms

    Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and user-space drivers and just-in-time (JIT) compilers. Yet existing GPU simulators typically abstract away details of the software stack and the GPU instruction set. Partly this is because GPU vendors rarely release sufficient information about their latest GPU products, but it is also due to the lack of an integrated CPU/GPU simulation framework complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. In this paper we develop a full-system simulation environment for a mobile platform that enables users to run a complete and unmodified software stack for a device powered by a state-of-the-art mobile Arm CPU and a Mali-G71 GPU. We validate our simulator against a hardware implementation and Arm's stand-alone GPU simulator, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework by optimizing an advanced computer vision application using simulated statistics unavailable from other simulation approaches or physical GPU implementations. We show that performance optimizations intended for desktop GPUs trigger bottlenecks on mobile GPUs, and we highlight the importance of efficient memory use.
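    The closing point, that desktop-style GPU optimizations backfire on mobile parts, usually comes down to memory traffic: mobile GPUs such as the Mali-G71 share DRAM with the CPU, so explicit staging copies waste bandwidth and power instead of hiding a PCIe transfer. The sketch below is illustrative only (it is not code from the paper) and contrasts a desktop-style copy with an OpenCL zero-copy allocation suited to a unified-memory device.

        /* zero_copy_vs_copy.c -- illustrative sketch, not code from the paper.
         * Contrasts a desktop-style staging copy with the zero-copy path that
         * suits unified-memory mobile GPUs such as the Mali-G71.
         * Build with an OpenCL SDK, e.g.:  cc zero_copy_vs_copy.c -lOpenCL     */
        #include <CL/cl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void) {
            cl_platform_id plat;  cl_device_id dev;
            clGetPlatformIDs(1, &plat, NULL);
            clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
            cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
            cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

            const size_t n = 1 << 20;
            float *host = malloc(n * sizeof(float));
            memset(host, 0, n * sizeof(float));

            /* Desktop-style: a separate device buffer plus an explicit copy.
             * On a discrete GPU this hides the PCIe transfer; on a
             * unified-memory mobile GPU it only burns bandwidth and power. */
            cl_mem d_copy = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                           n * sizeof(float), NULL, NULL);
            clEnqueueWriteBuffer(q, d_copy, CL_TRUE, 0, n * sizeof(float),
                                 host, 0, NULL, NULL);

            /* Mobile-friendly: let the driver allocate memory visible to both
             * sides, then map it and fill it in place instead of copying. */
            cl_mem d_zero = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                           n * sizeof(float), NULL, NULL);
            float *mapped = clEnqueueMapBuffer(q, d_zero, CL_TRUE, CL_MAP_WRITE, 0,
                                               n * sizeof(float), 0, NULL, NULL, NULL);
            memset(mapped, 0, n * sizeof(float));
            clEnqueueUnmapMemObject(q, d_zero, mapped, 0, NULL, NULL);

            clFinish(q);
            printf("staged-copy and zero-copy buffers prepared\n");
            free(host);
            return 0;
        }

    On a unified-memory GPU the second path lets the kernel read the data the CPU just wrote without an intermediate copy, which is exactly the kind of behavior the simulated memory-system statistics make visible.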

    Enabling GPU Support for the COMPSs-Mobile Framework

    Using the GPUs embedded in mobile devices increases the performance of the applications running on them while reducing the energy consumed by their execution. This article presents a task-based solution for adaptive, collaborative heterogeneous computing in mobile cloud environments. To implement our proposal, we extend the COMPSs-Mobile framework, an implementation of the COMPSs programming model for building mobile applications that offload part of the computation to the Cloud, to support offloading computation to GPUs through OpenCL. To evaluate our solution, we subject the prototype to three benchmark applications representing different application patterns. This work is partially supported by the Joint Laboratory on Extreme Scale Computing (JLESC), by the European Union through the Horizon 2020 research and innovation programme under contract 687584 (TANGO Project), by the Spanish Government (TIN2015-65316-P, BES-2013-067167, EEBB-2016-11272, SEV-2011-00067) and by the Generalitat de Catalunya (2014-SGR-1051).
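    In COMPSs the application only declares tasks; the runtime decides where each one runs. The function below is a hedged sketch (names and structure are hypothetical, not COMPSs-Mobile source) of the work the extended runtime has to do once it decides to place a task, here a simple vector addition, on the embedded GPU through OpenCL: build the kernel, bind its arguments, launch an NDRange, and read the result back.

        /* offload_task.c -- hypothetical sketch of offloading one task to the
         * embedded GPU via OpenCL; not taken from the COMPSs-Mobile runtime.
         * Context, device and queue creation are as in the previous sketch.   */
        #include <CL/cl.h>
        #include <stddef.h>

        static const char *kSrc =
            "__kernel void vec_add(__global const float *a,"
            "                      __global const float *b,"
            "                      __global float *c) {"
            "    size_t i = get_global_id(0);"
            "    c[i] = a[i] + b[i];"
            "}";

        /* Run one task on the GPU: compile, bind arguments, launch, read back. */
        int run_task_on_gpu(cl_context ctx, cl_device_id dev, cl_command_queue q,
                            const float *a, const float *b, float *c, size_t n) {
            cl_int err;
            cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, &err);
            if (err != CL_SUCCESS ||
                clBuildProgram(prog, 1, &dev, "", NULL, NULL) != CL_SUCCESS)
                return -1;                       /* fall back to the CPU version */
            cl_kernel k = clCreateKernel(prog, "vec_add", &err);

            size_t bytes = n * sizeof(float);
            cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                       bytes, (void *)a, NULL);
            cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                       bytes, (void *)b, NULL);
            cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

            clSetKernelArg(k, 0, sizeof(cl_mem), &da);
            clSetKernelArg(k, 1, sizeof(cl_mem), &db);
            clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

            /* One work-item per element; a real runtime would size this per device. */
            clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

            clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
            clReleaseKernel(k); clReleaseProgram(prog);
            return 0;
        }

    The programming model itself stays unchanged: the same task can still run locally on the CPU or be offloaded to the Cloud, and the OpenCL path is simply one more scheduling target for the runtime.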

    GPU-in-the-loop Simulation Technique for CPU/GPU Heterogeneous Parallel Platforms

    Thesis (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2016. Advisor: Soonhoi Ha. Mobile GPUs are widely adopted in embedded systems to handle the complex graphics computations required by modern 3D games and highly responsive user interfaces. Moreover, as mobile GPUs gain computational power and become increasingly programmable, they are also used to accelerate general-purpose computations in fields such as physics and mathematics. Unlike server GPUs, mobile GPUs usually have fewer cores, since only a limited amount of battery power is available; it is therefore important to utilize both the CPUs and the GPU of a mobile platform efficiently to satisfy its performance and power constraints. For design-space exploration of such a CPU/GPU heterogeneous architecture, or for debugging software in the early design stage, a full-system simulator is typically used, in which simulation models of all hardware components of the target system are included. Unfortunately, building a full-system simulator with a GPU model is not always possible: either no GPU simulator is available or, if one is, it is prohibitively slow, since such simulators are mainly developed for exploring the internal micro-architecture of GPUs. To solve these problems, this thesis proposes a GPU-in-the-loop (GIL) simulation technique that integrates a real GPU with a full-system simulator for CPU/GPU heterogeneous platforms. In the first part of the thesis, we propose a system call-level simulation technique in which the full-system simulator interacts with a GPU board at the system-call level. Since the shared on-chip memory of the target system is modeled by two separate memories, one in the simulator and one on the board, memory synchronization is the most challenging problem of the proposed technique. To handle it, address translation tables are maintained for the shared memory regions, and these regions are synchronized whenever a system call that triggers GPU execution is invoked on the board. To model the GPU execution in the simulator, an interrupt-based modeling technique is proposed in which the GPU interrupt is generated according to the GPU execution time measured on the real board.
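    The mechanism just described amounts to a small amount of bookkeeping in the simulator: a table of shared regions that exist both in simulated memory and on the board, a synchronization pass hooked on the GPU-triggering system call, and an interrupt scheduled after the execution time measured on the real hardware. The sketch below is purely illustrative; every name and interface in it is an assumption, not code from the thesis.

        /* gil_syscall_sketch.c -- hypothetical bookkeeping for the system
         * call-level GIL technique; all names and interfaces are assumptions,
         * not the thesis implementation.                                      */
        #include <stdint.h>
        #include <stddef.h>

        #define MAX_REGIONS 16
        #define GPU_IRQ     48            /* hypothetical interrupt number */

        /* One entry of the address translation table: a shared-memory region
         * that exists both in the simulator and as a mirror buffer on the board. */
        struct region {
            uint64_t sim_addr;            /* guest address inside the simulator  */
            uint64_t board_addr;          /* address of the mirror on the board  */
            size_t   size;
            void    *staging;             /* host buffer, set at registration
                                             time (registration not shown)       */
        };

        static struct region table[MAX_REGIONS];
        static int num_regions;

        /* Assumed to be provided by the simulator and the board link. */
        void     sim_read_mem(uint64_t addr, void *dst, size_t len);
        void     sim_write_mem(uint64_t addr, const void *src, size_t len);
        void     board_write(uint64_t addr, const void *src, size_t len);
        void     board_read(uint64_t addr, void *dst, size_t len);
        uint64_t board_run_gpu_job(void);                 /* returns measured ns */
        void     sim_schedule_irq(int irq, uint64_t delay_ns);

        /* Invoked when the simulator detects the system call that starts a GPU job. */
        void on_gpu_start_syscall(void) {
            /* 1. Push every shared region from simulated memory to the board. */
            for (int i = 0; i < num_regions; i++) {
                sim_read_mem(table[i].sim_addr, table[i].staging, table[i].size);
                board_write(table[i].board_addr, table[i].staging, table[i].size);
            }
            /* 2. Run the job on the real GPU and record its execution time. */
            uint64_t gpu_ns = board_run_gpu_job();
            /* 3. Pull the results back so the simulated CPU code sees them. */
            for (int i = 0; i < num_regions; i++) {
                board_read(table[i].board_addr, table[i].staging, table[i].size);
                sim_write_mem(table[i].sim_addr, table[i].staging, table[i].size);
            }
            /* 4. Reflect the measured time in the simulation by raising the GPU
             *    completion interrupt gpu_ns of simulated time from now.        */
            sim_schedule_irq(GPU_IRQ, gpu_ns);
        }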
    In the second part of the thesis, we propose an API-level simulation technique in which the simulator and the board interact at the API level. Because the device driver in the original software stack makes it difficult to support various GPUs, a synthetic library is defined that replaces the GPU library in the original software stack, so that the device driver is never exercised. To model the timing of API execution in the simulator, a sleep function is called in the synthetic driver so that the API time measured on the board elapses in simulated time. Among existing GPU APIs, we propose API-level simulation techniques for three commonly used ones: OpenCL, CUDA and OpenGL ES. Several challenging problems, such as asynchronous behavior, multi-process support and memory synchronization for complex data structures, are handled to ensure correct simulation. The experimental results confirm that the proposed technique provides fast simulation with reasonable timing accuracy, so it can be used not only for software development but also for system-level performance estimation. Moreover, the proposed technique makes full-system simulation of CPU/GPU heterogeneous platforms feasible even when no GPU simulator is available. Contents: Introduction; Related Works; GPU-in-the-loop Simulation; System call-level GIL Simulation; API-Level GIL Simulation; Conclusion and Future Work.
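    A hedged sketch of the API-level idea follows: a stand-in OpenCL library, linked into the simulated software stack in place of the vendor one, forwards each call to the real board and then lets the time measured there elapse in the simulation. For brevity the sketch sleeps directly in the library call, whereas the thesis performs the delay in a simulation-only device driver; the transport function and its behavior are assumptions, not the thesis implementation.

        /* synthetic_cl_sketch.c -- hypothetical illustration of the API-level
         * idea: a stand-in OpenCL library that forwards each call to the real
         * board and makes the measured API time elapse in the simulation.      */
        #define _POSIX_C_SOURCE 199309L
        #include <CL/cl.h>
        #include <stdint.h>
        #include <time.h>

        /* Assumed transport to the board-side proxy that executes the real API
         * call and reports how long it took there (in nanoseconds). */
        uint64_t board_forward_enqueue_ndrange(cl_command_queue q, cl_kernel k,
                                               cl_uint dim, const size_t *global,
                                               const size_t *local);

        /* Sleeping inside the simulated guest converts the host-measured time
         * into simulated time: the simulator's clock advances while the guest
         * waits. */
        static void consume_simulated_time(uint64_t ns) {
            struct timespec ts = { (time_t)(ns / 1000000000ULL),
                                   (long)(ns % 1000000000ULL) };
            nanosleep(&ts, NULL);
        }

        /* Synthetic replacement for clEnqueueNDRangeKernel: the simulated
         * application links against this instead of the vendor library, so no
         * GPU device driver has to be simulated. */
        cl_int clEnqueueNDRangeKernel(cl_command_queue q, cl_kernel k, cl_uint work_dim,
                                      const size_t *global_offset, const size_t *global_size,
                                      const size_t *local_size, cl_uint num_events,
                                      const cl_event *wait_list, cl_event *event) {
            (void)global_offset; (void)num_events; (void)wait_list; (void)event;
            uint64_t measured_ns = board_forward_enqueue_ndrange(q, k, work_dim,
                                                                 global_size, local_size);
            consume_simulated_time(measured_ns);
            return CL_SUCCESS;
        }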

    Distributed learning of CNNs on heterogeneous CPU/GPU architectures

    Convolutional Neural Networks (CNNs) have proven to be powerful classification tools in tasks that range from check reading to medical diagnosis, coming close to human perception and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates into larger CNNs and training times so long that not even the adoption of Graphics Processing Units (GPUs) could keep up with them. This problem is partially solved by using more processing units and the distributed training methods offered by several frameworks dedicated to neural network training. However, these techniques do not take full advantage of the parallelization opportunities offered by CNNs or of the cooperative use of heterogeneous devices with different processing capabilities, clock speeds, memory sizes, and so on. This paper presents a new method for the parallel training of CNNs that can be considered a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent 60–90% of the global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices (including their processing capabilities), and other parameters. Results show that this technique reduces the training time without affecting classification performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers of 500 and 1500 kernels, respectively, the best speedups reach 3.28× using four CPUs and 2.45× with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10, will certainly require more than 60–90% of the processing time for calculating convolutions, and speedups will tend to increase accordingly.
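    Because only the convolutional layers are distributed, the achievable speedup is governed by the fraction of training time they account for. A standard Amdahl-style bound (introduced here for illustration, not taken from the paper), with p the convolution fraction and n the number of devices, makes the closing claim concrete:

        S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}

    With p = 0.9 and n = 4 this gives S(4) = 1/(0.1 + 0.225) ≈ 3.1, the same order as the 3.28× reported for four CPUs (the measured figure also reflects cache behavior and device heterogeneity, so the simple bound is only indicative); with p = 0.6 the same four devices are limited to about 1.8×. As datasets push p closer to 1, the ceiling 1/(1 - p) rises, which is why a larger convolution share translates into a larger attainable speedup.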
    • …