2 research outputs found

    GPU-in-the-loop Simulation Techniques for CPU/GPU Heterogeneous Parallel Platforms

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Advisor: Soonhoi Ha.

Mobile GPUs have been widely adopted in embedded systems to handle the complex graphics computations required by modern 3D games and highly interactive user interfaces. Moreover, as mobile GPUs gain computation power and become increasingly programmable, they are also used to accelerate general-purpose computations in fields such as physics simulation and mathematics. Unlike server GPUs, mobile GPUs usually have fewer cores, since only a limited amount of battery power is available. It is therefore important to utilize both the CPUs and the GPU of a mobile platform efficiently in order to satisfy the performance and power constraints.

For design space exploration of such a CPU/GPU heterogeneous architecture, or for debugging its software in the early design stage, a full system simulator is typically used, which includes simulation models of all hardware components in the target system. Unfortunately, building a full system simulator with a GPU model is not always possible: for some GPUs no simulator is available, and where one exists it is often prohibitively slow, since such simulators are mainly developed for exploring the internal micro-architecture of GPUs. To solve these problems, this thesis proposes a GPU-in-the-loop (GIL) simulation technique that integrates a real GPU with a full system simulator for CPU/GPU heterogeneous platforms.

In the first part of the thesis, we propose a system call-level simulation technique in which the full system simulator interacts with a GPU board at the system call level. Since the shared on-chip memory of the target system is modeled by two separate memories, one in the simulator and one on the board, memory synchronization is the most challenging problem. To handle it, address translation tables are maintained for the shared memory regions, and these regions are synchronized whenever a system call that triggers GPU execution on the board is invoked.
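The abstract gives no code, but the synchronization scheme can be illustrated. Below is a minimal C sketch under the assumption that each shared region is mirrored by plain host buffers; all names (xlat_entry_t, xlat_register, xlat_sync) are invented for illustration and are not the thesis's implementation.

    /* Minimal sketch (invented names, not the thesis's code): an address
     * translation table that records each shared region, plus a helper
     * that synchronizes all regions between simulator and GPU board. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t guest_base;  /* base address of the region in the simulated system */
        void    *sim_copy;    /* simulator-side backing store of the region */
        void    *board_copy;  /* staging buffer mirroring the board's memory */
        size_t   size;
    } xlat_entry_t;

    #define MAX_REGIONS 64
    static xlat_entry_t table[MAX_REGIONS];
    static int n_regions;

    /* Record a shared region when the simulated driver maps it. */
    int xlat_register(uint64_t guest_base, void *sim, void *board, size_t size)
    {
        if (n_regions == MAX_REGIONS)
            return -1;
        table[n_regions++] = (xlat_entry_t){ guest_base, sim, board, size };
        return 0;
    }

    /* Copy simulator -> board before a GPU-triggering system call is
     * forwarded, and board -> simulator after the GPU has finished. */
    void xlat_sync(int to_board)
    {
        for (int i = 0; i < n_regions; i++) {
            if (to_board)
                memcpy(table[i].board_copy, table[i].sim_copy, table[i].size);
            else
                memcpy(table[i].sim_copy, table[i].board_copy, table[i].size);
        }
    }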
To model GPU execution in the simulator, an interrupt-based modeling technique is proposed: the simulator generates a GPU interrupt after the GPU execution time measured on the real board has elapsed in simulated time.
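A minimal C sketch of this interrupt-based model follows; the event-scheduler hooks (sim_now_ns, sim_schedule_ns, sim_raise_irq) are assumptions standing in for whatever the host simulator actually exposes.

    /* Minimal sketch (assumed scheduler hooks, not the thesis's API):
     * when the board reports a measured kernel runtime, schedule a
     * virtual GPU interrupt that far into the simulated future, so the
     * guest driver's interrupt handler observes completion at a
     * realistic point in simulated time. */
    #include <stddef.h>
    #include <stdint.h>

    extern uint64_t sim_now_ns(void);                           /* current simulated time */
    extern void sim_schedule_ns(uint64_t when_ns,
                                void (*fn)(void *), void *arg); /* event scheduler */
    extern void sim_raise_irq(int irq);                         /* assert a guest IRQ line */

    #define GPU_IRQ_LINE 64   /* illustrative interrupt number */

    static void gpu_completion_event(void *arg)
    {
        (void)arg;
        sim_raise_irq(GPU_IRQ_LINE);   /* the guest ISR now sees the GPU as done */
    }

    /* Called when the forwarded GPU job returns from the real board
     * together with its measured execution time. */
    void model_gpu_execution(uint64_t measured_ns)
    {
        sim_schedule_ns(sim_now_ns() + measured_ns, gpu_completion_event, NULL);
    }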
In the second part of the thesis, we propose an API-level simulation technique in which the simulator and the board interact at the level of GPU API calls. Because simulating the device driver of the original software stack makes it difficult to support various GPUs, a synthetic library is defined that replaces the GPU library in the original software stack, ensuring that the device driver is never entered (a minimal sketch of this interposition idea follows the table of contents below). To model the timing of API execution, a sleep function is called in the synthetic driver so that the API time measured on the board elapses in simulated time. Among existing GPU APIs, we propose API-level simulation techniques for three commonly used ones: OpenCL, CUDA, and OpenGL ES.

Several challenging problems, such as asynchronous behavior, multi-process support, and memory synchronization for complex data structures, are handled to ensure correct simulation. Experimental results confirm that the proposed techniques provide fast simulation speed with reasonable timing accuracy, so they can be used not only for software development but also for system-level performance estimation. Moreover, because real hardware is used, the proposed approach makes full system simulation of CPU/GPU heterogeneous platforms feasible even when no GPU simulator is available.

Table of contents:
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Contribution
        1.3 Thesis Organization
    Chapter 2 Related Works
        2.1 Acceleration techniques for GPU simulation
            2.1.1 Parallel Simulation
            2.1.2 Sampled Simulation
            2.1.3 Statistical Simulation
            2.1.4 HW-accelerated Simulation
        2.2 CPU/GPU Simulation framework
        2.3 Summary
    Chapter 3 GPU-in-the-loop Simulation
        3.1 Basic Idea
        3.2 Different levels of CPU/GPU Interaction
        3.3 Detection Mechanism
        3.4 Memory Coherency Problem
        3.5 Overall GIL simulation flow
    Chapter 4 System call-level GIL Simulation
        4.1 Target System
            4.1.1 Typical Execution Scenario of the Systems
        4.2 Memory Synchronization
            4.2.1 Address Translation Table
        4.3 Timing Modeling
            4.3.1 Interrupt Modeling
            4.3.2 Regression-based timing correction for GPU time
            4.3.3 An Example of System-level GIL Simulation Scenario
        4.4 Experiments
            4.4.1 Parallelization for diff operation
            4.4.2 Simulation Time Analysis
            4.4.3 Contention overhead in Pixel Processors (PP)
            4.4.4 Internal System Behavior Profiling
            4.4.5 Accuracy Evaluation
        4.5 Summary
    Chapter 5 API-Level GIL Simulation
        5.1 Differences between API-level and System call-level techniques
            5.1.1 Synthetic Library
        5.2 Timing Modeling
            5.2.1 Regression-based compensation for timing error
        5.3 Memory Synchronization
        5.4 GPGPU API (CUDA & OpenCL) Implementation Case
            5.4.1 Asynchronous Behavior Modeling
            5.4.2 Implementation Issues
            5.4.3 Experiments
            5.4.4 Simulation Overhead
        5.5 OpenGL ES Implementation Case
            5.5.1 Background
            5.5.2 Additional modification for SW stack
            5.5.3 Memory synchronization
            5.5.4 Multi-Process Support
            5.5.5 High-level Timing Modeling for other GPUs
            5.5.6 Porting To a New GPU Board
            5.5.7 Experiments
        5.6 Summary
    Chapter 6 Conclusion and Future Work
    Bibliography
    Abstract (in Korean)
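As promised above, here is a minimal C sketch of the synthetic-library interposition. The board_finish() RPC helper is an assumption, the cl_* types are reduced to opaque stubs, and this is not the thesis's code; only clFinish has the real OpenCL signature.

    /* Minimal sketch (assumed RPC helper, not the thesis's code): the
     * synthetic library exports the same symbol as the real OpenCL
     * library, forwards the call to the GPU board, and then sleeps for
     * the measured duration so that the time elapses on the simulated
     * clock instead of the device driver ever being entered. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdint.h>
    #include <time.h>

    typedef int32_t cl_int;
    typedef struct _cl_command_queue *cl_command_queue;

    /* Assumed helper: run the real clFinish on the board over RPC and
     * report how long it took there, in nanoseconds. */
    extern cl_int board_finish(cl_command_queue queue, uint64_t *elapsed_ns);

    cl_int clFinish(cl_command_queue queue)
    {
        uint64_t ns = 0;
        cl_int err = board_finish(queue, &ns);   /* execute on the real GPU */

        /* Sleeping inside the simulated system makes the measured board
         * time show up on the simulator's virtual clock. */
        struct timespec ts = { (time_t)(ns / 1000000000u),
                               (long)(ns % 1000000000u) };
        nanosleep(&ts, NULL);
        return err;
    }

For asynchronous calls such as clEnqueueNDRangeKernel, the measured time would instead be recorded and the sleep deferred to the next synchronization point, in the spirit of the asynchronous behavior modeling of Section 5.4.1.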

    Three-Dimensional Processing-In-Memory-Architectures: A Holistic Tool For Modeling And Simulation

    Get PDF
The steadily widening performance gap between processor and memory architectures, commonly known as the Memory Wall, requires novel concepts to achieve further scaling of processing performance. Since memory has been identified as the limiting component of a Von Neumann architecture, this work addresses that constraint. Although three-dimensional memories alleviate the effects of the Memory Wall, such memories alone are insufficient for future scaling. Owing to their higher efficiency, architectures that integrate processing capacity into the memory (Processing-In-Memory, PIM) are a promising alternative; however, PIM simulation models remain scarce. As a consequence, a flexible simulation tool for three-dimensional stacked memories was developed and then extended to model three-dimensional PIM architectures. The tool simulates stacked memories such as the Hybrid Memory Cube in a standard-compliant manner, and at the same time offers high accuracy by modeling elementary data packets (FLITs) in combination with the hardware-validated BOBSim simulator (an illustrative packet layout is sketched below).
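The following C sketch is illustrative only, based on the public HMC packet format rather than the dissertation's code, and shows the kind of elementary packet a FLIT-level model operates on.

    /* Illustrative sketch (public HMC packet format, not the
     * dissertation's code): a request packet expressed as 128-bit FLITs.
     * A packet-level model routes and times such FLITs through links and
     * vaults instead of simulating the DRAM arrays cycle by cycle. */
    #include <stdint.h>

    typedef struct {
        uint64_t lo, hi;      /* one 128-bit FLIT */
    } flit_t;

    typedef struct {
        uint8_t  cub;         /* target cube ID */
        uint64_t adrs;        /* vault/bank/DRAM address */
        uint16_t tag;         /* matches a response to its request */
        uint8_t  lng;         /* packet length in FLITs */
        uint8_t  cmd;         /* command, e.g. read, write, atomic */
        flit_t   payload[8];  /* up to 8 data FLITs (128 bytes) */
    } hmc_packet_t;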
A specifically designed simulation clock tree additionally enables rapid execution (a speculative sketch follows this abstract): measurements show a 100x speedup in the functional mode and a 2x speedup in the clock-cycle-accurate mode. With the aid of a purpose-built, binary-compatible GPU accelerator model, the modeling of a fully three-dimensional PIM architecture is demonstrated, with the maximum hardware resources constrained to match a PIM accelerator from the literature. A representative, memory-bound geophysical imaging algorithm is used to evaluate both the stand-alone GPU model and the integrated PIM compound. On its own, the GPU simulation model shows significantly improved simulation speed while deviating by only 6% from a Verilator model. Subsequently, various configurations of the integrated PIM accelerator are evaluated: depending on the chosen configuration, the algorithm either reaches up to 140 GFLOPS of actual processing performance or achieves a maximum computing efficiency of 30% (synthetic) and 24.5% (real), the latter being a twofold improvement over the state of the art. A concluding discussion examines these results in depth.
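The abstract does not detail the clock-tree design referenced above; the following C sketch is one plausible reading only, not the dissertation's actual mechanism, and every name in it is invented.

    /* Speculative sketch (one plausible reading of the abstract): a
     * clock tree whose nodes tick a subtree only while it reports
     * pending work. In a functional, packet-level run most
     * cycle-accurate leaves stay idle, which is one way such a tree can
     * yield a large simulation speedup. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct clk_node {
        struct clk_node **children;              /* subtrees in this clock domain */
        size_t n_children;
        bool (*has_work)(struct clk_node *self); /* NULL: always descend */
        void (*tick)(struct clk_node *self);     /* per-cycle behavior (leaves) */
    } clk_node;

    /* Advance one cycle, descending only into subtrees with pending work. */
    void clk_tick(clk_node *n)
    {
        if (n->has_work && !n->has_work(n))
            return;                              /* skip an idle subtree entirely */
        if (n->tick)
            n->tick(n);
        for (size_t i = 0; i < n->n_children; i++)
            clk_tick(n->children[i]);
    }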