2 research outputs found

    GPU-in-the-loop Simulation Techniques for CPU/GPU Heterogeneous Parallel Platforms

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Advisor: Soonhoi Ha.

Mobile GPUs have been widely adopted in embedded systems to handle the complex graphics computations required by modern 3D games and highly interactive user interfaces. Moreover, as mobile GPUs gain computation power and become increasingly programmable, they are also used to accelerate general-purpose computations in fields such as physics simulation and mathematics. Unlike server GPUs, mobile GPUs usually have fewer cores, since only a limited amount of battery power is available. It is therefore important to utilize both the CPUs and the GPU of a mobile platform efficiently in order to satisfy the performance and power constraints.

For design space exploration of such a CPU/GPU heterogeneous architecture, or for debugging its software in the early design stage, a full system simulator is typically used, which includes simulation models of all hardware components in the target system. Unfortunately, building a full system simulator with a GPU model is not always possible: for some GPUs no simulator is available, and where one exists it is often prohibitively slow, since such simulators are mainly developed for exploring the internal micro-architecture of GPUs. To solve these problems, this thesis proposes a GPU-in-the-loop (GIL) simulation technique that integrates a real GPU with a full system simulator for CPU/GPU heterogeneous platforms.

In the first part of the thesis, we propose a system call-level simulation technique in which the full system simulator interacts with a GPU board at the system call level. Since the shared on-chip memory of the target system is modeled by two separate memories, one in the simulator and one on the board, memory synchronization is the most challenging problem. To handle it, address translation tables are maintained for the shared memory regions, and these regions are synchronized whenever a system call that triggers GPU execution on the board is invoked.
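The abstract gives no code, but the synchronization scheme can be illustrated. Below is a minimal C sketch under the assumption that each shared region is mirrored by plain host buffers; all names (xlat_entry_t, xlat_register, xlat_sync) are invented for illustration and are not the thesis's implementation.

    /* Minimal sketch (invented names, not the thesis's code): an address
     * translation table that records each shared region, plus a helper
     * that synchronizes all regions between simulator and GPU board. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t guest_base;  /* base address of the region in the simulated system */
        void    *sim_copy;    /* simulator-side backing store of the region */
        void    *board_copy;  /* staging buffer mirroring the board's memory */
        size_t   size;
    } xlat_entry_t;

    #define MAX_REGIONS 64
    static xlat_entry_t table[MAX_REGIONS];
    static int n_regions;

    /* Record a shared region when the simulated driver maps it. */
    int xlat_register(uint64_t guest_base, void *sim, void *board, size_t size)
    {
        if (n_regions == MAX_REGIONS)
            return -1;
        table[n_regions++] = (xlat_entry_t){ guest_base, sim, board, size };
        return 0;
    }

    /* Copy simulator -> board before a GPU-triggering system call is
     * forwarded, and board -> simulator after the GPU has finished. */
    void xlat_sync(int to_board)
    {
        for (int i = 0; i < n_regions; i++) {
            if (to_board)
                memcpy(table[i].board_copy, table[i].sim_copy, table[i].size);
            else
                memcpy(table[i].sim_copy, table[i].board_copy, table[i].size);
        }
    }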
To model GPU execution in the simulator, an interrupt-based modeling technique is proposed: the simulator generates a GPU interrupt after the GPU execution time measured on the real board has elapsed in simulated time.
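A minimal C sketch of this interrupt-based model follows; the event-scheduler hooks (sim_now_ns, sim_schedule_ns, sim_raise_irq) are assumptions standing in for whatever the host simulator actually exposes.

    /* Minimal sketch (assumed scheduler hooks, not the thesis's API):
     * when the board reports a measured kernel runtime, schedule a
     * virtual GPU interrupt that far into the simulated future, so the
     * guest driver's interrupt handler observes completion at a
     * realistic point in simulated time. */
    #include <stddef.h>
    #include <stdint.h>

    extern uint64_t sim_now_ns(void);                           /* current simulated time */
    extern void sim_schedule_ns(uint64_t when_ns,
                                void (*fn)(void *), void *arg); /* event scheduler */
    extern void sim_raise_irq(int irq);                         /* assert a guest IRQ line */

    #define GPU_IRQ_LINE 64   /* illustrative interrupt number */

    static void gpu_completion_event(void *arg)
    {
        (void)arg;
        sim_raise_irq(GPU_IRQ_LINE);   /* the guest ISR now sees the GPU as done */
    }

    /* Called when the forwarded GPU job returns from the real board
     * together with its measured execution time. */
    void model_gpu_execution(uint64_t measured_ns)
    {
        sim_schedule_ns(sim_now_ns() + measured_ns, gpu_completion_event, NULL);
    }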
In the second part of the thesis, we propose an API-level simulation technique in which the simulator and the board interact at the level of GPU API calls. Because simulating the device driver of the original software stack makes it difficult to support various GPUs, a synthetic library is defined that replaces the GPU library in the original software stack, ensuring that the device driver is never entered (a minimal sketch of this interposition idea follows the table of contents below). To model the timing of API execution, a sleep function is called in the synthetic driver so that the API time measured on the board elapses in simulated time. Among existing GPU APIs, we propose API-level simulation techniques for three commonly used ones: OpenCL, CUDA, and OpenGL ES.

Several challenging problems, such as asynchronous behavior, multi-process support, and memory synchronization for complex data structures, are handled to ensure correct simulation. Experimental results confirm that the proposed techniques provide fast simulation speed with reasonable timing accuracy, so they can be used not only for software development but also for system-level performance estimation. Moreover, because real hardware is used, the proposed approach makes full system simulation of CPU/GPU heterogeneous platforms feasible even when no GPU simulator is available.

Table of contents:
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Contribution
        1.3 Thesis Organization
    Chapter 2 Related Works
        2.1 Acceleration techniques for GPU simulation
            2.1.1 Parallel Simulation
            2.1.2 Sampled Simulation
            2.1.3 Statistical Simulation
            2.1.4 HW-accelerated Simulation
        2.2 CPU/GPU Simulation framework
        2.3 Summary
    Chapter 3 GPU-in-the-loop Simulation
        3.1 Basic Idea
        3.2 Different levels of CPU/GPU Interaction
        3.3 Detection Mechanism
        3.4 Memory Coherency Problem
        3.5 Overall GIL simulation flow
    Chapter 4 System call-level GIL Simulation
        4.1 Target System
            4.1.1 Typical Execution Scenario of the Systems
        4.2 Memory Synchronization
            4.2.1 Address Translation Table
        4.3 Timing Modeling
            4.3.1 Interrupt Modeling
            4.3.2 Regression-based timing correction for GPU time
            4.3.3 An Example of System-level GIL Simulation Scenario
        4.4 Experiments
            4.4.1 Parallelization for diff operation
            4.4.2 Simulation Time Analysis
            4.4.3 Contention overhead in Pixel Processors (PP)
            4.4.4 Internal System Behavior Profiling
            4.4.5 Accuracy Evaluation
        4.5 Summary
    Chapter 5 API-Level GIL Simulation
        5.1 Differences between API-level and System call-level techniques
            5.1.1 Synthetic Library
        5.2 Timing Modeling
            5.2.1 Regression-based compensation for timing error
        5.3 Memory Synchronization
        5.4 GPGPU API (CUDA & OpenCL) Implementation Case
            5.4.1 Asynchronous Behavior Modeling
            5.4.2 Implementation Issues
            5.4.3 Experiments
            5.4.4 Simulation Overhead
        5.5 OpenGL ES Implementation Case
            5.5.1 Background
            5.5.2 Additional modification for SW stack
            5.5.3 Memory synchronization
            5.5.4 Multi-Process Support
            5.5.5 High-level Timing Modeling for other GPUs
            5.5.6 Porting To a New GPU Board
            5.5.7 Experiments
        5.6 Summary
    Chapter 6 Conclusion and Future Work
    Bibliography
    Abstract (in Korean)
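As promised above, here is a minimal C sketch of the synthetic-library interposition. The board_finish() RPC helper is an assumption, the cl_* types are reduced to opaque stubs, and this is not the thesis's code; only clFinish has the real OpenCL signature.

    /* Minimal sketch (assumed RPC helper, not the thesis's code): the
     * synthetic library exports the same symbol as the real OpenCL
     * library, forwards the call to the GPU board, and then sleeps for
     * the measured duration so that the time elapses on the simulated
     * clock instead of the device driver ever being entered. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdint.h>
    #include <time.h>

    typedef int32_t cl_int;
    typedef struct _cl_command_queue *cl_command_queue;

    /* Assumed helper: run the real clFinish on the board over RPC and
     * report how long it took there, in nanoseconds. */
    extern cl_int board_finish(cl_command_queue queue, uint64_t *elapsed_ns);

    cl_int clFinish(cl_command_queue queue)
    {
        uint64_t ns = 0;
        cl_int err = board_finish(queue, &ns);   /* execute on the real GPU */

        /* Sleeping inside the simulated system makes the measured board
         * time show up on the simulator's virtual clock. */
        struct timespec ts = { (time_t)(ns / 1000000000u),
                               (long)(ns % 1000000000u) };
        nanosleep(&ts, NULL);
        return err;
    }

For asynchronous calls such as clEnqueueNDRangeKernel, the measured time would instead be recorded and the sleep deferred to the next synchronization point, in the spirit of the asynchronous behavior modeling of Section 5.4.1.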

    Three-Dimensional Processing-In-Memory-Architectures: A Holistic Tool For Modeling And Simulation

    Get PDF
The steadily widening performance gap between processor and memory architectures, commonly known as the Memory Wall, requires novel concepts to achieve further scaling of processing performance. Since memory has been identified as the limiting component of a Von Neumann architecture, this work addresses that constraint. Although three-dimensional memories alleviate the effects of the Memory Wall, such memories alone are insufficient for future scaling. Owing to their higher efficiency, architectures that integrate processing capacity into the memory (Processing-In-Memory, PIM) are a promising alternative; however, PIM simulation models remain scarce. As a consequence, a flexible simulation tool for three-dimensional stacked memories was developed and then extended to model three-dimensional PIM architectures. The tool simulates stacked memories such as the Hybrid Memory Cube in a standard-compliant manner, and at the same time offers high accuracy by modeling elementary data packets (FLITs) in combination with the hardware-validated BOBSim simulator (an illustrative packet layout is sketched below).
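The following C sketch is illustrative only, based on the public HMC packet format rather than the dissertation's code, and shows the kind of elementary packet a FLIT-level model operates on.

    /* Illustrative sketch (public HMC packet format, not the
     * dissertation's code): a request packet expressed as 128-bit FLITs.
     * A packet-level model routes and times such FLITs through links and
     * vaults instead of simulating the DRAM arrays cycle by cycle. */
    #include <stdint.h>

    typedef struct {
        uint64_t lo, hi;      /* one 128-bit FLIT */
    } flit_t;

    typedef struct {
        uint8_t  cub;         /* target cube ID */
        uint64_t adrs;        /* vault/bank/DRAM address */
        uint16_t tag;         /* matches a response to its request */
        uint8_t  lng;         /* packet length in FLITs */
        uint8_t  cmd;         /* command, e.g. read, write, atomic */
        flit_t   payload[8];  /* up to 8 data FLITs (128 bytes) */
    } hmc_packet_t;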
A specifically designed simulation clock tree additionally enables rapid execution (a speculative sketch follows this abstract): measurements show a 100x speedup in the functional mode and a 2x speedup in the clock-cycle-accurate mode. With the aid of a purpose-built, binary-compatible GPU accelerator model, the modeling of a fully three-dimensional PIM architecture is demonstrated, with the maximum hardware resources constrained to match a PIM accelerator from the literature. A representative, memory-bound geophysical imaging algorithm is used to evaluate both the stand-alone GPU model and the integrated PIM compound. On its own, the GPU simulation model shows significantly improved simulation speed while deviating by only 6% from a Verilator model. Subsequently, various configurations of the integrated PIM accelerator are evaluated: depending on the chosen configuration, the algorithm either reaches up to 140 GFLOPS of actual processing performance or achieves a maximum computing efficiency of 30% (synthetic) and 24.5% (real), the latter being a twofold improvement over the state of the art. A concluding discussion examines these results in depth.
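The abstract does not detail the clock-tree design referenced above; the following C sketch is one plausible reading only, not the dissertation's actual mechanism, and every name in it is invented.

    /* Speculative sketch (one plausible reading of the abstract): a
     * clock tree whose nodes tick a subtree only while it reports
     * pending work. In a functional, packet-level run most
     * cycle-accurate leaves stay idle, which is one way such a tree can
     * yield a large simulation speedup. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct clk_node {
        struct clk_node **children;              /* subtrees in this clock domain */
        size_t n_children;
        bool (*has_work)(struct clk_node *self); /* NULL: always descend */
        void (*tick)(struct clk_node *self);     /* per-cycle behavior (leaves) */
    } clk_node;

    /* Advance one cycle, descending only into subtrees with pending work. */
    void clk_tick(clk_node *n)
    {
        if (n->has_work && !n->has_work(n))
            return;                              /* skip an idle subtree entirely */
        if (n->tick)
            n->tick(n);
        for (size_t i = 0; i < n->n_children; i++)
            clk_tick(n->children[i]);
    }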