매니코어 NoC 아키텍처에 대한 고속 사이클-근사 시뮬레이션 기법

Abstract

학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2017. 2. 하순회.Simulation is a software technique that uses the current available architecture to prototype a future architecture. In computer architecture research, simulation techniques are one of the most important skills. Simulation techniques enable us to obtain important performance indicators of new architectures and to perform the design space exploration using these metrics. Furthermore, the simulator enables rapid software development and optimization on the architecture that does not exist. Despite various known problems, such as slow speed or coverage issue, the reliance on simulation technology in computer architecture research continues to increase. As the density of transistor increases and the performance improvement of the single core hits the ceiling, the newly constructed architectures usually consist of multi/many cores with the network-on-chip, which enables scalable communications. In addition, the implementation of the application itself has also been complicated to effectively utilize these parallel architectures. Thus, simulators for parallel architectures and parallel applications have become extremely complex, and existing sequential simulators no longer simulate these systems at a realistic time. While many of parallel simulation techniques are being developed to solve these problems, they suffer from poor simulation performance or accuracy. In this thesis, we propose and evaluate a novel many-core simulation technique that can obtain the best simulation performance at the cost of minimum simulation error. The proposed parallel many-core simulator is divided into three parts: 1) core simulator, 2) network-on-chip simulator, and 3) simulation backplane. Each core is executed by a core simulator, which communicates with the external simulation backplane via the Interprocess Communication (IPC). Each core simulation is performed individually in a separate host processor. The simulation backplane arranges messages from each core into chronological order, passes them to destination modules, and simulates hardware components other than cores. If the simulation backplane generates a request requiring NoC communication, this request is forwarded to the network simulator and is simulated at the most accurate accuracy level. In this thesis, we proposed a novel core simulation model, which combined analytical and sampled simulations. The core simulator presents 11.36 to 44.31 MIPS performance, while the simulation error is approximately 8 percent. The standalone core simulator is released as an open-source. We confirmed that NoC simulation has a great effect on the reliability of outputs generated from many-core simulation. First, existing flit-level NoC simulators were analyzed at source-code level. Based on the observations, various implementations were evaluated and various software optimizations was applied to improve the network simulation performance. The proposed NoC simulator presents more than 100KCycles/s performance unless the packet injection rate exceeds 0.00625, which is two times faster than state-of-the-arts NoC simulator at least. The speed of the simulation backplane depends greatly on the IPC overhead and SystemC scheduling overhead. To reduce the IPC overhead, the trace-driven co-simulation technique is used, faster IPC is introduced, and the segmented L1 data cache is embedded in a core simulator. In addition, to reduce SystemC scheduling overhead, it is important to reduce the number of modules that are simultaneously awakened. To this end, slave modules are redesigned to be activated only based on an event. A new scheduler parallelization technique is also studied. Although the newly developed SystemC parallel scheduler showed good performance under limited conditions, we also confirmed that no performance improvement was found in the TLM level many-core simulator developed in this thesis. While the proposed many-core simulator uses the conservative synchronization technique which is free from causality errors and performs an accurate flit-level NoC simulation, the simulation performance is still acceptable, thanks to parallelism and optimizations. Additionally, the simulator is highly scalable to add other modules because the simulation backplane is developed to be compatible with SystemC TLM 2.0 standard. Although extensive experiments on accuracy are not conducted, it will be complemented when a detailed specification of the target architecture is given. This dissertation can be a reference to the development of a many-core simulator, which will be more essential in the future.Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Contribution 4 1.3 Dissertation Organization 5 Chapter 2 Background and Existing Research 6 2.1 Terminologies 6 2.1.1 Simulation Host / Simulation Target 6 2.1.2 Simulated Time / Simulation Time 2.1.3 User-level Simulation / Full-system Simulation 7 2.1.4 Execution-driven Simulation / Trace-driven Simulation 7 2.2 State-of-the-arts Many-core Simulators 8 2.2.1 Gem5 8 2.2.2 Marss 9 2.2.3 Sniper 9 2.2.4 Zsim 9 2.2.5 Manifold 10 2.2.6 Hornet 10 2.2.7 Summary 11 2.3 Host and Target Architecture 12 Chapter 3 Core Simulation 14 3.1 Overview 14 3.2 Related Works 16 3.2.1 Timing Models 16 3.2.2 Analytical Model: Interval Simulation 19 3.3 Sampling Mechanism 23 3.3.1 Sampling Configuration 24 3.3.2 Parameter Extraction 24 3.4 Trace Analyzer 27 3.4.1 Dependency Analysis 29 3.4.2 Life Cycle of An Instruction 31 3.5 Experimental Results 32 3.5.1 Time-accuracy Trade-off 34 3.5.2 Simulation Accuracy 37 3.5.3 Simulation Performance 41 3.6 Discussion 42 Chapter 4 NoC Simulation 45 4.1 Network-on-chip 45 4.2 Motivation 46 4.3 Related Works 48 4.3.1 Noxim 49 4.3.2 Booksim2 50 4.3.3 Garnet 51 4.4 Proposed Approach 51 4.4.1 Implementations 51 4.4.2 Optimizations 54 4.5 Experimental Results 56 4.5.1 Impact of Implementations and Optimizations 56 4.5.2 Comparison with Other State-Of-The-Arts 58 4.5.3 Performance Evaluation For Various Configurations 59 4.5.4 Full-System Simulation Accuracy Impact 59 4.5.5 Accuracy 61 4.6 Discussion 61 Chapter 5 Simulation Backplane 63 5.1 Overview 63 5.2 Background 65 5.2.1 SystemC 65 5.2.2 OSCI Transaction Level Modeling Standard 2.0 66 5.2.3 Synchronization Techniques 67 5.3 SystemC Models for the Target Architecture 69 5.4 Reducing the Cost of Interprocess Communications 71 5.4.1 Trace-driven Co-simulation 71 5.4.2 Better Interprocess Communication 73 5.4.3 Virtually embedding modules to core simulator 74 5.5 Reducing SystemC Scheduling Overhead 76 5.5.1 Event-based Slave Module Activation 76 5.5.2 SystemC Scheduler Parallelization 78 5.6 Evaluation 79 5.6.1 Scalability Test 79 5.6.2 Simulation Performance 79 5.6.3 Simulation Accuracy 80 Chapter 6 Simulation Backplane Parallelization 81 6.1 Background: OSCI SystemC Scheduler 81 6.2 Related Work: SystemC Parallelization Techniques 82 6.2.1 Fully-synchronous Approach 82 6.2.2 Parallel Distributed Event Scheduling (PDES) Approach 82 6.2.3 Out-of-order Execution with Dependency Analysis 83 6.2.4 Dynamic Offloading Approach 84 6.3 Proposed Technique 84 6.3.1 Basic Synchronization 85 6.3.2 Relaxed Synchronization 86 6.3.3 Modeling Restrictions 88 6.4 Experimental Results 89 6.4.1 Performance 90 6.4.2 Accuracy 92 6.5 Discussion and Limitation 93 Chapter 7 Conclusion 95 Bibliography 97 요약 107Docto

    Similar works