487 research outputs found

    Prototyping Methodologies and Design of Communication-centric Heterogeneous Many-core Architectures

    Get PDF

    매니코어 NoC 아키텍처에 대한 고속 사이클-근사 시뮬레이션 기법

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2017. 2. 하순회.Simulation is a software technique that uses the current available architecture to prototype a future architecture. In computer architecture research, simulation techniques are one of the most important skills. Simulation techniques enable us to obtain important performance indicators of new architectures and to perform the design space exploration using these metrics. Furthermore, the simulator enables rapid software development and optimization on the architecture that does not exist. Despite various known problems, such as slow speed or coverage issue, the reliance on simulation technology in computer architecture research continues to increase. As the density of transistor increases and the performance improvement of the single core hits the ceiling, the newly constructed architectures usually consist of multi/many cores with the network-on-chip, which enables scalable communications. In addition, the implementation of the application itself has also been complicated to effectively utilize these parallel architectures. Thus, simulators for parallel architectures and parallel applications have become extremely complex, and existing sequential simulators no longer simulate these systems at a realistic time. While many of parallel simulation techniques are being developed to solve these problems, they suffer from poor simulation performance or accuracy. In this thesis, we propose and evaluate a novel many-core simulation technique that can obtain the best simulation performance at the cost of minimum simulation error. The proposed parallel many-core simulator is divided into three parts: 1) core simulator, 2) network-on-chip simulator, and 3) simulation backplane. Each core is executed by a core simulator, which communicates with the external simulation backplane via the Interprocess Communication (IPC). Each core simulation is performed individually in a separate host processor. The simulation backplane arranges messages from each core into chronological order, passes them to destination modules, and simulates hardware components other than cores. If the simulation backplane generates a request requiring NoC communication, this request is forwarded to the network simulator and is simulated at the most accurate accuracy level. In this thesis, we proposed a novel core simulation model, which combined analytical and sampled simulations. The core simulator presents 11.36 to 44.31 MIPS performance, while the simulation error is approximately 8 percent. The standalone core simulator is released as an open-source. We confirmed that NoC simulation has a great effect on the reliability of outputs generated from many-core simulation. First, existing flit-level NoC simulators were analyzed at source-code level. Based on the observations, various implementations were evaluated and various software optimizations was applied to improve the network simulation performance. The proposed NoC simulator presents more than 100KCycles/s performance unless the packet injection rate exceeds 0.00625, which is two times faster than state-of-the-arts NoC simulator at least. The speed of the simulation backplane depends greatly on the IPC overhead and SystemC scheduling overhead. To reduce the IPC overhead, the trace-driven co-simulation technique is used, faster IPC is introduced, and the segmented L1 data cache is embedded in a core simulator. In addition, to reduce SystemC scheduling overhead, it is important to reduce the number of modules that are simultaneously awakened. To this end, slave modules are redesigned to be activated only based on an event. A new scheduler parallelization technique is also studied. Although the newly developed SystemC parallel scheduler showed good performance under limited conditions, we also confirmed that no performance improvement was found in the TLM level many-core simulator developed in this thesis. While the proposed many-core simulator uses the conservative synchronization technique which is free from causality errors and performs an accurate flit-level NoC simulation, the simulation performance is still acceptable, thanks to parallelism and optimizations. Additionally, the simulator is highly scalable to add other modules because the simulation backplane is developed to be compatible with SystemC TLM 2.0 standard. Although extensive experiments on accuracy are not conducted, it will be complemented when a detailed specification of the target architecture is given. This dissertation can be a reference to the development of a many-core simulator, which will be more essential in the future.Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Contribution 4 1.3 Dissertation Organization 5 Chapter 2 Background and Existing Research 6 2.1 Terminologies 6 2.1.1 Simulation Host / Simulation Target 6 2.1.2 Simulated Time / Simulation Time 2.1.3 User-level Simulation / Full-system Simulation 7 2.1.4 Execution-driven Simulation / Trace-driven Simulation 7 2.2 State-of-the-arts Many-core Simulators 8 2.2.1 Gem5 8 2.2.2 Marss 9 2.2.3 Sniper 9 2.2.4 Zsim 9 2.2.5 Manifold 10 2.2.6 Hornet 10 2.2.7 Summary 11 2.3 Host and Target Architecture 12 Chapter 3 Core Simulation 14 3.1 Overview 14 3.2 Related Works 16 3.2.1 Timing Models 16 3.2.2 Analytical Model: Interval Simulation 19 3.3 Sampling Mechanism 23 3.3.1 Sampling Configuration 24 3.3.2 Parameter Extraction 24 3.4 Trace Analyzer 27 3.4.1 Dependency Analysis 29 3.4.2 Life Cycle of An Instruction 31 3.5 Experimental Results 32 3.5.1 Time-accuracy Trade-off 34 3.5.2 Simulation Accuracy 37 3.5.3 Simulation Performance 41 3.6 Discussion 42 Chapter 4 NoC Simulation 45 4.1 Network-on-chip 45 4.2 Motivation 46 4.3 Related Works 48 4.3.1 Noxim 49 4.3.2 Booksim2 50 4.3.3 Garnet 51 4.4 Proposed Approach 51 4.4.1 Implementations 51 4.4.2 Optimizations 54 4.5 Experimental Results 56 4.5.1 Impact of Implementations and Optimizations 56 4.5.2 Comparison with Other State-Of-The-Arts 58 4.5.3 Performance Evaluation For Various Configurations 59 4.5.4 Full-System Simulation Accuracy Impact 59 4.5.5 Accuracy 61 4.6 Discussion 61 Chapter 5 Simulation Backplane 63 5.1 Overview 63 5.2 Background 65 5.2.1 SystemC 65 5.2.2 OSCI Transaction Level Modeling Standard 2.0 66 5.2.3 Synchronization Techniques 67 5.3 SystemC Models for the Target Architecture 69 5.4 Reducing the Cost of Interprocess Communications 71 5.4.1 Trace-driven Co-simulation 71 5.4.2 Better Interprocess Communication 73 5.4.3 Virtually embedding modules to core simulator 74 5.5 Reducing SystemC Scheduling Overhead 76 5.5.1 Event-based Slave Module Activation 76 5.5.2 SystemC Scheduler Parallelization 78 5.6 Evaluation 79 5.6.1 Scalability Test 79 5.6.2 Simulation Performance 79 5.6.3 Simulation Accuracy 80 Chapter 6 Simulation Backplane Parallelization 81 6.1 Background: OSCI SystemC Scheduler 81 6.2 Related Work: SystemC Parallelization Techniques 82 6.2.1 Fully-synchronous Approach 82 6.2.2 Parallel Distributed Event Scheduling (PDES) Approach 82 6.2.3 Out-of-order Execution with Dependency Analysis 83 6.2.4 Dynamic Offloading Approach 84 6.3 Proposed Technique 84 6.3.1 Basic Synchronization 85 6.3.2 Relaxed Synchronization 86 6.3.3 Modeling Restrictions 88 6.4 Experimental Results 89 6.4.1 Performance 90 6.4.2 Accuracy 92 6.5 Discussion and Limitation 93 Chapter 7 Conclusion 95 Bibliography 97 요약 107Docto

    SimuBoost: Scalable Parallelization of Functional System Simulation

    Get PDF
    The limited execution speed of current full system simulators restricts their applicability for dynamic analysis to shortrunning workloads. When analyzing memory contents while simulating a kernel build with Simics, we encountered slowdowns of more than 5000x resulting in 10months of total simulation time. Prior work improved the simulation speed by simulating virtual CPU cores on separate physical CPU cores simultaneously or by applying sampling and extrapolation methods to focus costly analyses on short execution windows. However, these approaches inherently su er from limited scalability or trading accuracy for speed. SimuBoost is a novel idea to parallelize functional full system simulation of single-cores. Our approach takes advantage of fast execution through virtualization, taking checkpoints in regular intervals. The parts between subsequent checkpoints are then simulated and analyzed simultaneously in one job per interval. By transferring jobs to multiple nodes, a parallelized and distributed simulation of the target workload can be achieved, thereby e ectively reducing the overall required simulation time. As no implementation of SimuBoost exists yet, we present a formal model to evaluate the general speedup and scalability characteristics of our acceleration technique. We moreover provide a model to estimate the required number of simulation nodes for optimal performance. According to this model, our approach can speed up conventional simulation in a realistic scenario by a factor of 84, while delivering a parallelization efficiency of 94%

    SIMULATION OF MANYCORE ARCHITECTURES ON MULTICORE HOSTS

    Get PDF
    Computer architects heavily rely on software simulation to evaluate new and existing processor designs. As target designs become more complex, a growing gap has emerged between single-threaded simulator performance and simulation requirements. Even though modern machines feature multiple cores, most host cores are typically unused or underutilized by state-of-the-art simulators. Parallel simulators are inherently limited by their need to synchronize threads for correctness. In my thesis, I study accurate and efficient parallelization techniques for architecture simulation. This thesis contains several contributions. First, I study synchronization between simulator threads simulating homogeneous hardware structures such as cores or network tiles. Based on this study, I introduce a new synchronization policy, weighted-tuple synchronization, and show that it provides a better performance-accuracy trade-off compared to synchronization currently used by state-of-the-art parallel simulators. Next, I study synchronization between separate simulators responsible for modeling heterogeneous components and introduce reciprocal abstraction. Reciprocal abstraction allows asynchronous simulators to exchange information at runtime for more accurate event timing. Lastly, the reciprocal abstraction model relaxes communication latency restrictions and synchronization requirements; I show how relaxed synchronization requirements allows for coprocessor acceleration

    Kiihdytetyn laskennan ajoituksen ja energiankulutuksen simulointi

    Get PDF
    As the increase in the sequential processing performance of general-purpose central processing units has slowed down dramatically, computer systems have been moving towards increasingly parallel and heterogeneous architectures. Modern graphics processing units have emerged as one of the first affordable platforms for data-parallel processing. Due to their closed nature, it has been difficult for software developers to observe the performance and energy efficiency characteristics of the execution of applications of graphics processing units. In this thesis, we have explored different tools and methods for observing the execution of accelerated processing on graphics processing units. We have found that hardware vendors provide interfaces for observing the timing of events that occur on the host platform and aggregated performance metrics of execution on the graphics processing units to some extent. However, more fine-grained details of execution are currently available only by using graphics processing unit simulators. As a proof-of-concept, we have studied a functional graphics processing unit simulator as a tool for understanding the energy efficiency of accelerated processing. The presented energy estimation model and simulation method has been validated against a face detection application. The difference between the estimated and measured dynamic energy consumption in this case was found to be 5.4%. Functional simulators appear to be accurate enough to be used for observing the energy efficiency of graphics processing unit accelerated processing in certain use-cases.Suorittimien sarjallisen suorituskyvyn kasvun hidastuessa tietokonejärjestelmät ovat siirtymässä kohti rinnakkaislaskentaa ja heterogeenisia arkkitehtuureja. Modernit grafiikkasuorittimet ovat yleistyneet ensimmäisinä huokeina alustoina yleisluonteisen kiihdytetyn datarinnakkaisen laskennan suorittamiseen. Grafiikkasuorittimet ovat usein suljettuja alustoja, minkä takia ohjelmistokehittäjien on vaikea havainnoida tarkempia yksityiskohtia suorituksesta liittyen laskennan suorituskykyyn ja energian kulutukseen. Tässä työssä on tutkittu erilaisia työkaluja ja tapoja tarkkailla ohjelmien kiihdytettyä suoritusta grafiikkasuorittimilla. Laitevalmistajat tarjoavat joitakin rajapintoja tapahtumien ajoituksen havainnointiin sekä isäntäalustalla että grafiikkasuorittimella. Laskennan tarkempaan havainnointiin on kuitenkin usein käytettävä grafiikkasuoritinsimulaattoreita. Työn kokeellisessa osuudessa työssä on tutkittu funktionaalisten grafiikkasuoritinsimulaattoreiden käyttöä työkaluna grafiikkasuorittimella kiihdytetyn laskennan energiantehokkuuden arvioinnissa. Työssä on malli grafiikkasuorittimen energian kulutuksen arviontiin. Arvion validointiin on käytetty kasvontunnistussovellusta. Mittauksissa arvioidun ja mitatun energian kulutuksen eroksi mitattiin 5.4%. Funktionaaliset simulaattorit ovat mittaustemme perusteella tietyissä käyttötarkoituksissa tarpeeksi tarkkoja grafiikkasuorittimella kiihdytetyn laskennan energiatehokkuuden arviointiin

    Extensive evaluation of programming models and ISAs impact on multicore soft error reliability

    Get PDF
    To take advantage of the performance enhancements provided by multicore processors, new instruction set architectures (ISAs) and parallel programming libraries have been investigated across multiple industrial segments. It is investigated the impact of parallelization libraries and distinct ISAs on the soft error reliability of two multicore ARM processor models (i.e., Cortex-A9 and CortexA72), running Linux Kernel and benchmarks with up to 87 billion instructions. An extensive soft error evaluation with more than 1.2 million simulation hours, considering ARMv7 and ARMv8 ISAs and the NAS Parallel Benchmark (NPB) suite is presented

    Ultra-Fast and Accurate Simulation for Large-Scale Many-Core Processors

    Get PDF
    The proposed methods enable ultra-fast and accurate emulations of large-scale NoC architectures with up to thousands of nodes using only on-chip resources of a single FPGA. The evaluation results show that more than 5,000× simulation speedup over BookSim, one of the most widely used software-based NoC simulators, is achieved while the simulation accuracy is maintained. i

    Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks

    Get PDF
    Simulators are very important in computer architecture research as they enable the exploration of new architectures to obtain detailed performance evaluation without building costly physical hardware. Simulation is even more critical to study future many-core architectures as it provides the opportunity to assess currently non-existing computer systems. In this thesis, a multiprocessor simulator is presented based on a cycle accurate architecture simulator called SESC. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache coherency protocol. A mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. While these efforts complete an extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. A novel non-uniform fat-mesh network structure similar to the idea of fat-tree is proposed. This non-uniform mesh network takes advantage of the average traffic pattern, typically all-to-all in DSC, to dedicate additional links for connections with heavy traffic (e.g., near the center) and fewer links for lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms. Analytical fat-mesh models are constructed by presenting the expressions for the traffic requirements of personalized all-to-all traffic. Performance improvements over the uniform mesh are demonstrated in the results from the simulator. A hybrid network consisting of one packet switching plane and multiple circuit switching planes is constructed as the second enhancement. The circuit switching planes provide fast paths between neighbors with heavy communication traffic. A compiler technique that abstracts the symbolic expressions of benchmarks' communication patterns can be used to help facilitate the circuit establishment
    corecore