    On the acceleration of wavefront applications using distributed many-core architectures

    In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used in a range of scientific and engineering codes. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C, and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions far exceeds that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithms on GPU-based architectures.
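
    The dependency pattern that makes these codes "wavefront" is easy to sketch. The toy Python below is not the LU solver itself, and the update arithmetic is invented for illustration; it shows only the diagonal ordering on a 2D grid, with a note on where a k-blocking strategy would apply:

        import numpy as np

        def wavefront_sweep(a):
            # Cell (i, j) depends on its north and west neighbours, so all
            # cells on an anti-diagonal d = i + j are independent and can be
            # computed in parallel (e.g. one kernel launch per diagonal).
            n, m = a.shape
            for d in range(2, n + m - 1):              # hyperplane index
                for i in range(max(1, d - m + 1), min(n - 1, d - 1) + 1):
                    j = d - i                          # parallel across the diagonal
                    a[i, j] += 0.5 * (a[i - 1, j] + a[i, j - 1])
            return a

        # In 3D (as in LU) the same idea applies to hyperplanes i + j + k = d;
        # a k-blocking strategy batches several k-planes per kernel launch,
        # amortising launch and PCIe transfer overheads.
        grid = wavefront_sweep(np.ones((8, 8)))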

    Multi-port Memory Design for Advanced Computer Architectures

    In this thesis, we describe and evaluate novel memory designs for multi-port on-chip and off-chip use in advanced computer architectures. We focus on combining multi-porting with an evaluation of performance over a range of design parameters. Multi-porting is essential for caches and shared-data systems, especially multi-core Systems-on-Chip (SoCs), and can significantly increase memory access throughput. We evaluate FinFET voltage-mode multi-port SRAM cells using different metrics, including leakage current, static noise margin and read/write performance. Simulation results show that single-ended multi-port FinFET SRAMs with isolated read ports offer improved read stability and flexibility over classical double-ended structures, at the expense of write performance. By increasing the size of the access transistors, we show that the single-ended multi-port structures can achieve write performance equivalent to the classical double-ended multi-port structure at a 9% area overhead. Moreover, compared with CMOS SRAM, FinFET SRAM has better stability and lower standby power. We also describe new methods for the design of FinFET current-mode multi-port SRAM cells. Current-mode SRAMs avoid the full swing of the bitline, reducing dynamic power and access time; however, this comes at the cost of a voltage drop, which compromises stability. The design proposed in this thesis exploits the Independent-Gate (IG) mode of the FinFET, in which the back-gate voltage can be used to adjust the threshold voltage, to merge two transistors into one by combining high-Vt and low-Vt devices. This design not only reduces the voltage drop but also reduces the area of multi-port current-mode SRAM designs. For off-chip memory, we propose a novel two-port 1-read, 1-write (1R1W) phase-change memory (PCM) cell, which significantly reduces the probability of blocking at the bank level. Unlike the traditional PCM cell, the access transistors are at the top and connected to the bitline. We use Verilog-A to model the behavior of Ge2Sb2Te5 (GST), the storage component. We evaluate the performance of the two-port cell through transistor sizing and voltage pumping. Simulation results show that a pMOS transistor is more practical than an nMOS transistor as the access device when both area and power are considered. The estimated area overhead is 1.7× compared to a single-port PCM cell. In brief, this thesis proposes and evaluates three kinds of multi-port memories that are favorable for advanced computer architectures.
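
    The thesis models the GST storage element in Verilog-A, which is not reproduced here; purely as an illustration of what such a behavioural model captures, below is a toy Python sketch of a state-dependent PCM resistance programmed by pulses. Every name and constant is a placeholder, not a fitted GST parameter:

        import math

        # Illustrative resistances only -- not fitted to real GST data.
        R_SET = 10e3     # ohms, fully crystalline (low-resistance) state
        R_RESET = 1e6    # ohms, fully amorphous (high-resistance) state

        class PcmCell:
            # Toy behavioural model: the state is the crystalline fraction
            # x in [0, 1]; resistance interpolates log-linearly between
            # the SET and RESET extremes.
            def __init__(self, x=1.0):
                self.x = x

            def resistance(self):
                return math.exp(self.x * math.log(R_SET)
                                + (1 - self.x) * math.log(R_RESET))

            def pulse(self, amplitude, width_ns):
                # Crude programming rule: a short, high-amplitude pulse
                # melt-quenches (RESET); a moderate pulse anneals (SET).
                if amplitude > 1.0 and width_ns < 50:
                    self.x = 0.0
                elif amplitude > 0.5:
                    self.x = min(1.0, self.x + width_ns / 500.0)

        cell = PcmCell()
        cell.pulse(amplitude=1.2, width_ns=20)   # RESET
        print(cell.resistance())                 # ~1e6 ohms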

    Towards Robust Deep Reinforcement Learning for Traffic Signal Control: Demand Surges, Incidents and Sensor Failures

    Reinforcement learning (RL) constitutes a promising solution for alleviating the problem of traffic congestion. In particular, deep RL algorithms have been shown to produce adaptive traffic signal controllers that outperform conventional systems. However, in order to be reliable in highly dynamic urban areas, such controllers need to be robust with respect to a series of exogenous sources of uncertainty. In this paper, we develop an open-source callback-based framework for promoting the flexible evaluation of different deep RL configurations within a traffic simulation environment. With this framework, we investigate how deep RL-based adaptive traffic controllers perform under different scenarios, namely demand surges caused by special events, capacity reductions from incidents, and sensor failures. We extract several key insights for the development of robust deep RL algorithms for traffic control and propose concrete designs to mitigate the impact of the considered exogenous uncertainties.
    Comment: 8 pages
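
    The authors' framework itself is not shown here; the sketch below only illustrates the general shape a callback-based evaluation harness can take. The env and agent objects are assumed interfaces (a Gym-style step loop whose observation maps detector IDs to readings), not the paper's API:

        class DemandSurge:
            # Scale vehicle arrivals inside a window (e.g. a special event).
            def __init__(self, start, end, factor):
                self.start, self.end, self.factor = start, end, factor
            def transform_demand(self, t, demand):
                return demand * self.factor if self.start <= t < self.end else demand
            def transform_obs(self, t, obs):
                return obs

        class SensorFailure:
            # Zero out one detector's readings after it fails.
            def __init__(self, detector, fail_at):
                self.detector, self.fail_at = detector, fail_at
            def transform_demand(self, t, demand):
                return demand
            def transform_obs(self, t, obs):
                return dict(obs, **{self.detector: 0.0}) if t >= self.fail_at else obs

        def evaluate(env, agent, callbacks, horizon=3600):
            # Roll out a trained controller while callbacks inject
            # exogenous disturbances into demand and observations.
            obs, total = env.reset(), 0.0
            for t in range(horizon):
                for cb in callbacks:
                    env.demand = cb.transform_demand(t, env.demand)
                    obs = cb.transform_obs(t, obs)
                obs, reward, done, _ = env.step(agent.act(obs))
                total += reward
                if done:
                    break
            return total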

    Analytic Performance Modeling and Analysis of Detailed Neuron Simulations

    Big science initiatives are trying to reconstruct and model the brain by attempting to simulate brain tissue at larger scales and with increasingly more biological detail than previously thought possible. The exponential growth of parallel computer performance has been supporting these developments, and at the same time maintainers of neuroscientific simulation code have strived to optimally and efficiently exploit new hardware features. Current state-of-the-art software for the simulation of biological networks has so far been developed using performance engineering practices, but a thorough analysis and modeling of the computational and performance characteristics, especially in the case of morphologically detailed neuron simulations, is lacking. Other computational sciences have successfully used analytic performance engineering and modeling methods to gain insight into the computational properties of simulation kernels, aid developers in performance optimizations and eventually drive co-design efforts, but to our knowledge a model-based performance analysis of neuron simulations has not yet been conducted. We present a detailed study of the shared-memory performance of morphologically detailed neuron simulations based on the Execution-Cache-Memory (ECM) performance model. We demonstrate that this model can deliver accurate predictions of the runtime of almost all the kernels that constitute the neuron models under investigation. The insight gained is used to identify the main governing mechanisms underlying performance bottlenecks in the simulation. The implications of this analysis for the optimization of neural simulation software, and eventually for the co-design of future hardware architectures, are discussed. In this sense, our work represents a valuable conceptual and quantitative contribution to understanding the performance properties of biological network simulations.
    Comment: 18 pages, 6 figures, 15 tables
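
    For readers unfamiliar with the ECM model, its core prediction reduces to a small formula; the Python sketch below shows it with invented cycle counts (the paper's actual per-kernel numbers are not reproduced here):

        def ecm_single_core(t_ol, t_nol, t_data):
            # Cycles per unit of work (e.g. one cache line of iterations).
            # t_ol: in-core cycles that overlap with data transfers (arithmetic),
            # t_nol: non-overlapping in-core cycles (loads/stores),
            # t_data: transfer cycles per level, e.g. [L1-L2, L2-L3, L3-Mem].
            # Assumes transfers overlap neither with each other nor with t_nol.
            return max(t_ol, t_nol + sum(t_data))

        def ecm_multi_core(t_ecm, t_mem, cores):
            # Linear scaling until the memory interface saturates at
            # ceil(t_ecm / t_mem) cores; returns work units per cycle.
            return min(cores / t_ecm, 1.0 / t_mem)

        # Hypothetical kernel: 4 overlapping cycles, 3 load/store cycles,
        # and 2 + 2 + 4 transfer cycles per cache line.
        t = ecm_single_core(4, 3, [2, 2, 4])      # -> 11 cycles
        print(t, ecm_multi_core(t, 4, cores=8))   # saturates at 3 cores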

    Piloting mixed reality in ICT networking to visualize complex theoretical multi-step problems

    This paper presents insights from the implementation of a mixed reality intervention using 3D-printed physical objects and a mobile augmented reality application in an ICT networking classroom. The intervention aims to assist student understanding of complex theoretical multi-step problems that lack a corresponding real-world physical analogue. This is important because such concepts are difficult to conceptualise without a supporting mental model. The simulation uses physical models to represent networking equipment: learners build a network from the models and then simulate it with a mobile app, observing the underlying packet traversal and routing behaviour between the devices as data travels from source to destination. Outcomes from usability testing show strong student interest in the intervention and a sense that it aided clarity, but also demonstrate the need to scaffold its use for students rather than providing a freeform experience in the classroom.
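
    The hop-by-hop traversal the app animates can be approximated in a few lines; the sketch below is a generic illustration with a hypothetical topology and plain shortest-path (BFS) forwarding, not the app's actual protocol models:

        from collections import deque

        # Toy topology assembled from the physical models.
        links = {
            "pc1": ["sw1"], "sw1": ["pc1", "r1"],
            "r1":  ["sw1", "r2"], "r2": ["r1", "sw2"],
            "sw2": ["r2", "pc2"], "pc2": ["sw2"],
        }

        def next_hop(src, dst):
            # BFS shortest path stands in for a real routing protocol.
            parent, queue = {src: None}, deque([src])
            while queue:
                node = queue.popleft()
                for nb in links[node]:
                    if nb not in parent:
                        parent[nb] = node
                        queue.append(nb)
            hop = dst
            while parent[hop] != src:
                hop = parent[hop]
            return hop

        def trace(src, dst):
            # Hop-by-hop packet traversal, as the AR app would animate it.
            path, node = [src], src
            while node != dst:
                node = next_hop(node, dst)
                path.append(node)
            return path

        print(trace("pc1", "pc2"))   # ['pc1', 'sw1', 'r1', 'r2', 'sw2', 'pc2']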

    4.45 Pflops Astrophysical N-Body Simulation on K computer -- The Gravitational Trillion-Body Problem

    As an entry for the 2012 Gordon Bell performance prize, we report performance results of astrophysical N-body simulations of one trillion particles performed on the full system of K computer. This is the first gravitational trillion-body simulation in the world. We describe the scientific motivation, the numerical algorithm, the parallelization strategy, and the performance analysis. Unlike many previous Gordon Bell prize winners that used the tree algorithm for astrophysical N-body simulations, we used the hybrid TreePM method, which achieves a similar level of accuracy: the short-range force is calculated by the tree algorithm, while the long-range force is solved by the particle-mesh algorithm. We developed a highly tuned gravity kernel for the short-range forces and a novel communication algorithm for the long-range forces. The average performances on 24,576 and 82,944 nodes of K computer are 1.53 and 4.45 Pflops, which correspond to 49% and 42% of the peak speed, respectively.
    Comment: 10 pages, 6 figures, Proceedings of Supercomputing 2012 (http://sc12.supercomputing.org/), Gordon Bell Prize Winner. Additional information at http://www.ccs.tsukuba.ac.jp/CCS/eng/gbp201
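
    The paper's tuned kernel is not shown here; the sketch below is only the textbook force-splitting idea behind TreePM, in Python for illustration (unit masses, G = 1, and r_s as the assumed split scale):

        from math import erfc, exp, pi, sqrt

        def split_pair_force(r, r_s):
            # Gaussian (Ewald-type) splitting: the tree sums the rapidly
            # decaying short-range part inside a cutoff, while the smooth
            # long-range remainder is solved on a mesh with FFTs.
            f_total = 1.0 / r**2
            short = (erfc(r / (2 * r_s))
                     + (r / (r_s * sqrt(pi))) * exp(-r**2 / (4 * r_s**2)))
            f_short = f_total * short    # -> f_total as r -> 0
            f_long = f_total - f_short   # dominates for r >> r_s
            return f_short, f_long

        print(split_pair_force(0.1, 1.0))   # almost entirely short-range
        print(split_pair_force(5.0, 1.0))   # almost entirely long-range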