4,436 research outputs found
Interval simulation: raising the level of abstraction in architectural simulation
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner This paper proposes interval simulation which rakes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the MS multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs
PyCARL: A PyNN Interface for Hardware-Software Co-Simulation of Spiking Neural Network
We present PyCARL, a PyNN-based common Python programming interface for
hardware-software co-simulation of spiking neural network (SNN). Through
PyCARL, we make the following two key contributions. First, we provide an
interface of PyNN to CARLsim, a computationally-efficient, GPU-accelerated and
biophysically-detailed SNN simulator. PyCARL facilitates joint development of
machine learning models and code sharing between CARLsim and PyNN users,
promoting an integrated and larger neuromorphic community. Second, we integrate
cycle-accurate models of state-of-the-art neuromorphic hardware such as
TrueNorth, Loihi, and DynapSE in PyCARL, to accurately model hardware latencies
that delay spikes between communicating neurons and degrade performance. PyCARL
allows users to analyze and optimize the performance difference between
software-only simulation and hardware-software co-simulation of their machine
learning models. We show that system designers can also use PyCARL to perform
design-space exploration early in the product development stage, facilitating
faster time-to-deployment of neuromorphic products. We evaluate the memory
usage and simulation time of PyCARL using functionality tests, synthetic SNNs,
and realistic applications. Our results demonstrate that for large SNNs, PyCARL
does not lead to any significant overhead compared to CARLsim. We also use
PyCARL to analyze these SNNs for a state-of-the-art neuromorphic hardware and
demonstrate a significant performance deviation from software-only simulations.
PyCARL allows to evaluate and minimize such differences early during model
development.Comment: 10 pages, 25 figures. Accepted for publication at International Joint
Conference on Neural Networks (IJCNN) 202
Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems faces virtual memory as the
first big showstopper: without efficient hardware support for address
translation MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that while limiting the associativity of the
virtual-to-physical mapping incurs no penalty, it can break the
translate-then-fetch serialization if combined with careful data placement in
the MPU's memory, allowing for translation and data fetch to proceed
independently and in parallel. We propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure
Asynchronous spiking neurons, the natural key to exploit temporal sparsity
Inference of Deep Neural Networks for stream signal (Video/Audio) processing in edge devices is still challenging. Unlike the most state of the art inference engines which are efficient for static signals, our brain is optimized for real-time dynamic signal processing. We believe one important feature of the brain (asynchronous state-full processing) is the key to its excellence in this domain. In this work, we show how asynchronous processing with state-full neurons allows exploitation of the existing sparsity in natural signals. This paper explains three different types of sparsity and proposes an inference algorithm which exploits all types of sparsities in the execution of already trained networks. Our experiments in three different applications (Handwritten digit recognition, Autonomous Steering and Hand-Gesture recognition) show that this model of inference reduces the number of required operations for sparse input data by a factor of one to two orders of magnitudes. Additionally, due to fully asynchronous processing this type of inference can be run on fully distributed and scalable neuromorphic hardware platforms
- …