200 research outputs found
At the Locus of Performance: A Case Study in Enhancing CPUs with Copious 3D-Stacked Cache
Over the last three decades, innovations in the memory subsystem were
primarily targeted at overcoming the data movement bottleneck. In this paper,
we focus on a specific market trend in memory technology: 3D-stacked memory and
caches. We investigate the impact of extending the on-chip memory capabilities
in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we
propose a method oblivious to the memory subsystem to gauge the upper-bound in
performance improvements when data movement costs are eliminated. Then, using
the gem5 simulator, we model two variants of LARC, a processor fabricated in
1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of
experiments involving a board set of proxy-applications and benchmarks, we aim
to reveal where HPC CPU performance could be circa 2028, and conclude an
average boost of 9.77x for cache-sensitive HPC applications, on a per-chip
basis. Additionally, we exhaustively document our methodological exploration to
motivate HPC centers to drive their own technological agenda through enhanced
co-design
The Brain on Low Power Architectures - Efficient Simulation of Cortical Slow Waves and Asynchronous States
Efficient brain simulation is a scientific grand challenge, a
parallel/distributed coding challenge and a source of requirements and
suggestions for future computing architectures. Indeed, the human brain
includes about 10^15 synapses and 10^11 neurons activated at a mean rate of
several Hz. Full brain simulation poses Exascale challenges even if simulated
at the highest abstraction level. The WaveScalES experiment in the Human Brain
Project (HBP) has the goal of matching experimental measures and simulations of
slow waves during deep-sleep and anesthesia and the transition to other brain
states. The focus is the development of dedicated large-scale
parallel/distributed simulation technologies. The ExaNeSt project designs an
ARM-based, low-power HPC architecture scalable to million of cores, developing
a dedicated scalable interconnect system, and SWA/AW simulations are included
among the driving benchmarks. At the joint between both projects is the INFN
proprietary Distributed and Plastic Spiking Neural Networks (DPSNN) simulation
engine. DPSNN can be configured to stress either the networking or the
computation features available on the execution platforms. The simulation
stresses the networking component when the neural net - composed by a
relatively low number of neurons, each one projecting thousands of synapses -
is distributed over a large number of hardware cores. When growing the number
of neurons per core, the computation starts to be the dominating component for
short range connections. This paper reports about preliminary performance
results obtained on an ARM-based HPC prototype developed in the framework of
the ExaNeSt project. Furthermore, a comparison is given of instantaneous power,
total energy consumption, execution time and energetic cost per synaptic event
of SWA/AW DPSNN simulations when executed on either ARM- or Intel-based server
platforms
Multilevel simulation-based co-design of next generation HPC microprocessors
This paper demonstrates the combined use of three simulation tools in support of a co-design methodology for an HPC-focused System-on-a-Chip (SoC) design. The simulation tools make different trade-offs between simulation speed, accuracy and model abstraction level, and are shown to be complementary. We apply the MUSA trace-based simulator for the initial sizing of vector register length, system-level cache (SLC) size and memory bandwidth. It has proven to be very efficient at pruning the design space, as its models enable sufficient accuracy without having to resort to highly detailed simulations. Then we apply gem5, a cycle-accurate micro-architecture simulator, for a more refined analysis of the performance potential of our reference SoC architecture, with models able to capture detailed hardware behavior at the cost of simulation speed. Furthermore, we study the network-on-chip (NoC) topology and IP placements using both gem5 for representative small- to medium-scale configurations and SESAM/VPSim, a transaction-level emulator for larger scale systems with good simulation speed and sufficient architectural details. Overall, we consider several system design concerns, such as processor subsystem sizing and NoC settings. We apply the selected simulation tools, focusing on different levels of abstraction, to study several configurations with various design concerns and evaluate them to guide architectural design and optimization decisions. Performance analysis is carried out with a number of representative benchmarks. The obtained numerical results provide guidance and hints to designers regarding SIMD instruction width, SLC sizing, memory bandwidth as well as the best placement of memory controllers and NoC form factor. Thus, we provide critical insights for efficient design of future HPC microprocessors.This work has been performed in the context of the European Processor Initiative (EPI) project, which has received
funding from the European Unionâs Horizon 2020 research
and innovation program under Grant Agreement â 826647.
A special thanks to Amir Charif and Arief Wicaksana for
their invaluable contributions to the SESAM/VPSim tool in
the initial phases of the EPI project.Peer ReviewedPostprint (author's final draft
Real-time cortical simulations: energy and interconnect scaling on distributed systems
We profile the impact of computation and inter-processor communication on the
energy consumption and on the scaling of cortical simulations approaching the
real-time regime on distributed computing platforms. Also, the speed and energy
consumption of processor architectures typical of standard HPC and embedded
platforms are compared. We demonstrate the importance of the design of
low-latency interconnect for speed and energy consumption. The cost of cortical
simulations is quantified using the Joule per synaptic event metric on both
architectures. Reaching efficient real-time on large scale cortical simulations
is of increasing relevance for both future bio-inspired artificial intelligence
applications and for understanding the cognitive functions of the brain, a
scientific quest that will require to embed large scale simulations into highly
complex virtual or real worlds. This work stands at the crossroads between the
WaveScalES experiment in the Human Brain Project (HBP), which includes the
objective of large scale thalamo-cortical simulations of brain states and their
transitions, and the ExaNeSt and EuroExa projects, that investigate the design
of an ARM-based, low-power High Performance Computing (HPC) architecture with a
dedicated interconnect scalable to million of cores; simulation of deep sleep
Slow Wave Activity (SWA) and Asynchronous aWake (AW) regimes expressed by
thalamo-cortical models are among their benchmarks.Comment: 8 pages, 8 figures, 4 tables, submitted after final publication on
PDP2019 proceedings, corrected final DOI. arXiv admin note: text overlap with
arXiv:1812.04974, arXiv:1804.0344
Investigating applications on the A64FX
The A64FX processor from Fujitsu, being designed for computational simulation
and machine learning applications, has the potential for unprecedented
performance in HPC systems. In this paper, we evaluate the A64FX by
benchmarking against a range of production HPC platforms that cover a number of
processor technologies. We investigate the performance of complex scientific
applications across multiple nodes, as well as single node and mini-kernel
benchmarks. This paper finds that the performance of the A64FX processor across
our chosen benchmarks often significantly exceeds other platforms, even without
specific application optimisations for the processor instruction set or
hardware. However, this is not true for all the benchmarks we have undertaken.
Furthermore, the specific configuration of applications can have an impact on
the runtime and performance experienced
Simulation of networks of spiking neurons: A review of tools and strategies
We review different aspects of the simulation of spiking neural networks. We
start by reviewing the different types of simulation strategies and algorithms
that are currently implemented. We next review the precision of those
simulation strategies, in particular in cases where plasticity depends on the
exact timing of the spikes. We overview different simulators and simulation
environments presently available (restricted to those freely available, open
source and documented). For each simulation tool, its advantages and pitfalls
are reviewed, with an aim to allow the reader to identify which simulator is
appropriate for a given task. Finally, we provide a series of benchmark
simulations of different types of networks of spiking neurons, including
Hodgkin-Huxley type, integrate-and-fire models, interacting with current-based
or conductance-based synapses, using clock-driven or event-driven integration
strategies. The same set of models are implemented on the different simulators,
and the codes are made available. The ultimate goal of this review is to
provide a resource to facilitate identifying the appropriate integration
strategy and simulation tool to use for a given modeling problem related to
spiking neural networks.Comment: 49 pages, 24 figures, 1 table; review article, Journal of
Computational Neuroscience, in press (2007
PCSIM: A Parallel Simulation Environment for Neural Circuits Fully Integrated with Python
The Parallel Circuit SIMulator (PCSIM) is a software package for simulation of neural circuits. It is primarily designed for distributed simulation of large scale networks of spiking point neurons. Although its computational core is written in C++, PCSIM's primary interface is implemented in the Python programming language, which is a powerful programming environment and allows the user to easily integrate the neural circuit simulator with data analysis and visualization tools to manage the full neural modeling life cycle. The main focus of this paper is to describe PCSIM's full integration into Python and the benefits thereof. In particular we will investigate how the automatically generated bidirectional interface and PCSIM's object-oriented modular framework enable the user to adopt a hybrid modeling approach: using and extending PCSIM's functionality either employing pure Python or C++ and thus combining the advantages of both worlds. Furthermore, we describe several supplementary PCSIM packages written in pure Python and tailored towards setting up and analyzing neural simulations
Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer
For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8â128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·1010 connections (K is 1024, M is 10242, and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method either implemented via non-blocking MPI_Isend, or exploiting the possibility of very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was that using MPI_Isend due to the high overhead of initiating a spike communication. The two best performing methodsâthe persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework DCMF_Multicast; and a two-phase multisend in which a DCMF_Multicast is used to first send to a subset of phase one destination cores, which then pass it on to their subset of phase two destination coresâhad similar performance with very low overhead for the initiation of spike communication. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in number of cells that fire on each processor in the interval between synchronization. Spike exchange time itself is negligible since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will be ultimately limited by imbalance between incoming processor spikes between synchronization intervals. Thus, counterintuitively, maximization of load balance requires that the distribution of cells on processors should not reflect neural net architecture but be randomly distributed so that sets of cells which are burst firing together should be on different processors with their targets on as large a set of processors as possible
- âŠ