Memory and information processing in neuromorphic systems
A striking difference between brain-inspired neuromorphic processors and
current von Neumann processor architectures is the way in which memory and
processing are organized. As Information and Communication Technologies continue
to address the need for increased computational power through the increase of
cores within a digital processor, neuromorphic engineers and scientists can
complement this need by building processor architectures where memory is
distributed with the processing. In this paper we present a survey of
brain-inspired processor architectures that support models of cortical networks
and deep neural networks. These architectures range from serial clocked
implementations of multi-neuron systems to massively parallel asynchronous ones
and from purely digital systems to mixed analog/digital systems which implement
more biological-like models of neurons and synapses together with a suite of
adaptation and learning mechanisms analogous to the ones found in biological
nervous systems. We describe the advantages of the different approaches being
pursued and present the challenges that need to be addressed for building
artificial neural processing systems that can display the richness of behaviors
seen in biological systems.
Comment: Submitted to Proceedings of IEEE, review of recently proposed
neuromorphic computing platforms and systems
Scaling of a large-scale simulation of synchronous slow-wave and asynchronous awake-like activity of a cortical model with long-range interconnections
Cortical synapse organization supports a range of dynamic states on multiple
spatial and temporal scales, from synchronous slow wave activity (SWA),
characteristic of deep sleep or anesthesia, to fluctuating, asynchronous
activity during wakefulness (AW). Such dynamic diversity poses a challenge for
producing efficient large-scale simulations that embody realistic metaphors of
short- and long-range synaptic connectivity. In fact, during SWA and AW
different spatial extents of the cortical tissue are active in a given timespan
and at different firing rates, which implies a wide variety of loads of local
computation and communication. A balanced evaluation of simulation performance
and robustness should therefore include tests of a variety of cortical dynamic
states. Here, we demonstrate performance scaling of our proprietary Distributed
and Plastic Spiking Neural Networks (DPSNN) simulation engine in both SWA and
AW for bidimensional grids of neural populations, which reflects the modular
organization of the cortex. We explored networks up to 192x192 modules, each
composed of 1250 integrate-and-fire neurons with spike-frequency adaptation,
and exponentially decaying inter-modular synaptic connectivity with varying
spatial decay constant. For the largest networks the total number of synapses
was over 70 billion. The execution platform included up to 64 dual-socket
nodes, each socket mounting 8 Intel Xeon Haswell processor cores @ 2.40GHz
clock rates. Network initialization time, memory usage, and execution time
showed good scaling performances from 1 to 1024 processes, implemented using
the standard Message Passing Interface (MPI) protocol. We achieved simulation
speeds of between 2.3x10^9 and 4.1x10^9 synaptic events per second for both
cortical states in the explored range of inter-modular interconnections.
Comment: 22 pages, 9 figures, 4 tables
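The figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation from the numbers stated above; the variable names are illustrative, and "over 70 billion" is treated as a lower bound, so the synapses-per-neuron value is approximate.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# All names are illustrative; only the numeric values come from the text.
GRID_SIDE = 192             # largest grid: 192 x 192 modules
NEURONS_PER_MODULE = 1250   # integrate-and-fire neurons per module
TOTAL_SYNAPSES = 70e9       # "over 70 billion", used here as a lower bound

NODES = 64                  # dual-socket nodes
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 8

modules = GRID_SIDE * GRID_SIDE                      # number of modules in the grid
neurons = modules * NEURONS_PER_MODULE               # total neurons in the largest network
synapses_per_neuron = TOTAL_SYNAPSES / neurons       # approximate lower bound
cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET  # physical cores on the full platform

print(f"modules: {modules}")
print(f"neurons: {neurons}")
print(f"synapses per neuron (at least): {synapses_per_neuron:.0f}")
print(f"cores: {cores}")
```

Under these assumptions the largest network has about 46 million neurons with roughly 1,500 synapses each, and the 1024 cores of the full platform match the upper end of the reported range of 1 to 1024 processes.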
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM--based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.
Comment: Published at 36th International Conference on Machine Learning (ICML)
2019
Investigation of LSTM Based Prediction for Dynamic Energy Management in Chip Multiprocessors
In this paper, we investigate the effectiveness of using long short-term memory (LSTM) instead of Kalman filtering as the predictor in dynamic energy management (DEM) algorithms for chip multi-processors (CMPs). Either of the two prediction methods is employed to estimate the workload in the next control period for each of the processor cores. These estimates are then used to select voltage-frequency (VF) pairs for each core of the CMP during the next control period as part of a dynamic voltage and frequency scaling (DVFS) technique. The objective of the DVFS technique is to reduce energy consumption under performance constraints that are set by the user. We conduct our investigation using a custom Sniper system simulation framework. Simulation results for 16- and 64-core network-on-chip based CMP architectures and using several benchmarks demonstrate that the LSTM is slightly better than Kalman filtering.
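The predict-then-select loop described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a scalar random-walk Kalman filter as the baseline predictor, a hypothetical table of VF pairs, and a simple rule that picks the lowest-frequency pair able to complete the predicted work within the control period. The names `kalman_predict_next`, `select_vf`, and `VF_PAIRS`, and the noise parameters `q` and `r`, are all assumptions.

```python
def kalman_predict_next(observations, q=1e-2, r=1e-1):
    """One-step-ahead workload prediction with a scalar random-walk Kalman filter.

    q and r are assumed process and measurement noise variances.
    """
    x, p = observations[0], 1.0          # initial state estimate and its variance
    for z in observations[1:]:
        p = p + q                        # predict: state unchanged, uncertainty grows
        k = p / (p + r)                  # Kalman gain
        x = x + k * (z - x)              # update the estimate with the new observation
        p = (1.0 - k) * p                # update the estimate variance
    return x                             # random-walk model: next value equals current estimate


# Hypothetical (voltage in V, frequency in GHz) pairs, sorted by frequency.
VF_PAIRS = [(0.8, 1.0), (0.9, 1.5), (1.0, 2.0), (1.1, 2.5)]


def select_vf(predicted_cycles, period_s, vf_pairs=VF_PAIRS):
    """Pick the lowest-frequency pair that can retire the predicted cycles in one period."""
    for volt, ghz in vf_pairs:
        if ghz * 1e9 * period_s >= predicted_cycles:
            return volt, ghz
    return vf_pairs[-1]                  # workload exceeds capacity: run at the highest pair


# Example: per-period cycle counts observed for one core, 1 ms control period.
history = [1.1e6, 1.2e6, 1.3e6, 1.2e6, 1.25e6]
prediction = kalman_predict_next(history)
print(select_vf(prediction, period_s=1e-3))
```

An LSTM-based predictor would replace `kalman_predict_next` while leaving the selection step unchanged, which is what makes the two methods directly comparable in such a setup.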
NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps
Convolutional neural networks (CNNs) have become the dominant neural network
architecture for solving many state-of-the-art (SOA) visual processing tasks.
Even though Graphical Processing Units (GPUs) are most often used in training
and deploying CNNs, their power efficiency is less than 10 GOp/s/W for
single-frame runtime inference. We propose a flexible and efficient CNN
accelerator architecture called NullHop that implements SOA CNNs useful for
low-power and low-latency application scenarios. NullHop exploits the sparsity
of neuron activations in CNNs to accelerate the computation and reduce memory
requirements. The flexible architecture allows high utilization of available
computing resources across kernel sizes ranging from 1x1 to 7x7. NullHop can
process up to 128 input and 128 output feature maps per layer in a single pass.
We implemented the proposed architecture on a Xilinx Zynq FPGA platform and
present results showing how our implementation reduces external memory
transfers and compute time in five different CNNs ranging from small ones up to
the widely known large VGG16 and VGG19 CNNs. Post-synthesis simulations using
Mentor Modelsim in a 28nm process with a clock frequency of 500 MHz show that
the VGG19 network achieves over 450 GOp/s. By exploiting sparsity, NullHop
achieves an efficiency of 368%, maintains over 98% utilization of the MAC
units, and achieves a power efficiency of over 3TOp/s/W in a core area of
6.3mm^2. As further proof of NullHop's usability, we interfaced its FPGA
implementation with a neuromorphic event camera for real time interactive
demonstrations.
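The idea of exploiting activation sparsity can be illustrated in software. The sketch below is an assumption-laden toy, not NullHop's hardware design: it assumes a feature map is stored as a binary mask plus a list of its non-zero values, and that a multiply-accumulate (MAC) is issued only for non-zero activations. The function names `encode_sparse` and `sparse_dot` are hypothetical.

```python
def encode_sparse(activations):
    """Encode activations as (binary mask, list of non-zero values)."""
    mask = [1 if a != 0 else 0 for a in activations]
    values = [a for a in activations if a != 0]
    return mask, values


def sparse_dot(mask, values, weights):
    """Dot product that issues a MAC only where the mask marks a non-zero activation."""
    acc = 0
    macs = 0
    nonzero = iter(values)
    for bit, w in zip(mask, weights):
        if bit:
            acc += next(nonzero) * w     # one MAC per non-zero activation
            macs += 1
    return acc, macs


# Example: a post-ReLU activation vector in which most entries are zero.
activations = [0, 3, 0, 0, 2, 0, 1, 0]
weights = [1, 2, 3, 4, 5, 6, 7, 8]

mask, values = encode_sparse(activations)
result, macs = sparse_dot(mask, values, weights)
dense = sum(a * w for a, w in zip(activations, weights))
print(result, dense, macs, len(activations))
```

In this toy case the sparse path produces the same result as the dense dot product while issuing 3 MACs instead of 8 and storing 3 values plus an 8-bit mask, which mirrors the compute and memory savings the abstract attributes to sparsity.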