Massive Data-Centric Parallelism in the Chiplet Era
Traditionally, massively parallel applications are executed on distributed
systems, where computing nodes are distant enough that the parallelization
schemes must minimize communication and synchronization to achieve scalability.
Mapping communication-intensive workloads to distributed systems requires
complicated problem partitioning and dataset pre-processing. With the current
AI-driven trend of having thousands of interconnected processors per chip,
there is an opportunity to re-think these communication-bottlenecked workloads.
This bottleneck often arises from data structure traversals, which cause
irregular memory accesses and poor cache locality.
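To make the bottleneck concrete, the sketch below shows the access pattern of a BFS frontier expansion over a CSR graph (the arrays row_ptr/col_idx and the function name are illustrative): every edge triggers a data-dependent read, so consecutive iterations rarely touch the same cache line.

    # Sketch of the irregular access pattern in a BFS frontier expansion
    # over a CSR graph (hypothetical arrays row_ptr/col_idx).
    def expand_frontier(frontier, row_ptr, col_idx, dist, level):
        next_frontier = []
        for u in frontier:                      # sequential over the frontier
            for e in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[e]                  # data-dependent index
                if dist[v] == -1:               # irregular, cache-unfriendly read
                    dist[v] = level
                    next_frontier.append(v)
        return next_frontier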
Recent works have introduced task-based parallelization schemes to accelerate
graph traversal and other sparse workloads. Data structure traversals are split
into tasks and pipelined across processing units (PUs). Dalorex demonstrated
the highest scalability (up to thousands of PUs on a single chip) by having the
entire dataset on-chip, scattered across PUs, and executing the tasks at the PU
where the data is local. However, it also raised questions about how to scale
to larger datasets when all the memory is on-chip, and at what cost.
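As a rough illustration of this execution model (a toy sketch, not Dalorex's actual microarchitecture; the modulo placement policy and the queue structure are assumptions), each task is routed to the PU that owns the target vertex and executes against purely local state:

    from collections import deque

    # Toy model of data-local task execution: the dataset is scattered
    # across PUs, and a task on vertex v is queued at the PU that owns v
    # (owner = v % NUM_PUS is a hypothetical placement policy).
    NUM_PUS = 4
    queues = [deque() for _ in range(NUM_PUS)]

    def send_task(v, level):
        queues[v % NUM_PUS].append((v, level))  # route task to the data owner

    def run_pu(pu_id, neighbors, dist):
        while queues[pu_id]:
            v, level = queues[pu_id].popleft()
            if dist[v] == -1:                   # v's data is local to this PU
                dist[v] = level
                for w in neighbors[v]:          # spawn tasks where w's data lives
                    send_task(w, level + 1)

    def bfs(src, neighbors, dist):
        send_task(src, 0)
        while any(queues):                      # drain until no PU has work left
            for pu in range(NUM_PUS):
                run_pu(pu, neighbors, dist)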
To address these challenges, we propose a scalable architecture composed of a
grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time
reconfiguration enables creating chip products that optimize for different
target metrics, such as time-to-solution, energy, or cost, while software
reconfigurations avoid network saturation when scaling to millions of PUs
across many chip packages. We evaluate six applications and four datasets, with
several configurations and memory technologies, to provide a detailed analysis
of the performance, power, and cost of data-local execution at scale. Our
parallelization of Breadth-First Search on RMAT-26 across a million PUs
reaches 3,323 GTEPS.
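As a back-of-envelope check, assuming the Graph500 convention for RMAT-26 (2^26 vertices, edge factor 16, so roughly 2^30 edges), that rate corresponds to about a third of a millisecond per full BFS:

    # Back-of-envelope check of the headline number, assuming the Graph500
    # convention for RMAT-26: 2^26 vertices, edge factor 16 => ~2^30 edges.
    vertices = 2**26
    edges = 16 * vertices            # ~1.07e9 edges (assumption)
    gteps = 3323                     # reported traversal rate
    seconds = edges / (gteps * 1e9)  # time for one full BFS sweep
    print(f"{edges:.3e} edges -> {seconds * 1e6:.0f} us per BFS")  # ~323 us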
Overview of Swallow --- A Scalable 480-core System for Investigating the Performance and Energy Efficiency of Many-core Applications and Operating Systems
We present Swallow, a scalable many-core architecture, with a current
configuration of 480 x 32-bit processors.
Swallow is an open-source architecture, designed from the ground up to
deliver scalable increases in usable computational power to allow
experimentation with many-core applications and the operating systems that
support them.
Scalability is enabled by a tileable system design with a low-latency
interconnect, an attractive communication-to-computation ratio, and a
distributed memory configuration.
We analyse the energy, computation, and communication performance of Swallow.
The system provides 240 GIPS, with each core consuming 71--193 mW depending on
workload. Power consumption per instruction is lower than in almost all
systems of comparable scale.
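Assuming throughput is spread evenly across the 480 cores (0.5 GIPS each), the reported power range implies roughly 142--386 pJ per instruction:

    # Rough energy-per-instruction estimate from the figures above,
    # assuming throughput is spread evenly across the 480 cores.
    total_gips = 240
    cores = 480
    per_core_ips = total_gips / cores * 1e9   # 0.5 GIPS per core
    for p_mw in (71, 193):                    # reported per-core power range
        pj = p_mw * 1e-3 / per_core_ips * 1e12
        print(f"{p_mw} mW -> {pj:.0f} pJ/instruction")  # ~142 and ~386 pJ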
We also show how the use of a distributed operating system (nOS) allows the
easy creation of scalable software to exploit Swallow's potential. Finally, we
show two case studies: modelling neurons and overlaying shared memory on a
distributed memory system.
Comment: An open-source release of the Swallow system design and code will
follow, and references to these will be added at a later date.
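A shared-memory overlay of this kind can be sketched as follows (a hypothetical address-striping scheme, not nOS's actual protocol): a flat address space is striped across tiles, and loads and stores become messages to the owning tile.

    # Toy shared-memory overlay on a distributed-memory system: a flat
    # address space is striped across tiles, and get/put become messages
    # to the owning tile (hypothetical scheme, not nOS's actual protocol).
    WORDS_PER_TILE = 1024

    class Tile:
        def __init__(self):
            self.mem = [0] * WORDS_PER_TILE

    tiles = [Tile() for _ in range(480)]

    def owner(addr):                 # static striping by address
        return tiles[(addr // WORDS_PER_TILE) % len(tiles)]

    def load(addr):                  # would be a request/reply message
        return owner(addr).mem[addr % WORDS_PER_TILE]

    def store(addr, value):          # would be a one-way message
        owner(addr).mem[addr % WORDS_PER_TILE] = value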
Memory and information processing in neuromorphic systems
A striking difference between brain-inspired neuromorphic processors and
current von Neumann processor architectures is the way in which memory and
processing are organized. As Information and Communication Technologies
continue to address the need for increased computational power by increasing
the number of cores within a digital processor, neuromorphic engineers and
scientists can complement this approach by building processor architectures
where memory is
distributed with the processing. In this paper we present a survey of
brain-inspired processor architectures that support models of cortical networks
and deep neural networks. These architectures range from serial clocked
implementations of multi-neuron systems to massively parallel asynchronous ones
and from purely digital systems to mixed analog/digital systems that implement
more biologically plausible models of neurons and synapses, together with a suite of
adaptation and learning mechanisms analogous to the ones found in biological
nervous systems. We describe the advantages of the different approaches being
pursued and present the challenges that need to be addressed for building
artificial neural processing systems that can display the richness of behaviors
seen in biological systems.
Comment: Submitted to Proceedings of the IEEE; a review of recently proposed
neuromorphic computing platforms and systems.
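As a point of reference for the neuron models these architectures implement, a minimal leaky integrate-and-fire update (parameter values here are illustrative) keeps the neuron's state next to the computation that uses it:

    # Minimal leaky integrate-and-fire neuron update, the kind of model the
    # surveyed architectures implement with neuron and synapse state stored
    # next to the processing element (parameter values are illustrative).
    def lif_step(v, input_current, dt=1e-3, tau=20e-3,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
        # Membrane potential leaks toward rest and integrates input current.
        v += dt / tau * (v_rest - v) + input_current * dt
        if v >= v_thresh:          # threshold crossing emits a spike
            return v_reset, True   # reset potential, spike out
        return v, False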