Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms
Dense linear algebra kernels are critical for wireless applications, and the
oncoming proliferation of 5G only amplifies their importance. Many such matrix
algorithms are inductive and exhibit ample fine-grain ordered parallelism:
multiple computations flow with fine-grain producer/consumer dependences, and
the iteration domain is not easily tileable. Synchronization overheads make
multi-core parallelism ineffective, and
the non-tileable iterations make the vector-VLIW approach less effective,
especially for the typically modest-sized matrices. Because CPUs and DSPs lose
an order of magnitude in performance and hardware utilization, costly and
inflexible ASICs are often employed in signal processing pipelines. A programmable
accelerator with similar performance/power/area would be highly desirable. We
find that fine-grain ordered parallelism can be exploited by supporting: 1.
fine-grain stream-based communication/synchronization; 2. inductive data-reuse
and memory access patterns; 3. implicit vector-masking for partial vectors; 4.
hardware specialization of dataflow criticality. In this work, we propose
REVEL, a next-generation DSP architecture. It supports the above features in
its ISA and microarchitecture, and further uses a novel vector-stream control
paradigm to reduce control overheads. Across a suite of linear algebra kernels,
REVEL outperforms equally provisioned DSPs by 4.6x-37x in latency and achieves
8.3x better performance per mm². It requires only 2.2x higher power than ideal
ASICs to achieve the same performance, at about 55% of their combined area.
Inter-thread Communication in Multithreaded, Reconfigurable Coarse-grain Arrays
Traditional von Neumann GPGPUs only allow threads to communicate through
memory on a group-to-group basis. In this model, a group of producer threads
writes intermediate values to memory, which are read by a group of consumer
threads after a barrier synchronization. To alleviate the memory bandwidth
pressure imposed by this method of communication, GPGPUs provide a small scratchpad
memory that prevents intermediate values from overloading DRAM bandwidth. In
this paper we introduce direct inter-thread communications for massively
multithreaded CGRAs, where intermediate values are communicated directly
through the compute fabric on a point-to-point basis. This method avoids the
need to write values to memory, eliminates the need for a dedicated scratchpad,
and avoids workgroup-global barriers. The paper introduces the programming
model (CUDA) and execution model extensions, as well as the hardware primitives
that facilitate the communication. Our simulations of Rodinia benchmarks
running on the new system show that direct inter-thread communication provides
an average speedup of 4.5x (13.5x max) and reduces system power by an average
of 7x (33x max), when compared to an equivalent Nvidia GPGPU.