A transprecision floating-point cluster for efficient near-sensor data analytics
Recent applications in the domain of near-sensor computing require the
adoption of floating-point arithmetic to reconcile high precision results with
a wide dynamic range. In this paper, we propose a multi-core computing cluster
that leverages the fined-grained tunable principles of transprecision computing
to provide support to near-sensor applications at a minimum power budget. Our
design - based on the open-source RISC-V architecture - combines
parallelization and sub-word vectorization with near-threshold operation,
leading to a highly scalable and versatile system. We perform an exhaustive
exploration of the design space of the transprecision cluster on a
cycle-accurate FPGA emulator, with the aim of identifying the most efficient
configurations in terms of performance, energy efficiency, and area efficiency.
We also provide a full-fledged software stack, including a parallel
runtime and a compilation toolchain, to enable the development of end-to-end
applications. We perform an experimental assessment of our design on a set of
benchmarks representative of the near-sensor processing domain, complementing
the timing results with a post-place-and-route analysis of the power consumption.
Finally, a comparison with the state-of-the-art shows that our solution
outperforms the competitors in energy efficiency, reaching a peak of 97
Gflop/s/W on single-precision scalars and 162 Gflop/s/W on half-precision
vectors.
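The energy gain from half-precision vectors comes at the cost of reduced numeric precision, a trade-off central to transprecision computing. The NumPy sketch below (an illustration only, not the paper's RISC-V SIMD implementation) shows the kind of accuracy loss an application tolerates when a reduction is carried out in fp16 instead of fp32:

```python
import numpy as np

# A synthetic near-sensor signal: 1024 samples in [0, 1]
x = np.linspace(0.0, 1.0, 1024, dtype=np.float32)

# Full-precision (fp32) accumulation
full = x.sum(dtype=np.float32)       # ~512.0

# Half-precision (fp16) accumulation, as a sub-word vector unit would use
half = x.astype(np.float16).sum(dtype=np.float16)

# fp16 keeps ~3 decimal digits of precision; for many near-sensor
# analytics kernels this error is acceptable and halves the data width,
# doubling the elements processed per memory word.
err = abs(float(full) - float(half))
```

A kernel that can bound its tolerable error this way is a candidate for the half-precision vector units that reach the quoted 162 Gflop/s/W.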
Accelerating Reduction and Scan Using Tensor Core Units
Driven by deep learning, there has been a surge of specialized processors for
matrix multiplication, referred to as Tensor Core Units (TCUs). These TCUs are
capable of performing matrix multiplications on small matrices (usually 4x4 or
16x16) to accelerate convolutional and recurrent neural networks in deep
learning workloads. In this paper we leverage NVIDIA's TCU to express both
reduction and scan with matrix multiplication and show the benefits -- in terms
of program simplicity, efficiency, and performance. Our algorithm exercises the
NVIDIA TCUs which would otherwise be idle, achieves 89%-98% of peak memory copy
bandwidth, and is orders of magnitude faster (up to 100x for reduction and 3x
for scan) than state-of-the-art methods for small segment sizes -- common in
machine learning and scientific applications. Our algorithm achieves this while
decreasing the power consumption by up to 22% for reduction and 16% for scan.
Comment: In Proceedings of the ACM International Conference on Supercomputing
(ICS '19).
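The linear-algebra identity underlying the approach is that both reduction and scan can be written as matrix products: a sum is a product with a vector of ones, and an inclusive prefix sum is a product with a lower-triangular matrix of ones. The NumPy sketch below illustrates that identity on a single 16-element segment (matching one assumed 16x16 TCU tile); the paper's actual implementation maps these products onto NVIDIA tensor-core fragments in CUDA:

```python
import numpy as np

n = 16  # one segment, sized to an assumed 16x16 TCU tile
x = np.arange(1, n + 1, dtype=np.float32)  # [1, 2, ..., 16]

# Reduction: sum(x) expressed as a matrix product with a ones vector
ones = np.ones(n, dtype=np.float32)
total = ones @ x                 # 1 + 2 + ... + 16 = 136

# Inclusive scan: prefix sums expressed as L @ x, where L is a
# lower-triangular matrix of ones (row i sums elements 0..i)
L = np.tril(np.ones((n, n), dtype=np.float32))
prefix = L @ x                   # [1, 3, 6, 10, ...]
```

Expressing the primitives this way lets otherwise-idle matrix units do the work; the cost is the redundant multiplications by ones, which the hardware performs at far higher throughput than a scalar loop.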