A transprecision floating-point cluster for efficient near-sensor data analytics
Recent applications in the domain of near-sensor computing require the
adoption of floating-point arithmetic to reconcile high precision results with
a wide dynamic range. In this paper, we propose a multi-core computing cluster
that leverages the fine-grained tunable principles of transprecision computing
to provide support to near-sensor applications at a minimum power budget. Our
design - based on the open-source RISC-V architecture - combines
parallelization and sub-word vectorization with near-threshold operation,
leading to a highly scalable and versatile system. We perform an exhaustive
exploration of the design space of the transprecision cluster on a
cycle-accurate FPGA emulator, with the aim of identifying the most efficient
configurations in terms of performance, energy efficiency, and area efficiency.
We also provide full-fledged software stack support, including a parallel
runtime and a compilation toolchain, to enable the development of end-to-end
applications. We perform an experimental assessment of our design on a set of
benchmarks representative of the near-sensor processing domain, complementing
the timing results with a post place-&-route analysis of the power consumption.
Finally, a comparison with the state-of-the-art shows that our solution
outperforms the competitors in energy efficiency, reaching a peak of 97
Gflop/s/W on single-precision scalars and 162 Gflop/s/W on half-precision
vectors.
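The sub-word vectorization mentioned in the abstract can be illustrated with a minimal sketch: two half-precision (IEEE 754 binary16) values share one 32-bit word, so a single word-wide operation covers two lanes. The helper names `pack_half2`, `unpack_half2`, and `add_half2` are hypothetical and chosen only for this illustration; this is a software model of the idea, not the cluster's actual instruction set or runtime.

```python
import struct


def pack_half2(a, b):
    """Pack two values as binary16 into one 32-bit word (a in the low half)."""
    lo, hi = struct.unpack("<HH", struct.pack("<ee", a, b))
    return lo | (hi << 16)


def unpack_half2(word):
    """Split a 32-bit word back into its two binary16 lanes."""
    return struct.unpack("<ee", struct.pack("<I", word))


def add_half2(x, y):
    """Lane-wise addition of two packed words (software model of a 2-lane vector add)."""
    ax, bx = unpack_half2(x)
    ay, by = unpack_half2(y)
    return pack_half2(ax + ay, bx + by)


if __name__ == "__main__":
    v = add_half2(pack_half2(1.5, 2.0), pack_half2(0.5, 4.0))
    print(unpack_half2(v))
```

Each lane is rounded to binary16 when it is packed again, which is the precision traded for processing two elements per word.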
Integrated Programmable-Array accelerator to design heterogeneous ultra-low power manycore architectures
There is an ever-increasing demand for energy efficiency (EE) in rapidly evolving Internet-of-Things end nodes. This pushes researchers and engineers to develop solutions that provide both Application-Specific Integrated Circuit-like EE and Field-Programmable Gate Array-like flexibility. One such solution is the Coarse-Grained Reconfigurable Array (CGRA). Over the past decades, CGRAs have evolved and are competing to become mainstream hardware accelerators, especially for Digital Signal Processing (DSP) applications. Due to the over-specialization of computing architectures, the focus is shifting towards fitting an extensive data representation range into fewer bits; e.g., a 32-bit word can represent a wider range of values with a floating-point (FP) representation than with an integer representation. However, computation using FP representation requires numerous encodings and leads to complex circuits for the FP operators, decreasing the EE of the entire system. This thesis presents the design of an energy-efficient, ultra-low-power CGRA with native support for FP computation, leveraging an emerging paradigm of approximate computing called transprecision computing. We also present contributions to the compilation toolchain and to the system-level integration of the CGRA in a System-on-Chip, to position the proposed CGRA as an energy-efficient hardware accelerator. Finally, an extensive set of experiments using real-world algorithms employed in near-sensor processing applications is performed, and the results are compared with state-of-the-art (SoA) architectures. It is empirically shown that the proposed CGRA outperforms SoA architectures in terms of power, performance, and area.
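The remark that a 32-bit word covers a wider range in floating point than as an integer can be checked with a short sketch. It assumes the IEEE 754 binary32 format for the floating-point side and a two's-complement 32-bit integer for the other; the names `bits_to_float32`, `FLOAT32_MAX`, and `FLOAT32_MIN_NORMAL` are illustrative only.

```python
import struct

INT32_MAX = 2**31 - 1  # largest two's-complement 32-bit integer


def bits_to_float32(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


FLOAT32_MAX = bits_to_float32(0x7F7FFFFF)         # largest finite binary32
FLOAT32_MIN_NORMAL = bits_to_float32(0x00800000)  # smallest positive normal binary32

if __name__ == "__main__":
    print("int32 max:          ", INT32_MAX)
    print("float32 max:        ", FLOAT32_MAX)
    print("float32 min normal: ", FLOAT32_MIN_NORMAL)
    # The wider range costs resolution: above 2**24 the spacing between
    # neighbouring binary32 values is 2, so odd integers are not representable.
    print("next after 2**24:   ", bits_to_float32(0x4B800001))
```

The last line shows the other side of the trade-off: the same 32 bits reach much larger magnitudes, but with coarser spacing between representable values.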