251 research outputs found
hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
Accessible machine learning algorithms, software, and diagnostic tools for
energy-efficient devices and systems are extremely valuable across a broad
range of application domains. In scientific domains, real-time near-sensor
processing can drastically improve experimental design and accelerate
scientific discoveries. To support domain scientists, we have developed hls4ml,
an open-source software-hardware codesign workflow to interpret and translate
machine learning algorithms for implementation with both FPGA and ASIC
technologies. We expand on previous hls4ml work by extending capabilities and
techniques towards low-power implementations and increased usability: new
Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long
pipeline kernels for low power, and new device backends include an ASIC
workflow. Taken together, these and continued efforts in hls4ml will arm a new
generation of domain scientists with accessible, efficient, and powerful tools
for machine-learning-accelerated discovery.Comment: 10 pages, 8 figures, TinyML Research Symposium 202
FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
Neural Network (NN) accelerators with emerging ReRAM (resistive random access
memory) technologies have been investigated as one of the promising solutions
to address the \textit{memory wall} challenge, due to the unique capability of
\textit{processing-in-memory} within ReRAM-crossbar-based processing elements
(PEs). However, the high efficiency and high density advantages of ReRAM have
not been fully utilized due to the huge communication demands among PEs and the
overhead of peripheral circuits.
In this paper, we propose a full system stack solution, composed of a
reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and
its software system including neural synthesizer, temporal-to-spatial mapper,
and placement & routing. We highly leverage the software system to make the
hardware design compact and efficient. To satisfy the high-performance
communication demand, we optimize it with a reconfigurable routing architecture
and the placement & routing tool. To improve the computational density, we
greatly simplify the PE circuit with the spiking schema and then adopt neural
synthesizer to enable the high density computation-resources to support
different kinds of NN operations. In addition, we provide spiking memory blocks
(SMBs) and configurable logic blocks (CLBs) in hardware and leverage the
temporal-to-spatial mapper to utilize them to balance the storage and
computation requirements of NN. Owing to the end-to-end software system, we can
efficiently deploy existing deep neural networks to FPSA. Evaluations show
that, compared to one of state-of-the-art ReRAM-based NN accelerators, PRIME,
the computational density of FPSA improves by 31x; for representative NNs, its
inference performance can achieve up to 1000x speedup.Comment: Accepted by ASPLOS 201
Recommended from our members
A RISC-V Vector Processor With Simultaneous-Switching Switched-Capacitor DC-DC Converters in 28 nm FDSOI
This work demonstrates a RISC-V vector microprocessor implemented in 28 nm FDSOI with fully integrated simultaneous-switching switched-capacitor DC-DC (SC DC-DC) converters and adaptive clocking that generates four on-chip voltages between 0.45 and 1 V using only 1.0 V core and 1.8 V IO voltage inputs. The converters achieve high efficiency at the system level by switching simultaneously to avoid charge-sharing losses and by using an adaptive clock to maximize performance for the resulting voltage ripple. Details about the implementation of the DC-DC switches, DC-DC controller, and adaptive clock are provided, and the sources of conversion loss are analyzed based on measured results. This system pushes the capabilities of dynamic voltage scaling by enabling fast transitions (20 ns), simple packaging (no off-chip passives), low area overhead (16%), high conversion efficiency (80%-86%), and high energy efficiency (26.2 DP GFLOPS/W) for mobile devices
Characterizing and Optimizing the Performance of the MAESTRO 49-core Processor
As space-based imagery-intelligence systems become increasingly complex, processing units are needed that can process the extra data these systems seek to collect. However, the space environment presents a number of threats, such as ambient or malicious radiation, that can damage and otherwise interfere with electronic systems. There is a need, then, for processors that can tolerate radiation-induced faults, and that also have sufficient computational power to handle the large flow of data they encounter. This research investigates one potential solution: a multi-core processor that is radiation-hardened and designed to provide highly parallelized MIMD execution of applicable workloads. A variety of benchmarking programs are used to explore the capabilities of this processor. Additionally, the source code is modified in an attempt to enhance the processor speed and efficiency; the consequent improvements in performance are documented
Reconfigurable computing for large-scale graph traversal algorithms
This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem.
The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows.
First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems.
Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation.
Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case-study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data.
Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.Open Acces
A Gaussian probability accelerator for SPHINX 3
technical reportAccurate real-time speech recognition is not currently possible in the mobile embedded space where the need for natural voice interfaces is clearly important. The continuous nature of speech recognition coupled with an inherently large working set creates significant cache interference with other processes. Hence real-time recognition is problematic even on high-performance general-purpose platforms. This paper provides a detailed analysis of CMU?s latest speech recognizer (Sphinx 3.2), identifies three distinct processing phases, and quantifies the architectural requirements for each phase. Several optimizations are then described which expose parallelism and drastically reduce the bandwidth and power requirements for real-time recognition. A special-purpose accelerator for the dominant Gaussian probability phase is developed for a 0.25 CMOS process which is then analyzed and compared with Sphinx?s measured energy and performance on a 0.13 2.4 GHz Pentium4 system. The results show an improvement in power consumption by a factor of 29 at equivalent processing throughput. However after normalizing for process, the specialpurpose approach has twice the throughput, and consumes 104 times less energy than the general-purpose accelerator. The energy-delay product is a better comparison metric due to the inherent design trade-offs between energy consumption and performance. The energydelay product of the special-purpose approach is 196 times better than the Pentium4. These results provide strong evidence that real-time large vocabulary speech recognition can be done within a power budget commensurate with embedded processing using today?s technology
Center for Aeronautics and Space Information Sciences
This report summarizes the research done during 1991/92 under the Center for Aeronautics and Space Information Science (CASIS) program. The topics covered are computer architecture, networking, and neural nets
- …