730 research outputs found
HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA
Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host
processor with programmable manycore accelerators (PMCAs) to combine
general-purpose computing with domain-specific, efficient processing
capabilities. While leading companies successfully advance their HESoC
products, research lags behind due to the challenges of building a prototyping
platform that unites an industry-standard host processor with an open research
PMCA architecture. In this work we introduce HERO, an FPGA-based research
platform that combines a PMCA composed of clusters of RISC-V cores, implemented
as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host
processor. The PMCA architecture mapped on the FPGA is silicon-proven,
scalable, configurable, and fully modifiable. HERO includes a complete software
stack that consists of a heterogeneous cross-compilation toolchain with support
for OpenMP accelerator programming, a Linux driver, and runtime libraries for
both host and PMCA. HERO is designed to facilitate rapid exploration on all
software and hardware layers: run-time behavior can be accurately analyzed by
tracing events, and modifications can be validated through fully automated hard
ware and software builds and executed tests. We demonstrate the usefulness of
HERO by means of case studies from our research
ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks
The deployment of AI models on low-power, real-time edge devices requires
accelerators for which energy, latency, and area are all first-order concerns.
There are many approaches to enabling deep neural networks (DNNs) in this
domain, including pruning, quantization, compression, and binary neural
networks (BNNs), but with the emergence of the "extreme edge", there is now a
demand for even more efficient models. In order to meet the constraints of
ultra-low-energy devices, we propose ULEEN, a model architecture based on
weightless neural networks. Weightless neural networks (WNNs) are a class of
neural model which use table lookups, not arithmetic, to perform computation.
The elimination of energy-intensive arithmetic operations makes WNNs
theoretically well suited for edge inference; however, they have historically
suffered from poor accuracy and excessive memory usage. ULEEN incorporates
algorithmic improvements and a novel training strategy inspired by BNNs to make
significant strides in improving accuracy and reducing model size. We compare
FPGA and ASIC implementations of an inference accelerator for ULEEN against
edge-optimized DNN and BNN devices. On a Xilinx Zynq Z-7045 FPGA, we
demonstrate classification on the MNIST dataset at 14.3 million inferences per
second (13 million inferences/Joule) with 0.21 s latency and 96.2%
accuracy, while Xilinx FINN achieves 12.3 million inferences per second (1.69
million inferences/Joule) with 0.31 s latency and 95.83% accuracy. In a
45nm ASIC, we achieve 5.1 million inferences/Joule and 38.5 million
inferences/second at 98.46% accuracy, while a quantized Bit Fusion model
achieves 9230 inferences/Joule and 19,100 inferences/second at 99.35% accuracy.
In our search for ever more efficient edge devices, ULEEN shows that WNNs are
deserving of consideration.Comment: 14 pages, 14 figures Portions of this article draw heavily from
arXiv:2203.01479, most notably sections 5E and 5F.
Run-time reconfigurable acceleration for genetic programming fitness evaluation in trading strategies
Genetic programming can be used to identify complex patterns in financial markets which may lead to more advanced trading strategies. However, the computationally intensive nature of genetic programming makes it difficult to apply to real world problems, particularly in real-time constrained scenarios. In this work we propose the use of Field Programmable Gate Array technology to accelerate the fitness evaluation step, one of the most computationally demanding operations in genetic programming. We propose to develop a fully-pipelined, mixed precision design using run-time reconfiguration to accelerate fitness evaluation. We show that run-time reconfiguration can reduce resource consumption by a factor of 2 compared to previous solutions on certain configurations. The proposed design is up to 22 times faster than an optimised, multithreaded software implementation while achieving comparable financial returns
X-TIME: An in-memory engine for accelerating machine learning on tabular data with CAMs
Structured, or tabular, data is the most common format in data science. While
deep learning models have proven formidable in learning from unstructured data
such as images or speech, they are less accurate than simpler approaches when
learning from tabular data. In contrast, modern tree-based Machine Learning
(ML) models shine in extracting relevant information from structured data. An
essential requirement in data science is to reduce model inference latency in
cases where, for example, models are used in a closed loop with simulation to
accelerate scientific discovery. However, the hardware acceleration community
has mostly focused on deep neural networks and largely ignored other forms of
machine learning. Previous work has described the use of an analog content
addressable memory (CAM) component for efficiently mapping random forests. In
this work, we focus on an overall analog-digital architecture implementing a
novel increased precision analog CAM and a programmable network on chip
allowing the inference of state-of-the-art tree-based ML models, such as
XGBoost and CatBoost. Results evaluated in a single chip at 16nm technology
show 119x lower latency at 9740x higher throughput compared with a
state-of-the-art GPU, with a 19W peak power consumption
FPGA-based Query Acceleration for Non-relational Databases
Database management systems are an integral part of today’s everyday life. Trends like smart applications, the internet of things, and business and social networks require applications to deal efficiently with data in various data models close to the underlying domain. Therefore, non-relational database systems provide a wide variety of database models, like graphs and documents. However, current non-relational database systems face performance challenges due to the end of Dennard scaling and therefore performance scaling of CPUs. In the meanwhile, FPGAs have gained traction as accelerators for data management.
Our goal is to tackle the performance challenges of non-relational database
systems with FPGA acceleration and, at the same time, address design challenges of FPGA acceleration itself. Therefore, we split this thesis up into two main lines of work: graph processing and flexible data processing.
Because of the lacking benchmark practices for graph processing accelerators, we propose GraphSim. GraphSim is able to reproduce runtimes of these accelerators based on a memory access model of the approach. Through this simulation environment, we extract three performance-critical accelerator properties: asynchronous graph processing, compressed graph data structure, and multi-channel memory. Since these accelerator properties have not been combined in one system, we propose GraphScale. GraphScale is the first scalable, asynchronous graph processing accelerator working on a compressed graph and outperforms all state-of-the-art graph processing accelerators.
Focusing on accelerator flexibility, we propose PipeJSON as the first FPGA-based JSON parser for arbitrary JSON documents. PipeJSON is able to achieve
parsing at line-speed, outperforming the fastest, vectorized parsers for CPUs. Lastly, we propose the subgraph query processing accelerator GraphMatch which outperforms state-of-the-art CPU systems for subgraph query processing and is able to flexibly switch queries during runtime in a matter of clock cycles
- …