1,590 research outputs found
Automatic generation of hardware Tree Classifiers
Machine Learning is growing in popularity and spreading across different fields for various applications. Due to this trend, machine learning algorithms use different hardware platforms and are being experimented to obtain high test accuracy and throughput. FPGAs are well-suited hardware platform for machine learning because of its re-programmability and lower power consumption. Programming using FPGAs for machine learning algorithms requires substantial engineering time and effort compared to software implementation. We propose a software assisted design flow to program FPGA for machine learning algorithms using our hardware library. The hardware library is highly parameterized and it accommodates Tree Classifiers. As of now, our library consists of the components required to implement decision trees and random forests. The whole automation is wrapped around using a python script which takes you from the first step of having a dataset and design choices to the last step of having a hardware descriptive code for the trained machine learning model
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.Comment: Published as a conference paper at ASPLOS 202
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning,
but they present significant computational challenges. Recently, many
FPGA-based accelerators have been proposed to improve the performance and
efficiency of CNNs. Current approaches construct a single processor that
computes the CNN layers one at a time; the processor is optimized to maximize
the throughput at which the collection of layers is computed. However, this
approach leads to inefficient designs because the same processor structure is
used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated
design methodology that partitions the available FPGA resources into multiple
processors, each of which is tailored for a different subset of the CNN
convolutional layers. Using the same FPGA resources as a single large
processor, multiple smaller specialized processors increase computational
efficiency and lead to a higher overall throughput. Our design methodology
achieves 3.8x higher throughput than the state-of-the-art approach on
evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more
recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x
A Binaural Neuromorphic Auditory Sensor for FPGA: A Spike Signal Processing Approach
This paper presents a new architecture, design
flow, and field-programmable gate array (FPGA) implementation
analysis of a neuromorphic binaural auditory sensor, designed
completely in the spike domain. Unlike digital cochleae that
decompose audio signals using classical digital signal processing
techniques, the model presented in this paper processes information
directly encoded as spikes using pulse frequency modulation
and provides a set of frequency-decomposed audio information
using an address-event representation interface. In this case,
a systematic approach to design led to a generic process for
building, tuning, and implementing audio frequency decomposers
with different features, facilitating synthesis with custom features.
This allows researchers to implement their own parameterized
neuromorphic auditory systems in a low-cost FPGA in order to
study the audio processing and learning activity that takes place
in the brain. In this paper, we present a 64-channel binaural
neuromorphic auditory system implemented in a Virtex-5 FPGA
using a commercial development board. The system was excited
with a diverse set of audio signals in order to analyze its response
and characterize its features. The neuromorphic auditory system
response times and frequencies are reported. The experimental
results of the proposed system implementation with 64-channel
stereo are: a frequency range between 9.6 Hz and 14.6 kHz
(adjustable), a maximum output event rate of 2.19 Mevents/s,
a power consumption of 29.7 mW, the slices requirements
of 11 141, and a system clock frequency of 27 MHz.Ministerio de Economía y Competitividad TEC2012-37868-C04-02Junta de Andalucía P12-TIC-130
- …