FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software
A deep-learning inference accelerator is synthesized from a C-language
software program parallelized with Pthreads. The software implementation uses
the well-known producer/consumer model with parallel threads interconnected by
FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into
parallel FPGA hardware, translating software parallelism into spatial
parallelism. A complete system is generated where convolution, pooling and
padding are realized in the synthesized accelerator, with remaining tasks
executing on an embedded ARM processor. The accelerator incorporates reduced
precision, and a novel approach for zero-weight-skipping in convolution. On a
mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective
GOPS.
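The producer/consumer structure and the idea of skipping zero weights can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe how their zero-weight-skipping scheme works, and the paper uses C with Pthreads, whereas this sketch uses Python's `threading` and `queue` modules as a stand-in. The names `producer`, `consumer`, `dot_with_skipping`, and the `SENTINEL` end-of-stream convention are all hypothetical.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker placed on the FIFO


def producer(weights, activations, out_q):
    """Stream (weight, activation) pairs into the FIFO, skipping zero weights."""
    for w, a in zip(weights, activations):
        if w != 0:  # a zero weight contributes nothing to the sum, so drop it
            out_q.put((w, a))
    out_q.put(SENTINEL)


def consumer(in_q, result):
    """Multiply-accumulate pairs from the FIFO until the sentinel arrives."""
    acc = 0
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        w, a = item
        acc += w * a
    result.append(acc)


def dot_with_skipping(weights, activations, depth=4):
    """Run one producer and one consumer thread joined by a bounded FIFO."""
    fifo = queue.Queue(maxsize=depth)  # bounded, like a hardware FIFO
    result = []
    threads = [
        threading.Thread(target=producer, args=(weights, activations, fifo)),
        threading.Thread(target=consumer, args=(fifo, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

For example, `dot_with_skipping([0, 2, 0, -1], [5, 3, 7, 4])` returns `2`, and only two of the four pairs ever enter the FIFO. In an HLS flow such as the one the abstract describes, each thread would become its own block of hardware running concurrently, which a Python sketch can only imitate.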
Behavioral synthesis from VHDL using structured modeling
This dissertation describes work in behavioral synthesis involving the development of a VHDL Synthesis System (VSS), which accepts a VHDL behavioral input specification and performs technology-independent synthesis to generate a circuit netlist of generic components. The VHDL language is used for input and output descriptions. An intermediate representation which incorporates signal typing and component attributes simplifies compilation and facilitates design optimization. A Structured Modeling methodology has been developed to suggest standard VHDL modeling practices for synthesis. Structured Modeling provides recommendations for the use of available VHDL description styles so that optimal designs will be synthesized. A design composed of generic components is synthesized from the input description through a process of Graph Compilation, Graph Criticism, and Design Compilation. Experiments were performed to demonstrate the effects of different modeling styles on the quality of the design produced by VSS. Several alternative VHDL models were examined for each benchmark, illustrating the improvements in design quality achieved when Structured Modeling guidelines were followed.
QCDOC: A 10-teraflops scale computer for lattice QCD
The architecture of a new class of computers, optimized for lattice QCD
calculations, is described. An individual node is based on a single integrated
circuit containing a PowerPC 32-bit integer processor with a 1 Gflops 64-bit
IEEE floating point unit, 4 Mbyte of memory, 8 Gbit/sec nearest-neighbor
communications and additional control and diagnostic circuitry. The machine's
name, QCDOC, derives from "QCD On a Chip".
Comment: Lattice 2000 (machines), 8 pages, 4 figures
BLITZEN: A highly integrated massively parallel machine
The architecture and VLSI design of a new massively parallel processing array chip are described. The BLITZEN processing element array chip, which contains 1.1 million transistors, serves as the basis for a highly integrated, miniaturized, high-performance, massively parallel machine that is currently under development. Each processing element has 1K bits of static RAM and performs bit-serial processing with functional elements for arithmetic, logic, and shifting.
Programmable Trigger Logic Unit Based on FPGA Technology
A programmable trigger logic module (TRILOMO) was successfully implemented in
an FPGA, using its internal look-up tables to store Boolean functions. Up to 16
trigger input signals can be combined logically for a fast trigger decision.
The new feature is that the trigger decision is VME-register-based: changes
are made without modifying the FPGA code. Additionally, the module has an
excellent signal delay adjustment.
Comment: 4 pages, 4 figures
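The idea of holding a Boolean trigger condition in a look-up table can be sketched in software. This is only an illustration under stated assumptions, not the TRILOMO design: the abstract does not give the table layout or register map, and the names `build_lut`, `trigger`, and `coincidence_with_veto` are hypothetical. The example uses 4 inputs to keep the table small; with 16 inputs the same table would have 65,536 entries.

```python
def build_lut(func, n_inputs):
    """Tabulate a Boolean function of n_inputs bits into a truth table."""
    return [bool(func(pattern)) for pattern in range(1 << n_inputs)]


def trigger(lut, inputs):
    """Pack the input levels into an index and read the decision from the table."""
    index = 0
    for bit_pos, level in enumerate(inputs):
        if level:
            index |= 1 << bit_pos
    return lut[index]


def coincidence_with_veto(pattern):
    """Hypothetical condition: inputs 0 and 1 both high, input 3 low."""
    return (pattern & 0b0011) == 0b0011 and not (pattern & 0b1000)


# Loading a different table changes the trigger logic without touching
# the lookup code, loosely mirroring the register-based reconfiguration
# the abstract describes.
lut = build_lut(coincidence_with_veto, 4)
fired = trigger(lut, [1, 1, 0, 0])    # inputs 0 and 1 high, no veto
vetoed = trigger(lut, [1, 1, 0, 1])   # same pattern with input 3 high
```

Here `fired` is `True` and `vetoed` is `False`. The decision costs one table read regardless of how complicated the Boolean function is, which is why a table-based scheme suits a fast trigger.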