Packet Switched vs. Time Multiplexed FPGA Overlay Networks
Dedicated, spatially configured FPGA interconnect
is efficient for applications that require high throughput connections
between processing elements (PEs) but with a limited degree
of PE interconnectivity (e.g. wiring up gates and datapaths).
Applications which virtualize PEs may require a large number
of distinct PE-to-PE connections (e.g. using one PE to simulate
100s of operators, each requiring input data from thousands of
other operators), but with each connection having low throughput
compared with the PE’s operating cycle time. In these highly interconnected
conditions, dedicating spatial interconnect resources
for all possible connections is costly and inefficient. Alternatively,
we can time share physical network resources by virtualizing
interconnect links, either by statically scheduling the sharing
of resources prior to runtime or by dynamically negotiating
resources at runtime. We explore the tradeoffs (e.g. area, route
latency, route quality) between time-multiplexed and packet-switched
networks overlaid on top of commodity FPGAs. We
demonstrate modular and scalable networks which operate on
a Xilinx XC2V6000-4 at 166MHz. For our applications, time-multiplexed,
offline scheduling offers up to a 63% performance
increase over online, packet-switched scheduling for equivalent
topologies. When applying designs to equivalent area, packet-switching
is up to 2× faster for small area designs while time-multiplexing
is up to 5× faster for larger area designs. When
limited to the capacity of an XC2V6000, if all communication is
known, time-multiplexed routing outperforms packet-switching;
however, when the active set of links drops below 40% of the
potential links, packet-switched routing can outperform time-multiplexing.
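
The offline, time-multiplexed approach described above can be illustrated with a minimal sketch. It is not the scheduler from the paper: it assumes a hypothetical 1-D array of PEs with one directed link between neighbours, one hop per cycle, and a simple greedy policy; the names `path` and `schedule_offline` are invented for illustration.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # directed link between neighbouring PEs (assumed topology)


def path(src: int, dst: int) -> List[Link]:
    """Links a message crosses on a 1-D array of PEs, in hop order."""
    step = 1 if dst > src else -1
    return [(pe, pe + step) for pe in range(src, dst, step)]


def schedule_offline(messages: List[Tuple[int, int]]) -> List[int]:
    """Greedy static schedule computed before runtime.

    Each message gets the earliest start cycle at which every link on its
    path is free in the cycle the message would use it (hop k uses its
    link at cycle start + k). Returns one start cycle per message.
    """
    busy: Set[Tuple[Link, int]] = set()  # (link, cycle) slots already taken
    starts: List[int] = []
    for src, dst in messages:
        hops = path(src, dst)
        t = 0
        while any((link, t + k) in busy for k, link in enumerate(hops)):
            t += 1  # conflict on some link: try the next start cycle
        busy.update((link, t + k) for k, link in enumerate(hops))
        starts.append(t)
    return starts


if __name__ == "__main__":
    # Two messages leave PE 0, so the second must wait one cycle.
    print(schedule_offline([(0, 3), (0, 2), (1, 3)]))
```

Because every link slot is fixed ahead of time, no runtime arbitration is needed. A packet-switched network would instead resolve the same conflicts dynamically, which costs hardware but avoids reserving slots for links that turn out to be inactive.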
Accelerating SPICE Model-Evaluation using FPGAs
Single-FPGA spatial implementations can provide
an order of magnitude speedup over sequential microprocessor
implementations for data-parallel, floating-point computation in
SPICE model-evaluation. Model-evaluation is a key component
of the SPICE circuit simulator and it is characterized by
large irregular floating-point compute graphs. We show how to
exploit the parallelism available in these graphs on single-FPGA
designs with a low-overhead VLIW-scheduled architecture. Our
architecture uses spatial floating-point operators coupled to local
high-bandwidth memories and interconnected by a time-shared
network. We retime operation inputs in the model-evaluation to
allow independent scheduling of computation and communication.
With this approach, we demonstrate speedups of 2–18×
over a dual-core 3GHz Intel Xeon 5160 when using a Xilinx
Virtex 5 LX330T for a variety of SPICE device models.
A time-multiplexed FPGA overlay with linear interconnect
Coarse-grained overlays improve FPGA design productivity by providing fast compilation and software-like programmability. Soft-processor-based overlays with well-defined ISAs are attractive to application developers due to their ease of use. However, these overlays have significant FPGA resource overheads. Time-multiplexed (TM) CGRA-like overlays represent an interesting alternative, as they are able to change their behavior on a cycle-by-cycle basis while the compute kernel executes. This reduces the FPGA resources needed, but at the cost of a higher initiation interval (II) and hence reduced throughput.
The fully flexible routing network of current CGRA-like overlays results in high FPGA resource usage. However, many application kernels are acyclic and can be implemented using a much simpler linear feed-forward routing network. This paper examines a DSP-block-based TM overlay with linear interconnect, where the overlay architecture takes account of the application kernels' characteristics and the underlying FPGA architecture so as to minimize the II and the FPGA resource usage. We examine a number of architectural extensions to the DSP-block-based functional unit to improve the II, throughput and latency. The results show an average 70% reduction in II, with corresponding improvements in throughput and latency.
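
The link between II and throughput mentioned above can be made concrete with a small sketch. It assumes the usual steady-state model in which a kernel accepts one new input set every II cycles; the clock frequency and II values are hypothetical and are not taken from the paper.

```python
def throughput(clock_hz: float, ii: int) -> float:
    """Kernel iterations started per second when a new one begins every `ii` cycles."""
    if ii < 1:
        raise ValueError("initiation interval must be at least 1 cycle")
    return clock_hz / ii


if __name__ == "__main__":
    clock_hz = 300e6          # hypothetical overlay clock
    ii_before, ii_after = 10, 3  # hypothetical: a 70% reduction in II
    gain = throughput(clock_hz, ii_after) / throughput(clock_hz, ii_before)
    print(f"throughput gain: {gain:.2f}x")
```

Under this model, a 70% reduction in II for a single kernel corresponds to roughly a 3.3× throughput gain at the same clock. The figure reported in the abstract is an average across kernels, so per-kernel gains would differ.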
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201