Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors
Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3–182× for a Xilinx Virtex-5 LX330T, 1.3–33× for an IBM Cell, and 3–131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
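As a rough illustration of the tuner's search loop, here is a minimal sketch in Python, assuming a hypothetical timing harness and kernel name; the real tuner drives per-architecture code generation rather than timing a Python function:

```python
import itertools
import time

def model_eval_kernel(unroll=1, vector_length=1):
    """Stand-in for generated device-model code; the real kernels are
    compiled per architecture (OpenMP, PThreads, CUDA, VLIW)."""
    step = max(1, unroll * vector_length)
    acc = 0.0
    for i in range(0, 1 << 14, step):
        acc += i * 1e-9
    return acc

def benchmark(kernel, config, trials=5):
    """Best-of-N timing of one candidate configuration."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        kernel(**config)
        best = min(best, time.perf_counter() - start)
    return best

def autotune(kernel, space):
    """Exhaustively sweep the cross-product of the configuration axes."""
    best_cfg, best_t = None, float("inf")
    for values in itertools.product(*space.values()):
        cfg = dict(zip(space.keys(), values))
        t = benchmark(kernel, cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

# Axes mirroring the paper's examples (unroll factor, vector length).
space = {"unroll": [1, 2, 4, 8], "vector_length": [1, 4, 16, 64]}
print(autotune(model_eval_kernel, space))
```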
Optimistic Parallelization of Floating-Point Accumulation
Floating-point arithmetic is notoriously non-associative due to its limited-precision representation, which demands that intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device, where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16-PE design, we demonstrate an average speedup of 6× with randomly generated data and 3–7× with summations extracted from Conjugate Gradient benchmarks.
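A software sketch of the optimistic scheme, using the standard two_sum error-free transformation as a stand-in for the paper's hardware error-propagation network (the lane structure and fixed-point termination test are illustrative assumptions):

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): s + e == a + b exactly."""
    s = a + b
    bp = s - a
    e = (a - (s - bp)) + (b - bp)
    return s, e

def optimistic_sum(values, lanes=16):
    """Optimistic parallel accumulation: sum in independent lanes (the
    concurrent adders in hardware), capture every addition's rounding
    error, and re-inject the residual errors in further passes until
    they can no longer change the result."""
    total, residuals = 0.0, list(values)
    while residuals:
        partials = [0.0] * lanes
        nxt = []
        for i, v in enumerate(residuals):       # lanes operate concurrently
            partials[i % lanes], e = two_sum(partials[i % lanes], v)
            if e:
                nxt.append(e)
        old = total
        for p in partials:                      # fold lane partials together
            total, e = two_sum(total, p)
            if e:
                nxt.append(e)
        if total == old and nxt == residuals:   # fixed point: remaining errors
            break                               # are too small to perturb sum
        residuals = nxt
    return total

import math, random
data = [random.uniform(-1, 1) * 10 ** random.randint(0, 12) for _ in range(1000)]
print(optimistic_sum(data), math.fsum(data))
```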
SPICE²: A Spatial, Parallel Architecture for Accelerating the SPICE Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (2.8X mean) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in a streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits the parallelism available in the SPICE circuit simulator. The design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4–23X) across a range of non-linear device models and Matrix-Solve by 2.4X (0.6–13X) across various benchmark matrices, while delivering a mean combined speedup of 2.8X (0.2–11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate single-precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g. multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
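The three-phase decomposition above follows the classic SPICE Newton-Raphson structure. A schematic sketch of one such iteration, phase by phase; the device class, circuit, and function names are illustrative assumptions, not the thesis framework's API:

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

class Resistor:
    def __init__(self, a, b, r): self.a, self.b, self.g = a, b, 1.0 / r
    def evaluate(self, v):       # Phase 1: local linearization, data-parallel
        return self.a, self.b, self.g

def solve_dc(devices, injections, n):
    v = np.zeros(n + 1)                       # node 0 is ground
    while True:
        # Phase 1: Model-Evaluation -- every device evaluated independently
        stamps = [d.evaluate(v) for d in devices]
        # Phase 2: Sparse Matrix-Solve -- assemble and solve G v = i
        G, i = lil_matrix((n, n)), np.array(injections, float)
        for a, b, g in stamps:
            for x, y, s in ((a, a, g), (b, b, g), (a, b, -g), (b, a, -g)):
                if x and y:                   # skip ground-node stamps
                    G[x - 1, y - 1] += s
        v_new = np.concatenate(([0.0], spsolve(G.tocsr(), i)))
        # Phase 3: Iteration Control -- convergence / timestep decisions
        if np.max(np.abs(v_new - v)) < 1e-9:
            return v_new
        v = v_new

# 1 A into node 1; R1 = 1k from node 1 to 2; R2 = 2k from node 2 to ground
print(solve_dc([Resistor(1, 2, 1e3), Resistor(2, 0, 2e3)], [1.0, 0.0], 2))
```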
Accelerating SPICE Model-Evaluation using FPGAs
Single-FPGA spatial implementations can provide an order of magnitude speedup over sequential microprocessor implementations for data-parallel, floating-point computation in SPICE model-evaluation. Model-evaluation is a key component of the SPICE circuit simulator, and it is characterized by large, irregular floating-point compute graphs. We show how to exploit the parallelism available in these graphs on single-FPGA designs with a low-overhead VLIW-scheduled architecture. Our architecture uses spatial floating-point operators coupled to local high-bandwidth memories and interconnected by a time-shared network. We retime operation inputs in the model-evaluation to allow independent scheduling of computation and communication. With this approach, we demonstrate speedups of 2–18× over a dual-core 3 GHz Intel Xeon 5160 when using a Xilinx Virtex-5 LX330T for a variety of SPICE device models.
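A toy list scheduler conveys the flavor of static VLIW scheduling of such compute graphs; the greedy policy, unit issue rate, and latency table below are simplifying assumptions rather than the paper's scheduler:

```python
def vliw_schedule(graph, latency, num_fus):
    """Greedy list-scheduling of a dataflow DAG onto pipelined FUs.
    graph: {op: [predecessor ops]}; each FU issues one op per cycle."""
    ready_time = {}                      # cycle when each op's result is ready
    schedule = {}                        # op -> (issue cycle, functional unit)
    fu_busy = [0] * num_fus              # next free issue slot per FU
    remaining = dict(graph)
    while remaining:
        # schedule every op whose predecessors are already scheduled
        for op in sorted(op for op, preds in remaining.items()
                         if all(p in ready_time for p in preds)):
            earliest = max([ready_time[p] for p in remaining[op]], default=0)
            fu = min(range(num_fus), key=lambda f: max(fu_busy[f], earliest))
            t = max(fu_busy[fu], earliest)
            schedule[op] = (t, fu)
            fu_busy[fu] = t + 1          # pipelined: one issue per cycle
            ready_time[op] = t + latency[op]
            del remaining[op]
    return schedule

# toy graph for d = (a*b) + c, with deep pipelined FP operator latencies
graph = {"a": [], "b": [], "c": [], "mul": ["a", "b"], "add": ["mul", "c"]}
latency = {"a": 1, "b": 1, "c": 1, "mul": 8, "add": 11}
print(vliw_schedule(graph, latency, num_fus=2))
```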
Pipelining Saturated Accumulation
Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
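The key algebraic observation can be sketched in a few lines: a saturating accumulation step is a clamped add, clamped adds are closed under function composition, and composition is associative, so a parallel-prefix tree can evaluate them. This software sketch (with illustrative names LO, HI, lift, compose) shows the algebra, not the paper's hardware mapping:

```python
import random

LO, HI = -(1 << 15), (1 << 15) - 1       # 16-bit saturation bounds

def clamp(x, lo=LO, hi=HI):
    return min(max(x, lo), hi)

def lift(v):
    """Represent f(x) = clamp(x + v, LO, HI) by the triple (v, LO, HI)."""
    return (v, LO, HI)

def compose(f, g):
    """h = g . f, i.e. h(x) = g(f(x)); the result is again a clamped add,
    so compose() is closed and (being function composition) associative."""
    a1, lo1, hi1 = f
    a2, lo2, hi2 = g
    return (a1 + a2,
            clamp(lo1 + a2, lo2, hi2),
            clamp(hi1 + a2, lo2, hi2))

def apply(f, x):
    a, lo, hi = f
    return clamp(x + a, lo, hi)

# Check the reformulation against a sequential saturated accumulator.
vals = [random.randint(-40000, 40000) for _ in range(100)]
acc, fn = 0, lift(0)
for v in vals:
    acc = clamp(acc + v)                 # reference: cyclic dependency
    fn = compose(fn, lift(v))            # associative: prefix-tree friendly
    assert apply(fn, 0) == acc
```

Because compose() is associative, the running prefixes can be computed by any parallel-prefix network (e.g., Brent-Kung) in logarithmic depth instead of the sequential chain.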
Worst Case Latency Analysis for Hoplite FPGA-based NoC
Overlay NoCs, such as Hoplite, are cheap to implement on an FPGA but provide no bounds on worst-case routing latency of packets traversing the NoC due to deflection routing. In this paper, we show how to adapt Hoplite to enable calculation of precise upper bounds on routing latency by modifying the routing function to prioritize deflections, and by regulating the injection of packets to meet certain throughput and burstiness constraints. We provide an
analytical model for computing end-to-end latency in the form of (1) in-flight time in the network , and (2) waiting time at the source node . To bound in-flight time in an NoC, we modify the routing function
and switching crossbar richness in the Hoplite router to deliver where and are differences of the source and destination address co-ordinates of the
packet. To bound the waiting time at the source, we add a Token Bucket regulator with rate and burstiness for each flow node to deliver : T^s =\lceil\frac{\sigma(\Gamma^C_f){1-\rho(\Gamma^C_f)} \rceil which depends on the regulator period , burstiness and the rate of all interfering flows . A 64b implementation of our HopliteRT routerrequires 4\% fewer LUTs, and similar number of FFs compared to the original Hoplite router. We also need two small counters at each client port for regulating injection. We evaluate our model and RTL implementation across synthetic traffic patterns and observe behavior that conforms with the analytical bounds
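A behavioral sketch of the injection regulator; the TokenBucket class and its floating-point bookkeeping are a software stand-in, whereas HopliteRT realizes the same policy with two small counters per client port:

```python
class TokenBucket:
    """Token-bucket regulator: a flow with rate rho (tokens/cycle) and
    burstiness sigma (bucket depth) may inject only while holding a token."""
    def __init__(self, rho, sigma):
        self.rho, self.sigma = rho, sigma
        self.tokens = sigma              # start with a full bucket

    def tick(self):
        """Accrue credit each cycle, capped at the bucket depth."""
        self.tokens = min(self.tokens + self.rho, self.sigma)

    def try_inject(self):
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                  # packet may enter the NoC
        return False                     # packet waits in the source queue

# e.g. a flow limited to one packet every 10 cycles, bursts of up to 3
reg = TokenBucket(rho=0.1, sigma=3)
injected = 0
for cycle in range(100):
    reg.tick()
    if reg.try_inject():
        injected += 1
assert injected <= 3 + 0.1 * 100         # classic sigma + rho*t arrival bound
```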
Saliency on a chip: a digital approach with an FPGA
Selective-visual-attention algorithms have been successfully implemented in analog VLSI circuits [1]. However, in addition to the usual issues of analog VLSI, such as the need to fine-tune a large number of biases, these implementations lack the spatial resolution and pre-processing capabilities to be truly useful for image-processing applications. Here we take an alternative approach and implement a neuro-mimetic algorithm for selective visual attention in digital hardware.
HopliteRT Source Queuing Bound Correction
We present a correction to the analytical source queuing bound for HopliteRT [1], [2], which addresses the counter-example put forward in Section IV-D of [3] by taking into account the effect of the in-flight jitter suffered by data flits. We reproduce the evaluation experiments from [1], [2] related to source queuing with this corrected approach, observing bounds that are 1.2X to 1.7X larger than those originally reported.
Packet Switched vs. Time Multiplexed FPGA Overlay Networks
Dedicated, spatially configured FPGA interconnect is efficient for applications that require high-throughput connections between processing elements (PEs) but with a limited degree of PE interconnectivity (e.g. wiring up gates and datapaths). Applications which virtualize PEs may require a large number of distinct PE-to-PE connections (e.g. using one PE to simulate 100s of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PE's operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources for all possible connections is costly and inefficient. Alternatively, we can time-share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g. area, route latency, route quality) between time-multiplexed and packet-switched networks overlaid on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166 MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. At equivalent area, packet-switching is up to 2× faster for small-area designs while time-multiplexing is up to 5× faster for larger-area designs. When limited to the capacity of an XC2V6000, if all communication is known, time-multiplexed routing outperforms packet-switching; however, when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time-multiplexing. A sketch contrasting the two link-sharing disciplines follows below.
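The contrast can be made concrete in a few lines: a time-multiplexed link follows a fixed offline schedule, while a packet-switched link arbitrates per cycle. Both models below are deliberately simplified abstractions of the overlay routers, with illustrative names:

```python
from collections import deque

def tdm_link(schedule, traffic, cycles):
    """Time-multiplexed: an offline schedule statically assigns each slot of
    the shared link to one logical connection; no runtime arbitration."""
    delivered = []
    for t in range(cycles):
        conn = schedule[t % len(schedule)]   # fixed slot -> connection map
        if traffic[conn]:
            delivered.append((t, conn, traffic[conn].popleft()))
    return delivered

def packet_link(traffic, cycles):
    """Packet-switched: connections contend at runtime; a round-robin
    arbiter grants the link to one waiting packet per cycle."""
    delivered, rr = [], 0
    conns = list(traffic)
    for t in range(cycles):
        for i in range(len(conns)):
            c = conns[(rr + i) % len(conns)]
            if traffic[c]:
                delivered.append((t, c, traffic[c].popleft()))
                rr = (rr + i + 1) % len(conns)
                break
    return delivered

def fresh():
    return {c: deque(range(3)) for c in "AB"}

print(tdm_link(["A", "B"], fresh(), 8))      # deterministic slot ownership
print(packet_link(fresh(), 8))               # dynamic, queue-driven sharing
```

The schedule table is what the offline router computes ahead of time; the arbiter loop is what the packet-switched router resolves in hardware every cycle, which is why the two occupy different area/latency points in the study above.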