Abstract: We present a scalable, run-time configurable and programmable signal processing architecture for real-time applications that covers a wide performance spectrum. Our approach goes beyond conventional special-purpose signal processing engines. Scalability has multiple dimensions: the core level and the network level. We base our novel architecture on programmable components which can be re-combined and re-configured at run-time to match application-specific requirements for signal processing tasks. Users of the RIVER architecture can use our pre-synthesized cores to avoid HDL-coding and lengthy FPGA translation. For evaluation we have mapped computational- and memory-intensive kernels to the RIVER architecture and achieved 250 GMAC/s, which is significantly (1.6-2x) more than many high-end DSPs provide.
I. INTRODUCTION AND RELATED WORK
In embedded systems data must usually be processed in real-time. A wireless base station, for example, must control power levels of radio transmissions and handle the high data rates of 3G/4G cellular networks. Delays and data loss would lead to deterioration of service quality.
FPGAs are capable of processing multiple data-streams in parallel and at high data rates. Next generation FPGAs will have hundreds of high-speed transceivers built-in. The combined duplex data rate of all transceivers in a Virtex-7 FPGA device is 324 GB/s for example.
Since FPGAs have become more pervasive in embedded systems it is desirable to make their capabilities accessible to a wider range of users. Our RIVER architecture allows users to perform demanding signal processing tasks without writing HDL-code. Users can, of course, write HDL-code if they wish to customize our baseline architecture. Otherwise it is sufficient to specify how many cores they want to instantiate and how they should be configured and connected.
Our novel contribution is a scalable (core and network), run-time configurable and programmable signal processing architecture which covers a wide performance spectrum.
In the following paragraphs we introduce related work. GPUs provide abundant processing power for data parallel tasks [1]. However, GPUs are currently more difficult to integrate into embedded systems than FPGAs. This may be due to difficult driver support and high power consumption. GPU PCIe-cards are limited to a maximum I/O-bandwidth of 8 GB/s (PCIe gen2), whereas the previously mentioned Virtex-7 device will achieve more than 300 GB/s in continuous streaming mode. In [2] GPUs were compared against FPGAs for video processing. The authors concluded that FPGAs have not been made redundant in the field of video processing. They found that GPUs do not provide sufficient throughput for applications with high memory usage. In [3] FPGAs were found to be more power efficient than GPUs. Nevertheless, in terms of floating-point peak-performance GPUs perform best.
Our architecture in comparison operates in real-time on continuous data streams on a cycle-by-cycle basis. The data streams may originate from sources inside and outside of FPGAs such as high-speed image sensors.
A number of FPGA convolvers have been published. Recent examples are [4, 5]. These are single-purpose architectures. Furthermore, they cannot handle border pixels of images; border handling in particular requires a sophisticated memory architecture. The reported clock speeds in both papers are lower, albeit for different FPGA devices. In the following section we introduce our architecture.
II. ARCHITECTURE
Our signal processing cores are called "Dynamic Streaming Engines (DSE)". DSEs are run-time programmable, configurable and scalable by design. Figure 1 (b) shows a sample DSE architecture and a break-down into its components. From the left to the right we see: stream input ports, data taggers, a programmable crossbar, computational lanes, a configurable crossbar, a reduction stage and stream output ports.
Multiple DSEs can be connected to scale the system. Individual DSEs can scale up and down by adapting the number of stream input and output ports, the number of computational lanes, the complexity and number of compute lane instructions, the complexity of the reduction stage, the number of programmable buffers within computational lanes and the width of fixed point numbers. Thus scalability is an intrinsic architecture feature. A "fat" DSE, for example, can support several small to medium sized convolutional filters (3x3-5x5) or a single but very large filter (7x7).
Occasionally a ("fat") DSE may be underutilized or even unused for some time. Since our DSEs are compositions of cleanly separated functional blocks it is relatively simple to apply fine grained power management. Computational lanes are, for example, good candidates for power- and clock-gating. Within active lanes it is also feasible to switch off individual buffers. Lanes with MAC-pipelines could implement data path width adaption or pre-computations [6].
Figure 1: Our Bluespec DSE implementation is highly configurable at design and run-time. Design time parameters are the number of stream input/output ports, the number of computational lanes, the instructions implemented by computational lanes, the type of reduction stage and width of numerical computations. In this paper computational lanes contain one or more MAC-pipelines suitable for signal processing. At run-time algorithms for signal processing, for example, can be mapped to one or more DSEs. Data flow and computations are orchestrated by micro-programming and configuring registers. Data-taggers (a) and lane-masters (e) control synchronization and execution between and inside computational lanes. Programmable buffers (g) are flexible two-ported queues. The lane-master can choose among several queue buffer operations in each cycle as shown in (h). Data can be enqueued and dequeued at both ends or bypassed. An optional look-up table may be used for storing interpolation coefficients for example.
Our architecture is able to control data flow on multiple levels. First, the programmable crossbar maps incoming data to one or more computational lanes. Second, the configurable crossbar maps intermediate results from computational lanes to reduction stage inputs. Within the computational lanes programmable buffers control the direction and buffering of data flows (see Figure 1g). We achieve programmability through programmable sequencers. Sequencers have program counters, jump logic, instruction memory and support arbitrarily nested loops. Their VLIW-instructions contain control signals for other hardware components.
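The behaviour of such a sequencer can be illustrated with a small software model. This is only a sketch under stated assumptions: the instruction format (`Instr`), the opcode names (`NEXT`, `LOOP`, `END`, `HALT`) and the function `run_sequencer` are hypothetical and do not reflect the actual DSE instruction encoding. The model shows a program counter, a loop stack for nested loops, and one control word emitted per executed instruction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instr:
    control: str      # stands in for the VLIW control signals (hypothetical)
    op: str = "NEXT"  # NEXT, LOOP (open a loop), END (close a loop), HALT
    count: int = 0    # iteration count for LOOP; the body runs at least once


def run_sequencer(program: List[Instr], max_cycles: int = 1000) -> List[str]:
    """Execute the program and return the control word issued in each cycle."""
    pc = 0
    loop_stack = []  # each entry: [pc of first body instruction, iterations left]
    trace = []
    for _ in range(max_cycles):
        if pc >= len(program):
            break
        instr = program[pc]
        trace.append(instr.control)
        if instr.op == "HALT":
            break
        if instr.op == "LOOP":
            loop_stack.append([pc + 1, instr.count])
            pc += 1
        elif instr.op == "END":
            loop_stack[-1][1] -= 1
            if loop_stack[-1][1] > 0:
                pc = loop_stack[-1][0]  # jump back to the loop body
            else:
                loop_stack.pop()        # loop finished, fall through
                pc += 1
        else:
            pc += 1
    return trace


# Two nested loops: 2 outer iterations, each with 3 inner "mac" cycles.
program = [
    Instr("setup", "LOOP", 2),
    Instr("row", "LOOP", 3),
    Instr("mac", "END"),
    Instr("flush", "END"),
    Instr("done", "HALT"),
]
trace = run_sequencer(program)
```

In this model the loop stack is what permits arbitrary nesting: every `LOOP` pushes an entry and every `END` either jumps back or pops it.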
Both data taggers and lane masters (Figure 1a,e) are sequencers and enable cycle-accurate program control. Data taggers distribute and synchronize data flows entering through the programmable crossbar. Multiple stream input ports may, for example, be logically grouped together to ensure lockstepped data flows into computational lanes if required by algorithms. Furthermore, data taggers are multicast capable. They allow data values to be mapped to several computational lanes.
Lane masters control programmable buffers (Figure 1g,h). Programmable buffers are flexible 2-ported queues. Data may be enqueued and dequeued from arbitrary ports (I-II). Additionally, it is possible to bypass data (III). It is even possible to use programmable buffers as look-up tables which are indexed by incoming data. This feature is, for example, useful for storing interpolation coefficients. Thus programmable buffers are a powerful means for realizing non-trivial data flows. A good example of a non-trivial data flow is border pixel reflection in image filtering applications.
The reduction stage is shown in Figure 1 (c) and (d). In this example the reduction stage is an adder tree. In our implementation we realized a full adder tree suitable for 2D-convolution. Depending on the configuration we can have partial or full summation of inputs. The adder tree can delay data values by one or more cycles if necessary for data flow control. The logarithmic structure allows us to incrementally reduce, or more specifically accumulate, data as it passes through the pipeline. If we had buffered up all values before reducing them, we would require bigger buffers.
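The buffer operations and the tree-shaped reduction described above can be mimicked in software. The following is a minimal sketch, not the Bluespec implementation: the class `ProgrammableBuffer`, the end names `"front"` and `"back"`, and the `stages` parameter of `adder_tree` are assumptions made for illustration. Timing, pipelining and the per-cycle selection by the lane master are not modelled.

```python
from collections import deque
from typing import Dict, List, Optional


class ProgrammableBuffer:
    """Software stand-in for a two-ported queue with bypass and look-up modes."""

    def __init__(self, lut: Optional[Dict[int, int]] = None) -> None:
        self._queue: deque = deque()
        self._lut = dict(lut or {})  # optional look-up table contents

    def enqueue(self, end: str, value: int) -> None:
        # Data may enter at either end of the queue.
        if end == "front":
            self._queue.appendleft(value)
        else:
            self._queue.append(value)

    def dequeue(self, end: str) -> int:
        # Data may leave at either end of the queue.
        return self._queue.popleft() if end == "front" else self._queue.pop()

    @staticmethod
    def bypass(value: int) -> int:
        # Bypass mode: the value is forwarded without being stored.
        return value

    def lookup(self, index: int) -> int:
        # Look-up mode: incoming data is used as an index into the table.
        return self._lut[index]


def adder_tree(values: List[int], stages: Optional[int] = None) -> List[int]:
    """Pairwise reduction; limiting `stages` yields partial sums."""
    level = list(values)
    done = 0
    while len(level) > 1 and (stages is None or done < stages):
        if len(level) % 2:
            level.append(0)  # pad odd levels with a neutral element
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        done += 1
    return level


buf = ProgrammableBuffer(lut={0: 3, 1: 5})
for pixel in (10, 20, 30):
    buf.enqueue("back", pixel)
oldest = buf.dequeue("front")   # first value that was enqueued
newest = buf.dequeue("back")    # last value that was enqueued
coefficient = buf.lookup(1)

full_sum = adder_tree([1, 2, 3, 4, 5, 6, 7, 8])
partial_sums = adder_tree([1, 2, 3, 4, 5, 6, 7, 8], stages=2)
```

Reading values back from the end at which they were written reverses their order, which is the basic ingredient for mirroring pixels at an image border.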
In the following section we show how 2D-convolution filters can be mapped to a DSE.
III. EXAMPLE: MAPPING 2D-CONVOLUTION TO DSES
We can realize separable, non-separable and parallel filter mappings. Furthermore, we support irregular memory accesses through programmable buffers. In the following sections we explain how to map a 2D-convolution filter to a single-core DSE and explore the design space of our architecture for different filter parameters.
Before we start, we would like to introduce 2D-convolution. A 2D-convolution is a weighted sum over (2p + 1) × (2q + 1) input values.
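One common way to write this weighted sum is given below. The symbol names are assumptions chosen for illustration: x denotes the input image, w the (2p + 1) × (2q + 1) matrix of filter coefficients, and y the filtered output.

```latex
y(i, j) = \sum_{u=-p}^{p} \sum_{v=-q}^{q} w(u, v)\, x(i - u,\, j - v)
```

For a square n × n filter with n = 2p + 1 = 2q + 1, each output value therefore takes n^2 multiplications and additions.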
Hence 2D-convolutions require many multiply-accumulate (MAC) operations over many inputs and scale quadratically for n × n filters. Some filters are separable. This means that they can be expressed as two 1-dimensional convolutions, one horizontal and one vertical. For didactic reasons we show in the following section how separable filters map to our architecture. Separable filters require fewer computations per cycle and are therefore easier to visualize. If desired our DSEs can also be configured for non-separable and parallel filters (parallel filters can process multiple pixels per cycle).
A. Mapping separable filters
Figure 2 illustrates how separable filters are mapped to two computational lanes with one MAC-pipeline each. It is possible to use two DSEs or one DSE with loop-back and two MAC-pipelines. Compared to the direct implementation in Figure 2 (a), DSEs gain flexibility and programmability at the cost of area, but not necessarily of frequency. The following section shows how DSEs scale.
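The property that a separable filter can be computed as two 1-dimensional passes can be checked with a short software sketch. The function names, the test image and the coefficient vectors `g` and `h` below are assumptions for illustration, and only the interior of the image is computed, so border handling is left out. The sketch compares a direct 2D weighted sum against a horizontal pass followed by a vertical pass, which corresponds to the two lanes with one MAC-pipeline each.

```python
from typing import List

Matrix = List[List[int]]


def conv2d_direct(x: Matrix, w: Matrix) -> Matrix:
    """Direct 2D convolution over the interior of x (no border handling)."""
    p, q = len(w) // 2, len(w[0]) // 2
    return [[sum(w[u + p][v + q] * x[i - u][j - v]
                 for u in range(-p, p + 1)
                 for v in range(-q, q + 1))
             for j in range(q, len(x[0]) - q)]
            for i in range(p, len(x) - p)]


def conv1d_rows(x: Matrix, h: List[int]) -> Matrix:
    """Horizontal 1D pass: filter every row with coefficients h."""
    q = len(h) // 2
    return [[sum(h[v + q] * row[j - v] for v in range(-q, q + 1))
             for j in range(q, len(row) - q)]
            for row in x]


def conv1d_cols(x: Matrix, g: List[int]) -> Matrix:
    """Vertical 1D pass: filter every column with coefficients g."""
    p = len(g) // 2
    return [[sum(g[u + p] * x[i - u][j] for u in range(-p, p + 1))
             for j in range(len(x[0]))]
            for i in range(p, len(x) - p)]


def conv2d_separable(x: Matrix, g: List[int], h: List[int]) -> Matrix:
    """Two chained 1D passes, equivalent to the kernel w[u][v] = g[u] * h[v]."""
    return conv1d_cols(conv1d_rows(x, h), g)


# Example: a 3x3 Sobel-like kernel built from two 3-tap vectors.
g = [1, 2, 1]
h = [1, 0, -1]
w = [[gu * hv for hv in h] for gu in g]
image = [[(7 * i + 3 * j * j) % 11 for j in range(6)] for i in range(5)]

direct = conv2d_direct(image, w)
separable = conv2d_separable(image, g, h)
```

In this sketch the direct form spends n^2 = 9 multiplications per output value, whereas the two passes together spend 2n = 6.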
B. Design space exploration / Scalability
Depending on the underlying technology our architecture has different performance characteristics. Before we present our FPGA implementation results we formalize the design space for our DSEs with MAC-pipeline lanes. An n × n filter mapping with k pixels per cycle requires:
#inputs, #tagger, #comp. lanes = k
#lanemaster, #outputs = k
#programmable buffers = n k
Obviously, not all DSE components scale linearly. This is partially due to the application-specific MAC-pipelines for 2D-convolution. The configurable crossbar in particular is area intensive. Currently our crossbars scale efficiently up to 32 × 32 for 18 bit inputs. There is still low-hanging fruit for further optimization though. The tipping point is technology dependent. ASIC design, for example, is much more suitable for implementing crossbars than FPGAs. Independent of technology, it is advisable to use multiple DSEs once individual DSEs provide diminishing returns.
We synthesized our Bluespec HDL implementation for the Virtex-7 FPGA, which is currently being sampled. All components of our DSE such as the MAC-pipeline, lane master and programmable buffers achieve frequencies in excess of 300 MHz.
A complete DSE for a 3x3 filter with 4 pixels/cycle, for example, currently runs at 240 MHz. The lower clock speed is due to delays between DSE components. We are confident that additional FIFOs and placement constraints will help to raise frequencies once we start optimizing our implementation.
In addition to small DSEs we have, for example, created an 8-core DSE system with 4 lanes and 8 MAC-pipelines each (see Figure 3). This "fat" DSE system has 1792 MACs and 8 MBit worth of buffers. A rough comparison with high-end processors yields the following results: the 4-core TI TMS320C6670 DSP provides 154 GMAC/s and the Tensilica ConnX BBE64 128 GMAC/s. The 8-core DSE is 1.6-2x faster in comparison. In the following section we conclude our paper.
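The quoted figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this paper (1792 MACs, 250 GMAC/s, 154 GMAC/s, 128 GMAC/s). Two points are assumptions rather than reported results: that the 1792 MACs are spread evenly over 8 cores × 4 lanes × 8 MAC-pipelines, and that every MAC completes one operation per clock cycle, from which an implied clock rate follows.

```python
# Figures taken from the text.
TOTAL_MACS = 1792
DSE_GMACS = 250.0
TI_C6670_GMACS = 154.0
TENSILICA_BBE64_GMACS = 128.0

# Assumption: MACs are distributed evenly over cores, lanes and pipelines.
cores, lanes, pipelines_per_lane = 8, 4, 8
macs_per_pipeline = TOTAL_MACS // (cores * lanes * pipelines_per_lane)

# Assumption: one MAC operation per MAC unit per clock cycle, all units busy.
implied_clock_mhz = DSE_GMACS * 1e9 / TOTAL_MACS / 1e6

# Ratios behind the "1.6-2x" statement.
speedup_vs_ti = DSE_GMACS / TI_C6670_GMACS
speedup_vs_tensilica = DSE_GMACS / TENSILICA_BBE64_GMACS

print(f"MACs per pipeline:     {macs_per_pipeline}")
print(f"Implied clock (MHz):   {implied_clock_mhz:.1f}")
print(f"Speed-up vs TI DSP:    {speedup_vs_ti:.2f}x")
print(f"Speed-up vs Tensilica: {speedup_vs_tensilica:.2f}x")
```

Under these assumptions each MAC-pipeline holds 7 MACs, which would match one row of the 7x7 filter mentioned in Section II.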
IV. CONCLUSION
We introduced RIVER, a reconfigurable pre-synthesized streaming architecture for real-time signal processing on FPGAs. Current state-of-the-art signal processing cores for FPGAs are specifically tailored towards a single task and must usually be compiled from HDL-sources, whereas our pre-synthesized DSEs are instantly available, run-time reconfigurable and programmable on many levels. This feature becomes increasingly important since next generation FPGAs (Virtex-7) require tens of hours and gigabytes of memory to finish compilation.
Furthermore, we have mapped computational- and memory-intensive 2D-filter configurations to our DSEs and shown that we are able to scale over a wide performance range. Additionally, we accelerated application-specific tasks by integrating specialized hardware features such as MAC-pipelines and adder networks into computational lanes and reduction stages. As a result our architecture and framework are a flexible platform for real-time signal processing on FPGAs.
V. ACKNOWLEDGMENTS
We would like to thank Saar Drimer for making his FPGA design flow tools accessible through his website [7], which we have used as a basis for pre-synthesizing many different DSE configurations on our cluster. Additionally, we would like to thank Matthias Birk and Matthias Vogelgesang for reviewing our drafts and for their suggestions.
