for the Stratix III FPGA that can be scaled to different levels of performance and resource utilization. It has several configurable features that can be included or excluded to optimize the soft processor for a given application. Performance estimates of the soft vector processor using three embedded benchmark kernels show speedup of up to 16.6x over an idealized Nios II processor while using 10.9x the area.
Introduction
Performance of software applications is often determined by a few hotspots in the program. This is especially true for embedded media applications, which tend to have tight, computationally intensive inner loops [1]. However, current commercial FPGA-based soft-core processors such as MicroBlaze and Nios II provide only limited and non-scalable performance. The effectiveness of traditional architectural approaches to improving soft-core processor performance is limited by how well these architectures can be mapped onto the FPGA fabric. Techniques such as deep pipelining and wide-issue superscalar execution do not work well in FPGAs, usually because any potential performance gains they offer are negated by reduced clock speed or prohibited by the FPGA architecture itself. For example, it is difficult to implement the several write ports needed by a superscalar register file. Instead, a vector architecture can be used to accelerate applications that are rich in data parallelism.

Figure 1. Interaction between scalar and vector cores

number of parameters to define an application-specific instance of the processor. The configurability of soft vector processors in FPGAs gives significant flexibility to trade off performance and resource utilization. Resources can be further trimmed by removing unneeded processor features and instruction support. Table 1 lists some of the configurable parameters and features of our soft vector processor architecture. The final resource usage of a soft vector processor is very sensitive to many of these parameters. V4, V8, and V16 are three sample configurations of the processor. Our implementation is tailored to the Altera Stratix III FPGA architecture. The sizes of embedded memory blocks, functionality of the hard-wired DSP blocks, and mix of logic and other resources in the Stratix III family drove many of our design decisions.
Figures 1 and 2 illustrate our soft vector processor architecture. It consists of a scalar core, a vector processing unit, and a memory interface unit. The scalar core is the single-threaded version of the UTIIe [4], a 32-bit Nios II-compatible soft processor with a four-stage pipeline. The scalar core and vector unit share the same instruction memory and instruction fetch logic. Vector instructions are 32 bits wide and can be freely mixed with scalar instructions in the instruction stream. The scalar and vector units can execute different instructions concurrently, but coordinate on instructions that require both cores, such as instructions with both scalar and vector operands.
The vector processing unit is shown in detail in Figure 2. The vector unit is composed of a number of vector lanes, specified by the

No consistency checking against the scalar cache is performed. The address generator generates addresses for vector memory accesses and controls the memory alignment crossbar to align data to and from memory. The crossbar supports memory accesses at word, halfword, and byte granularity. The memory crossbar can align up to 16 data elements per cycle for unit-stride and constant-stride loads, and 4 elements per cycle for stores. Indexed-offset accesses execute at one data element per cycle. The vector operation bypass path allows the memory alignment crossbar to be used for vector manipulation instructions such as extracting part of a vector. The 128-bit memory system is intended to be connected to a large on-chip SRAM or an external DDR-SDRAM, both of which are well suited to burst reading and writing of long vectors.
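As a rough illustration of the crossbar rates quoted above, the following Python sketch computes a lower bound on the number of crossbar cycles needed to move a vector. It assumes the peak rate is sustained on every cycle and ignores memory latency, alignment penalties, and any dependence on element width; the function and table names are hypothetical and are not part of the processor.

```python
import math

# Peak crossbar rates quoted in the text, in data elements per cycle.
PEAK_ELEMENTS_PER_CYCLE = {
    "load": 16,    # unit-stride and constant-stride loads
    "store": 4,    # stores
    "indexed": 1,  # indexed-offset accesses
}


def min_transfer_cycles(num_elements, access_kind):
    """Lower bound on crossbar cycles, assuming the peak rate is sustained.

    This is an idealized estimate (an assumption of this sketch), not a
    timing model of the actual memory interface.
    """
    rate = PEAK_ELEMENTS_PER_CYCLE[access_kind]
    return math.ceil(num_elements / rate)
```

Under these assumptions, a 64-element vector needs at least 4 cycles to load, 16 cycles to store, and 64 cycles for an indexed access, which suggests why indexed accesses are best kept out of inner loops where possible.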
The soft vector processor adopts a vector instruction set similar to the VIRAM instruction set [6]. Condi

addressing, in which each vector lane supplies the address to access its own local memory. The memories can also be written to by the scalar processor through a broadcast operation that writes the same value to all local memories (possibly to different addresses). LMemW specifies the data width of the local memory and the maximum address width. If LMemW is less than VPUW, data to the local memory is truncated. These memories are useful for accelerating histogram or table-lookup operations such as AES encryption.
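The behaviour of the per-lane local memories can be sketched with a small functional model in Python. It is a simplified illustration, not the hardware: the class and function names are hypothetical, truncation is assumed to keep the low-order LMemW bits, and the histogram routine is one plausible way to use the memories (each lane counts into its private table, and the per-lane tables are summed at the end).

```python
class LaneLocalMemories:
    """Functional model of one private local memory per vector lane."""

    def __init__(self, num_lanes, depth, lmem_width):
        # Assumption: truncation keeps the low-order lmem_width bits.
        self.mask = (1 << lmem_width) - 1
        self.mem = [[0] * depth for _ in range(num_lanes)]

    def broadcast_write(self, addresses, value):
        """Scalar core writes one value to every lane, at per-lane addresses."""
        for lane, addr in enumerate(addresses):
            self.mem[lane][addr] = value & self.mask

    def indexed_read(self, addresses):
        """Each active lane reads its own memory at its own address."""
        return [self.mem[lane][addr] for lane, addr in enumerate(addresses)]

    def indexed_write(self, addresses, values):
        """Each active lane writes its own memory at its own address."""
        for lane, (addr, val) in enumerate(zip(addresses, values)):
            self.mem[lane][addr] = val & self.mask


def histogram(data, num_lanes, num_bins, lmem_width=16):
    """Count occurrences of each bin index using per-lane private tables."""
    mems = LaneLocalMemories(num_lanes, num_bins, lmem_width)
    for start in range(0, len(data), num_lanes):
        bins = data[start:start + num_lanes]  # one element per active lane
        counts = mems.indexed_read(bins)
        mems.indexed_write(bins, [c + 1 for c in counts])
    # Combine the private tables into the final histogram.
    return [sum(mems.mem[lane][b] for lane in range(num_lanes))
            for b in range(num_bins)]
```

Because every lane updates only its own memory, two lanes that hit the same bin in the same step do not conflict, which is what makes the private-table approach attractive for histogram and table-lookup kernels.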
The second extension uses the MAC feature of the Stratix III DSP blocks to efficiently implement the addition reduction operation (i.e., summing all elements in a vector), as shown in Figure 2. The vmac instruction multiply-accumulates 4 pairs of inputs from 4 vector lanes into each MAC unit. Furthermore, the cascade chain in the Stratix III DSP blocks allows partial accumulation results to be added in cascade across several accumulators, further accelerating the otherwise inefficient vector reduction operation. The vccacc instruction copies the result of the accumulate reduction to a vector register. The MAC feature is used, for example, in motion estimation to sum the absolute differences between pixels.
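The multiply-accumulate reduction can be illustrated with a short functional model in Python. This is a sketch under stated assumptions rather than a description of the hardware: it models only the arithmetic (no bit widths, pipelining, or timing), the helper names `vmac`, `cascade_sum`, and `sad` are hypothetical Python functions, and the sum of absolute differences is formed by multiplying each difference by one, which is one plausible way to map it onto a MAC unit.

```python
LANES_PER_MAC = 4  # the text states each MAC unit serves 4 vector lanes


def vmac(accumulators, a, b):
    """Multiply-accumulate one element per lane into the per-unit accumulators.

    a and b hold one operand per active lane; unit k accumulates the
    products from lanes 4k .. 4k+3.
    """
    for unit in range(len(accumulators)):
        lo = unit * LANES_PER_MAC
        hi = lo + LANES_PER_MAC
        accumulators[unit] += sum(x * y for x, y in zip(a[lo:hi], b[lo:hi]))


def cascade_sum(accumulators):
    """Add the partial sums one after another, as along a cascade chain."""
    total = 0
    for partial in accumulators:
        total += partial
    return total


def sad(block_a, block_b, num_lanes=8):
    """Sum of absolute differences between two equal-length pixel lists."""
    accumulators = [0] * (num_lanes // LANES_PER_MAC)
    ones = [1] * num_lanes
    for start in range(0, len(block_a), num_lanes):
        diffs = [abs(p - q) for p, q in zip(block_a[start:start + num_lanes],
                                            block_b[start:start + num_lanes])]
        vmac(accumulators, diffs, ones)  # accumulate |p - q| * 1
    return cascade_sum(accumulators)
```

In this model the expensive step, reducing a whole vector to one value, is split into many cheap per-unit accumulations followed by a single short chain of additions over the accumulators.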
Performance
Three representative data-parallel embedded applications were chosen to benchmark scalable vector processing: 5×5 median filtering, motion estimation, and AES encryption.
Conclusion
Many embedded applications today demand high-performance computing platforms that enable the programmer to easily exploit data-level parallelism. FPGAs are inherently parallel, but considerable design effort and hardware design knowledge are needed to design a custom parallel system in HDL. Current FPGA soft processors do not provide enough performance to meet these application requirements. A soft vector processor platform can address these deficiencies by providing a simple and familiar programming model for describing data-level parallelism. The FPGA-based soft vector processor proposed in this paper efficiently maps the vector architecture to a Stratix III FPGA. It leverages the configurability of FPGAs to allow the designer to trade off performance against resource usage, and to optimize the processor for the target application by generating an application-specific instance with only the needed features, all with zero hardware design.
