Current FPGA soft processor systems use dedicated hardware modules or accelerators to speed up data-parallel applications. This work explores an alternative approach of using a soft vector processor as a general-purpose accelerator. The approach has the benefits of a purely software-oriented development model, a fixed ISA allowing parallel software and hardware development, a single accelerator that can accelerate multiple applications, and scalable performance from the same source code. With no hardware design experience needed, a software programmer can make area-versus-performance trade-offs by scaling the number of functional units and register file bandwidth with a single parameter. A soft vector processor can be further customized by a number of secondary parameters to add or remove features for a specific application to optimize resource utilization. This article introduces VIPERS, a soft vector processor architecture that maps efficiently into an FPGA and provides a scalable amount of performance for a reasonable amount of area. Compared to a Nios II/s processor, instances of VIPERS with 32 processing lanes achieve up to 44× speedup using up to 26× the area.
INTRODUCTION
Soft-core processors such as MicroBlaze [Xilinx, Inc. 2008] help simplify control-intensive system design because software programming is far easier than hardware design. This shortens time-to-market and shrinks development costs. However, the amount of compute performance available from soft-core processors is strictly limited. For example, the highest-performing Nios II [Altera Corp. 2008c] soft processor uses a single-issue, in-order RISC pipeline. Many embedded applications such as those found in EEMBC [2008] have plenty of data-level parallelism available, but they are performance-limited by the processor design itself.
Three ways of accelerating FPGA-based applications with plenty of data-level parallelism are: (1) build a multiprocessor system and write parallel code, (2) build a dedicated hardware accelerator in the FPGA logic, or (3) improve the processor design to exploit more parallelism. The first approach requires worrying about the complexity of parallel system design, debugging, and coping with incoherent memory or deadlock. The second approach requires some level of hardware design experience, even with high-level tools like the Nios II C2H Compiler [Altera Corp. 2008b], which automatically compiles a C function into a hardware accelerator. The third approach is limited because traditional VLIW and superscalar architectures do not scale much beyond 4-way parallelism and do not map efficiently to FPGAs.
An ideal approach would combine the advantages of all these methods:
(1) have scalable performance and resource usage, (2) be simple to use, ideally requiring no hardware design effort, (3) separate hardware and software design flows early in the design, and (4) enable rapid development by avoiding synthesis, place-and-route iterations.
A soft vector processor, such as the one shown in Figure 1, addresses all of these requirements. A vector processor is a particularly good choice for applications with abundant data parallelism. These same applications are also frequently accelerated using custom-designed hardware.
This article introduces VIPERS, Vector ISA Processors for Embedded Reconfigurable Systems, as a solution to deliver scalable performance and resource usage through configurability for data-parallel applications. It provides a simple programming model that can be easily understood by software developers, and its application-independent architecture allows hardware and software development to be separated. A single vector unit can accelerate multiple applications. Also, unlike many C-to-hardware flows, modifications to an accelerated function simply require a software compile; there is no need to place and route again.
A soft vector processor from VIPERS can easily exploit the configurability of FPGAs by customizing the amount of parallelism, feature set, and instruction support needed by the application. Customization extends to a flexible memory interface, which can be tailored to the required data access granularity. This opens a rich design space to the user and allows greater area/performance trade-offs than current soft processor solutions, yet it is simple enough that software programmers can understand the trade-offs.
An earlier version of this work appeared as Yu et al. [2008]. This article represents a significant extension of that work, presenting more details on C2H, the vector execution model, vector memory interface design, vector pipeline organization, improved vector assembly code for all three benchmarks, and fully unrolled median filter and motion estimation vector code that is up to 5× faster. The performance model has been expanded and altered to portray the current vector processor implementation rather than a more ideal implementation. Performance results of the vector processor are now based on cycle-accurate Verilog simulations using ModelSim rather than the performance model. Also, two new tables demonstrate control over the amount of used resources by varying the soft processor parameters. Full vector ISA details are available in Yu [2008]. The processor, assembler, and benchmark code are available online.
BACKGROUND AND PREVIOUS WORK
Vector processing has been used in supercomputers for scientific tasks for over three decades. Modern supercomputing systems like the Earth Simulator [Habata et al. 2003] and Cray X1 are based on single-chip vector processors. The following sections give an overview of vector processing, and the current solutions for improving the performance of soft processors and accelerating FPGA-based applications according to the categories introduced in Section 1.
Vector Processing Overview
The vector processing model operates on vectors of data. Each vector operation specifies an identical operation on the individual data elements of the source vectors, producing an equal number of independent results. This also makes it a natural method to exploit data-level parallelism, which has the same properties. Vector processors use pipelining and multiple parallel datapaths (called vector lanes) to reduce execution time. For a more detailed description of classical vector architectures, please see Hwang and Briggs [1984].
2.1.1 Vector Instruction Set Architecture. A vector architecture contains a vector unit and a separate scalar unit. The scalar unit is needed to execute nonvectorizable portions of the code, and most control flow instructions. The scalar unit can also broadcast operands to all vector lanes, such as adding a constant to a vector.
The vector unit performs loop-like operations by applying a single operation to all vector elements. Vector instructions are controlled by a Vector Length (VL) register, which specifies the number of elements within the vector to operate on. This vector length register can be modified on a per-instruction basis. Applications are usually written to be aware of the Maximum VL (MVL) of the processor on which they run. Because a single vector instruction replaces many loop iterations, this saves significant overhead on tight loops: there is no need for increment and compare/branch instructions on each iteration.
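The interaction between VL and MVL can be illustrated with a small C emulation of strip-mining, in which a long array is processed in chunks of at most MVL elements. This is only a sketch: `vadd` and `stripmined_add` are hypothetical names that stand in for a single vector add instruction and the scalar loop around it, and they do not correspond to actual VIPERS instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates ONE vector add instruction operating on vl elements.
 * On a vector processor this inner loop is done by hardware. */
static void vadd(int *dst, const int *a, const int *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = a[i] + b[i];
}

/* Strip-mined c = a + b over n elements.
 * Returns the number of vector instructions "issued". */
size_t stripmined_add(int *c, const int *a, const int *b,
                      size_t n, size_t mvl)
{
    assert(mvl > 0);
    size_t issued = 0;
    for (size_t done = 0; done < n; ) {
        /* Set VL: a full strip, or the shorter remainder at the end. */
        size_t vl = (n - done < mvl) ? (n - done) : mvl;
        vadd(c + done, a + done, b + done, vl);
        done += vl;
        issued++;
    }
    return issued;
}
```

With n = 10 and MVL = 4, the outer loop runs three times (VL = 4, 4, 2), so the loop overhead is paid per strip rather than per element.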
Vector performance also relies upon good memory performance, so it is important to have a rich set of vector load and store instructions. Vector addressing modes can efficiently scatter or gather entire vectors of data to and from memory. The three primary vector addressing modes are: unit stride, constant stride, and indexed addressing. Unit stride accesses data elements in adjacent memory locations, constant stride accesses data elements in memory with a constant-size separation between elements, and indexed addressing accesses data elements by adding a variable offset for each element to a common base address. After a vector memory access, the address base register can also be auto-incremented by a constant amount.
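The three addressing modes can be sketched as C loops over an ordinary array. The function names are hypothetical, and strides and offsets are expressed in elements rather than bytes purely for readability; the actual units used by the hardware are not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unit stride: element i is read from base[i]. */
void load_unit_stride(int32_t *vreg, const int32_t *base, size_t vl)
{
    assert(vreg && base);
    for (size_t i = 0; i < vl; i++)
        vreg[i] = base[i];
}

/* Constant stride: element i is read from base[i * stride]. */
void load_const_stride(int32_t *vreg, const int32_t *base,
                       size_t stride, size_t vl)
{
    assert(vreg && base);
    for (size_t i = 0; i < vl; i++)
        vreg[i] = base[i * stride];
}

/* Indexed (gather): element i is read from base[offset[i]],
 * where each element supplies its own offset. */
void load_indexed(int32_t *vreg, const int32_t *base,
                  const uint32_t *offset, size_t vl)
{
    assert(vreg && base && offset);
    for (size_t i = 0; i < vl; i++)
        vreg[i] = base[offset[i]];
}
```

Unit stride is the special case of constant stride with a stride of one element; the indexed form is the most general and typically the most expensive.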
2.1.2 Example. Consider the 8-tap Finite Impulse Response (FIR) filter,
$$y[n] = \sum_{k=0}^{7} h[k]\,x[n-k].$$
This can be implemented in Nios II assembly language, where 65 instructions are executed per result.
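For reference, a scalar C version of an 8-tap FIR filter in its standard direct form might look as follows. This is a sketch, not the benchmark code: the names `fir8`, `x`, `h`, and `y` are assumptions, and input samples before `x[0]` are treated as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTAPS 8

/* Direct-form FIR: y[n] = sum over k of h[k] * x[n - k], k = 0..NTAPS-1.
 * Samples before x[0] are assumed to be zero. */
void fir8(int32_t *y, const int32_t *x, const int32_t *h, size_t len)
{
    assert(y && x && h);
    for (size_t n = 0; n < len; n++) {
        int32_t acc = 0;
        /* One multiply-accumulate per tap; this inner loop is the
         * per-result work that a scalar processor must serialize. */
        for (size_t k = 0; k < NTAPS && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}
```

Feeding a unit impulse through the filter returns the coefficients themselves, which is a convenient sanity check.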
The same FIR filter implemented for VIPERS is shown in Figure 2. Some initialization instructions are omitted for brevity. The vector code computes a single result by multiplying all coefficients with data samples using just the vmac/vcczacc instruction pair. To prepare for the next result, data samples are shifted by one vector element position. A total of 10 instructions (from the ".L5" label to the "bne r11, r6, .L5" branch) are executed per result. Scaling the FIR filter to 64 filter taps requires executing 458 Nios II instructions; the VIPERS approach requires executing the same 10 instructions with a vector length of 64 elements.
Most VIPERS instructions are based upon the VIRAM instruction set [Kozyrakis and Patterson 2003b]. However, the three capitalized instructions in Figure 2 are unique to VIPERS and replace a sequence of eight VIRAM instructions. These instructions utilize FPGA resources to speed up multiply-accumulate, vector reduction (summation), and adjacent-element shifting. These features are described in more detail in Section 3.4.
Related Vector Work
A number of single-chip and FPGA-based vector processors are described in the following sections. Closely related is vector-inspired SIMD processing, which is widely used in multimedia instruction extensions found in mainstream microprocessors.
2.2.1 SIMD Extensions. SIMD extensions such as Intel SSE [Thakkur and Huff 1999] and PowerPC AltiVec [Diefendorff et al. 2000] are oriented towards short, fixed-length vectors. An entire SIMD vector register is typically 128 bits long and can be divided into 8-, 16-, or 32-bit word lengths, allowing vector lengths to range from 4 to 16 elements. SIMD instructions operate on these short vectors, and each instruction typically executes in a single cycle. In general, SIMD extensions lack support for strided memory access patterns and more complex memory manipulation instructions; hence, they must devote many instructions to address transformation and data manipulation to support the instructions that do the actual computation [Talla and John 2001].
2.2.2 Single-Chip Vector Processors. Torrent T0 [Asanovic 1998] and VIRAM [Kozyrakis and Patterson 2003b] are single-chip vector microprocessors implemented as custom ASICs that support a complete vector architecture. The T0 vector unit attaches as a coprocessor to a MIPS scalar unit. The vector unit has 16 vector registers, a maximum vector length of 32, and 8 parallel vector lanes. VIRAM is implemented in 0.18µm technology and runs at 200 MHz. It has 32 registers, a 16-entry vector flag register file, and two ALUs replicated over 4 parallel vector lanes. These two vector processors share the most similarity in architecture to this work.
2.2.3 FPGA-Based Vector Processors. Vector processors have been implemented in FPGAs as an ASIC prototype [Casper et al. 2005] and as an application-specific solution [Hasan and Ziavras 2005]. Five additional implementations specifically designed for FPGAs are described.
The first, Cho et al. [2006], is designed for Xilinx Virtex-4 SX and operates at 169 MHz. It contains 16 processing lanes, each 16 bits wide, and 17 on-chip memory banks connected to a MicroBlaze processor through Fast Simplex Links (FSL). It is not clear how many vector registers are supported. Compared to MicroBlaze, speedups of 4-10× are demonstrated with four applications (FIR, IIR, matrix multiply, and 8×8 DCT). The processor implementation seems fairly complete.
The second, Jacob et al. [2006] , is a soft vector processor for biosequence applications. It contains an instruction controller that executes control flow instructions and broadcasts vector instructions to an array of 16-bit wide processing elements. Details are scarce, but it seems to have limited features and few instructions. Like this work, it also argues for a soft vector processor core.
The third, Yang et al. [2007], consists of two identical vector processors located on two Xilinx XC2V6000 FPGA chips. Each 70 MHz processor contains a simplified scalar core with 16 instructions. The vector part consists of 8 vector registers, 8 lanes (each containing a 32-bit floating-point unit), and supports a Maximum Vector Length (MVL) of 64. Eight vector instructions are supported: vector load/store, vector indexed load/store, and vector-vector and vector-scalar multiplication/addition. However, only matrix multiplication was demonstrated on the system.
The fourth, Chen et al. [2008], is a floating-point vector unit based on T0 that operates at 189 MHz on a Xilinx Virtex II Pro device. It has 16 vector registers of 32 bits, a vector length of 16, and 8 vector lanes. Three functional units are implemented: floating-point adder, floating-point multiplier, and vector memory unit that interfaces to a 256-bit memory bus. No control processor is included for non-floating-point or memory instructions, and it is unclear whether addressing modes other than unit-stride access are implemented.
The fifth, VESPA [Yiannacouras et al. 2008] , closely follows VIRAM but is optimized for FPGA implementation. It is based on a custom MIPS-compatible scalar core, supports fine-grained instruction subsetting, and uses on-chip cache and off-chip SDRAM memory. The VESPA implementation is very similar in area use and clock speed to VIPERS. In contrast, VIPERS is less strict about VIRAM compliance, so it offers double the vector registers and several new instructions to take advantage of FPGA resources.
Alternatives for Accelerating Data-Parallel Applications
Alternative methods for accelerating data-parallel applications are: multiprocessor systems, custom-designed and synthesized hardware accelerators, and other soft processor architectures. The following sections describe these alternatives in more detail.
2.3.1 Multiprocessor Systems. The parallelism in multiprocessor systems can be described as Multiple Instruction Multiple Data (MIMD) or Single Instruction Multiple Data (SIMD). Each processor in a MIMD system has its own instruction memory, executes its own instruction stream, and operates on different data. In contrast, a SIMD system has a single instruction stream that is shared by all processors. For example, the vector processor in this article is a type of SIMD system.
Although SIMD systems are almost always homogeneous, MIMD systems can be heterogeneous. In particular, FPGAs make it easy to create MIMD systems that can accelerate heterogeneous workloads (e.g., multiprogramming) that otherwise do not parallelize easily. Specialized systems with unique architectures can be designed to exploit the characteristics of a particular application for better performance. For example, the IPv4 packet forwarding system in Ravindran et al. [2005] has 14 MicroBlaze processors and two hard-core PowerPC processors arranged in four parallel processor pipelines with each stage performing a specific task.
2.3.2 Custom-Designed Hardware Accelerators. Soft processors often use hardware accelerators to speed up certain portions of an application. Traditionally, these accelerators are designed manually in HDL by a hardware designer. The interface between the soft processor and the accelerator can vary, but generally falls into two categories. Custom instruction accelerators are custom logic blocks within the processor's datapath, mapped to special opcodes in the processor's ISA. They can directly access the processor's register file, and effectively extend the functionality of the processor's ALU. Coprocessor accelerators are decoupled from the main processor, ideally allowing the CPU and the accelerator to execute concurrently. They can also have direct access to memory.
2.3.3 Synthesized Hardware Accelerators. Modern behavioral synthesis techniques can automatically create hardware accelerators from software. In particular, synthesizing circuits from C has long been advocated due to its widespread use in embedded designs. Like custom-designed hardware accelerators, they can be divided into two categories: ASIPs and synthesized coprocessors. Application-Specific Instruction set Processors (ASIPs) usually focus on the ASIC market, but recent research targets soft processors as well [Dinh et al. 2008] . ASIPs have configurable instruction sets that allow the user to extend the processor by adding custom instructions to replace common code sequences. An important goal is automatic generation of the custom instructions from benchmark software. This frequently requires the use of proprietary synthesis languages and compilers. The Tensilica Xtensa [Tensilica, Inc. 2008] and ARC Core processors [ARC International 2008] are two commercial examples.
The Nios II C2H Compiler [Altera Corp. 2008b] automatically synthesizes a single C function into a hardware accelerator. It is automatically connected to a Nios II memory system through the Avalon system fabric [Altera Corp. 2008a] . Figure 3 shows a Nios II system with one C2H accelerator and connections between the various components of the system. The C2H compiler synthesizes pipelined hardware from the C source code using parallel scheduling and direct memory access. Each C assignment statement infers a pipeline register. Loops are automatically pipelined and scheduled, but not unrolled; this must be done manually. Initially, each C memory reference is handled by creating its own master port in the accelerator hardware. Normally, when several master ports connect to the same memory block, the Avalon system fabric creates an arbiter to serialize accesses. As an additional optimization, C2H automatically merges all its own master ports connected to the same memory block by combining the references and scheduling them internally.
Some other C-based/C-like languages or synthesis tools include Carte by SRC Computers, Catapult C Synthesis by Mentor Graphics, CHiMPS by Xilinx, Cynthesizer by Forte, Dime-C by Nallatech, Handel-C by Celoxica, Impulse C by Impulse Accelerated Technologies, Mitrion-C by Mitrionics, PICO Express by Synfora, NAPA C by National Semiconductor, SA-C by Colorado State University, Streams-C by Los Alamos National Laboratory, and SystemC by the Open SystemC Initiative.
2.3.4 Soft Processor Architectures. VLIW and superscalar architectures have also been used in FPGAs for acceleration [Grabbe et al. 2003; Brost et al. 2007; Jones et al. 2005; Saghir et al. 2006]. These designs offer increased parallelism, typically up to about 4 functional units. FPGAs have also been used to prototype VLIW and superscalar architectures intended for full-custom implementation, but they tend to be inefficient in area and performance [Lu et al. 2007; Ray and Hoe 2003].
2.3.5 Drawbacks. The acceleration methods described previously each have significant drawbacks. Multiprocessor systems are very flexible, but are complex both to design and use. Significant hardware knowledge is needed to design a multiprocessor system, including consideration of issues such as interconnect, memory architecture, cache coherence and memory consistency protocols, dynamic routing, flow control, and deadlock avoidance. Users also need parallel programming and debugging tools to use these systems.
Custom-designed hardware accelerators require hardware design effort to implement, verify, and test. Also, for each portion of the application to accelerate, a different hardware accelerator is needed. This adds further time and design effort. Effort is also required to keep these accelerators off the critical path of the soft processor core. Otherwise, CPU operating frequency is reduced and this will affect all nonaccelerated software.
With synthesized accelerators, a common drawback is that a change to the accelerated software function often requires RTL regeneration, which implies repeating synthesis, place, and route. This can also make it difficult to achieve a targeted clock frequency.
An improved soft processor can accelerate several applications and does not require hardware design knowledge or effort to use. However, VLIW architectures require a multiported register file to support multiple functional units. Mapping several read and write ports to embedded FPGA memories is very costly; the number of memory blocks needed is the product of the number of read ports and the number of write ports. This can be partially alleviated with partitioned register files, but this adds restrictions on which registers can be accessed by each instruction. Superscalar processors suffer the same problems as VLIW approaches and, in addition, require complex dependency-checking and instruction-issuing hardware.
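The port-scaling cost can be made concrete with a one-line estimate. The sketch below simply encodes the product rule stated above; `regfile_blocks` is a hypothetical helper, and it ignores register file depth and width, which may require additional blocks in practice.

```c
#include <assert.h>

/* Estimated embedded memory blocks for a multiported register file,
 * using the product rule: read ports times write ports.
 * Depth and width effects are deliberately ignored. */
unsigned regfile_blocks(unsigned read_ports, unsigned write_ports)
{
    assert(read_ports > 0 && write_ports > 0);
    return read_ports * write_ports;
}
```

Under this estimate, a two-read, one-write register file needs 2 blocks, whereas a hypothetical 4-issue design with 8 read ports and 4 write ports would need 32.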
The next section will present the detailed architecture and design of a soft vector processor, an approach that combines the best advantages of most of these accelerator techniques, and overcomes many of their drawbacks. In particular, it provides a simple programming model to exploit parallelism, does not require hardware design knowledge, does not require resynthesizing the hardware when software is modified, and maps well to FPGAs.
Despite the many challenges of C-based synthesis [Edwards 2006], it is of great interest both academically and commercially. As a result, this article compares performance of VIPERS with the C2H tool in Section 5.
CONFIGURABLE SOFT VECTOR PROCESSOR
VIPERS is a family of soft vector processors with varying performance and resource usage, and a configurable feature set to suit different applications. A number of parameters are used to configure the highly parameterized HDL source code and generate an application- or domain-specific processor instance. The configurability gives designers flexibility to trade off performance and resource utilization. It also allows some fine-tuning of resource usage by removal of unneeded features and instructions. Figure 1 shows the VIPERS architecture, consisting of a scalar core, a vector processing unit with multiple vector lanes, a memory unit with address generation logic, and memory crossbars to control data movement.
Soft Vector Architecture
VIPERS is implemented in Verilog and targets an Altera Stratix III. During the implementation, we identified certain ways traditional vector architectures can be adapted to exploit the many multiplier and memory blocks in modern FPGA architectures:
(1) use of a partitioned register file to scale bandwidth and reduce complexity, (2) use of Multiply-Accumulate (MAC) units for vector reduction, (3) use of a local memory in each vector lane for table-lookup functions, (4) use of a large number of vector registers, and (5) use of long vector registers.
These adaptations are presented in greater detail throughout the rest of this section.
3.1.1 Scalar Core. VIPERS uses the UT IIe [Fort et al. 2006] processor as its scalar core. HDL source is required for tight integration, precluding the use of a real Nios II.
The UT IIe is a relatively complete Nios II ISA implementation. It has a 4-stage pipeline, but an instruction requires several clock cycles to execute. This is because it is designed for multithreaded execution, meaning it has no hazard detection logic or forwarding. Hence, it can only issue a new instruction (in the same thread) after the previous one has completed. Despite this performance drawback, which can be fixed, we chose this core to speed development; a pretested core saves significant development effort.
Scalar and vector instructions are both 32 bits in length and can be freely mixed in the instruction stream. This allows both units to share instruction fetch logic. The two units can execute different instructions concurrently, but will coordinate via the FIFO queues when needed, for example, for an instruction that uses both scalar and vector operands.
Neither VIPERS nor UT IIe uses caches. Instead of an instruction cache, the scalar core and vector unit share a dedicated, on-chip, single-cycle instruction memory. Instead of a data cache, scalar data accesses use a dedicated port to the vector memory crossbars discussed in Section 3.3. This ensures all data accesses between the two cores are kept consistent. Although this increases scalar access latency by an additional cycle or two, the benchmark kernels examined in this article do not have any scalar memory accesses in their performance-critical sections; the scalar register file is sufficient to provide all scalar values required.
3.1.2 Vector Instruction Set Architecture. VIPERS adds vector instructions to Altera's existing Nios II scalar instruction set. All of the original Nios II scalar instructions still use their original encodings. The VIPERS instructions are limited to three undefined 6-bit OP codes, 0x3D to 0x3F. Although Nios II provides a "custom" OP code, it is too small to fit all of the added instructions (and planned future extensions). The "custom" OP code also presents other complications, such as an inflexible bit encoding.
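As an illustration of how a shared fetch stage might route instructions, the sketch below classifies a 32-bit instruction word by its 6-bit OP code. It assumes the OP field occupies the least-significant six bits of the word, as in the standard Nios II instruction formats; the macro and function names are hypothetical and are not taken from the VIPERS source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NIOS2_OP_MASK   0x3Fu  /* 6-bit OP field, assumed in bits 5..0 */
#define VIPERS_OP_FIRST 0x3Du  /* first OP code used by VIPERS */
#define VIPERS_OP_LAST  0x3Fu  /* last OP code used by VIPERS */

/* Extract the 6-bit OP code from a 32-bit instruction word. */
static inline uint32_t nios2_opcode(uint32_t insn)
{
    return insn & NIOS2_OP_MASK;
}

/* True when the instruction uses one of the three OP codes
 * (0x3D to 0x3F) reserved for vector instructions. */
bool is_vipers_insn(uint32_t insn)
{
    uint32_t op = nios2_opcode(insn);
    return op >= VIPERS_OP_FIRST && op <= VIPERS_OP_LAST;
}
```

Because the three OP codes are contiguous, a hardware decoder could make the same decision with a small comparator on the OP field.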
The VIPERS instruction set borrows heavily from the instruction set of VIRAM, but makes modifications to target embedded applications and FPGAs. The instruction set includes 46 vector integer arithmetic (excluding divide), logical, memory, and vector and flag manipulation instructions, plus an additional 14 instructions for unsigned variants. However, it removes support for virtual memory, floating-point data types, and certain vector and flag manipulation instructions. Most instructions can be paired with one of two vector flag registers, which are 1-bit predicates used to mask execution of individual elements. Many VIRAM fixed-point instructions are not yet implemented.
VIPERS defines 64 vector architectural registers. This large number was selected to make better use of relatively large FPGA memories. A large set of vector registers increases performance by acting as a software-managed cache. It also enables certain optimizations that increase register pressure, such as loop unrolling. Table I lists new instructions that are added to VIPERS to support new features enabled by FPGA resources. These are described in Section 3.4.
Chaining is the passing of partial results from one functional unit to the next between two data-dependent instructions before the entire result vector has been computed by the first unit. Chaining through the register file has a significant drawback: it requires at least one additional read port (to provide one operand for the dependent instruction) and one additional write port (for the result) for each chained functional unit. This contributes to the complexity and size of a traditional vector register file. Since this cannot be implemented efficiently in an FPGA, a different scheme was used.
VIPERS uses a hybrid vector-SIMD execution model illustrated in Figure 4(b). In this model, vector instructions are executed both in SIMD fashion by repeating the same operation across all vector lanes, and in traditional vector fashion by repeating the operation over several clock cycles. Hence, the number of cycles to execute a typical vector arithmetic instruction is the current vector length divided by the number of vector lanes (rounded up). The number of vector lanes in a soft vector processor can be potentially quite large to take advantage of the programmable fabric, so the number of clock cycles required to process each vector is likely to be small. This allows chaining to be removed and simplifies the design of the register file.
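The cycle count described above is a ceiling division, which can be written as a short helper. The function name `vector_cycles` is hypothetical, and the estimate covers only the execution cycles of a typical arithmetic instruction, not pipeline fill, stalls, or memory latency.

```c
#include <assert.h>

/* Execution cycles for a typical vector arithmetic instruction:
 * the vector length divided by the number of lanes, rounded up. */
unsigned vector_cycles(unsigned vl, unsigned nlane)
{
    assert(nlane > 0);
    return (vl + nlane - 1) / nlane;
}
```

For example, 16 elements on 4 lanes take 4 cycles, while 17 elements take 5, because the final cycle uses only one lane.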
Vector Lane Datapath
Details of the VIPERS vector lane datapath are shown in Figure 5. The number of vector lanes is specified by the NLane parameter. Each vector lane has a complete copy of the functional units, a partition of the vector register file and vector flag registers, a load-store buffer, and a local memory if parameter LMemN is greater than zero. The vector lane data width is determined by the parameter VPUW. Vector lanes operate independently, without interlane communication, for most vector instructions. NLane is the primary determinant of the processor's performance (and area). With additional vector lanes, a fixed-length vector can be processed in fewer cycles, improving performance. In the current implementation, NLane must be a power of 2.
3.2.1 Vector Pipeline. In addition to the instruction fetch stage from the scalar core, each vector lane uses a four-stage execution pipeline. Deeper pipelining would allow a higher operating frequency, but the pipeline is intentionally kept short so a vector instruction can complete in a small number of cycles. This avoids the need for either chaining or forwarding multiplexers. With the shared instruction fetch stage, the entire processor can only fetch and issue one instruction per cycle. As shown in Figure 1 , the vector unit has a separate decoder for decoding vector instructions. The memory unit has an additional decoder to allow overlapped execution of a vector ALU and vector memory instruction. Note, however, that all loads and stores are executed in program order.
The vector unit implements Read After Write (RAW) hazard resolution through pipeline interlocking. The decode stage detects a data dependence between instructions, and stalls the newest instruction if a pipeline hazard is detected until it is resolved. For example, this will typically happen when dependent instructions use a vector length of 3 × NLane or smaller. However, this penalty can often be avoided by improved instruction scheduling. For example, the dependent instruction chain in the motion estimation benchmark in Section 4.1 is avoided by software pipelining: The loop is unrolled once and rescheduled to interleave two independent iterations. This works because there is no loop-carried dependence, a property typical of vectorizable data-parallel applications.
The decode stage also detects RAW hazards between the vector unit and the memory unit. Also, indexed memory accesses stall the entire vector core for the memory unit to read address offsets from the vector register file.
3.2.2 Datapath Functional Units. The functional units within each vector lane datapath include an ALU, a single-cycle barrel shifter, and a multiplier. The ALU supports arithmetic and logical operations, maximum/minimum, merge, absolute value, absolute difference, and comparisons. The barrel shifter is implemented in log(n) levels of multiplexers, and the multiplier is implemented using DSP blocks. The multiplier takes up one quarter of a DSP block with 16-bit inputs, and half a DSP block with 32-bit inputs.
3.2.3 Distributed Vector Register File. The VIPERS vector register file is distributed across vector lanes. This avoids the problem with traditional vector architectures of requiring too many ports on a large, centralized vector register file [Kozyrakis and Patterson 2003a]. The vector register file is element-partitioned; each vector lane has its own register file that contains all 64 vector registers, but only a few data elements of each vector register [Asanovic 1998]. This is shown in Figure 5, where the 4 vertical dark-gray stripes together represent a single vector register that spans all lanes, with 4 vector elements per lane. The actual number of elements per lane may not equal 4; in general, it is MVL/NLane. This element partitioning scheme divides the vector register file into parts that can be implemented using the embedded memory blocks on the FPGA. This allows parallel access to multiple data elements of a vector register every cycle. Furthermore, the distributed vector register file saves area compared to a large, multiported vector register file.
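One way to picture element partitioning is as a mapping from a vector element index to a lane and a slot within that lane's register file. The sketch below assumes an interleaved mapping, in which consecutive elements fall in consecutive lanes so that NLane adjacent elements can be processed in one cycle; this mapping is an assumption made for illustration, and the `elem_loc` type and function names are hypothetical.

```c
#include <assert.h>

/* Location of one vector element in a distributed register file. */
typedef struct {
    unsigned lane;  /* which vector lane holds the element */
    unsigned slot;  /* position within that lane's partition */
} elem_loc;

/* Elements of each vector register stored per lane: MVL / NLane. */
unsigned elems_per_lane(unsigned mvl, unsigned nlane)
{
    assert(nlane > 0 && mvl % nlane == 0);
    return mvl / nlane;
}

/* ASSUMED interleaved mapping: element i lives in lane (i mod NLane),
 * at slot (i div NLane) of that lane. */
elem_loc locate_element(unsigned i, unsigned mvl, unsigned nlane)
{
    assert(nlane > 0 && i < mvl);
    elem_loc loc = { i % nlane, i / nlane };
    return loc;
}
```

With MVL = 16 and NLane = 4, each lane holds 4 elements of every register, and under this mapping element 5 would sit in lane 1, slot 1.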
Although VIRAM supports only 32 architectural registers, the large embedded memory blocks in Stratix III encouraged the use of 64 registers in VIPERS. Assigning four (32-bit) elements of each vector register to each lane fills one M9K RAM; this is duplicated to provide two read ports. For this reason, the Maximum Vector Length (MVL) supported by a processor instance is typically 4 × NLane for a 32-bit VPUW. Hence, most vector instructions that use the full vector length execute in 4 clock cycles.
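The sizing argument can be checked with simple arithmetic: 64 registers × 4 elements × 32 bits is 8,192 bits per lane. The sketch below encodes that calculation. The 8,192-bit capacity constant is an assumption (a 9-Kbit block used without its parity bits), and the helper names are hypothetical.

```c
#include <assert.h>

#define NUM_VREGS     64u    /* vector architectural registers */
#define M9K_DATA_BITS 8192u  /* ASSUMED usable bits per M9K, parity unused */

/* Bits of vector register storage needed in one lane. */
unsigned lane_regfile_bits(unsigned elems_per_lane, unsigned vpuw)
{
    return NUM_VREGS * elems_per_lane * vpuw;
}

/* MVL implied by a given number of elements per lane. */
unsigned typical_mvl(unsigned elems_per_lane, unsigned nlane)
{
    return elems_per_lane * nlane;
}
```

For a 32-bit VPUW, four elements per lane exactly fill the assumed capacity, and an 8-lane instance would then have an MVL of 32; a full-length instruction processes four elements per lane, one per cycle.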
3.2.4 Load/Store Buffers. Two FIFO queues are used to separately buffer load and store data between the vector lanes and the memory crossbar. For a vector memory store, the vector lane datapath can process a different instruction as soon as it transfers data from the vector register file to the store queue. During a vector memory load, the vector memory unit places data from memory into the load buffers without interrupting the vector lane datapath. After all data has been loaded into the buffers, the vector controller inserts a writeback micro-operation into the pipeline to move the data from the load buffers to the vector register file; this stalls the vector pipeline from executing a new instruction until the last part of the load is finished. Despite the buffers, all loads and stores execute in program order to maintain memory consistency. Pipeline interlocking resolves dependence issues, allowing vector loads and stores to be intelligently scheduled to increase concurrency and hide memory latency.
3.2.5 Local Memory. Each vector lane can instantiate a local memory by setting the global LMemN parameter to the number of words in the memory. This local memory is noncoherent, and exists in a separate address space from main memory. The local memory uses register-indirect addressing through the vldl and vstl instructions, in which each vector lane supplies the address to access its own local memory. Like the distributed vector register file, it is normally split into 4 separate sections, one for each of the four data elements in a vector lane. However, if the parameter LMemShare is On, the four sections are merged, and the entire local memory becomes shared between all the elements that reside in the same lane. This mode provides a slightly larger table for applications that use the same table contents for all vector element locations.
Memory Unit
The memory unit handles accesses for both scalar and vector units. Scalar and vector memory instructions are processed in program order. Vector memory instructions are processed independently from vector arithmetic instructions, allowing their execution to be overlapped. To support arbitrary stride and data-access granularities (32-bit word, 16-bit halfword, 8-bit byte), crossbars are used to align read and write data. The width of the crossbars is MemWidth (default value of 128 bits), and the parameter MemMinWidth (default value of 8 bits) specifies the smallest data granularity that can be accessed.
The memory interface connects to an on-chip, single-cycle memory implemented in M144K memories. This provides between 96kB and 768kB maximum capacity, depending upon the specific Stratix III device. This should be sufficient buffering for many embedded applications, especially those that process streaming data. If higher capacity is needed, the interface could be connected to a 128-bit SDRAM controller. Modern SDRAM is well suited for burst reading and writing of long vectors, but a cache will likely be needed for scalar data accesses. Since FPGAs run at a relatively slow clock rate compared to modern SDRAM memory, the impact of this cache is not as significant as it is for multi-GHz processors.
The memory unit and crossbars are shown in Figure 6. The load/store controller issues instructions to the address generators, which also control the memory crossbars. The memory unit is also used to implement vector insert and extract instructions; a bypass register between the write and read interfaces allows data to be passed between the interfaces, and rearranged using the memory crossbars. The memory crossbar can align up to 16 data elements per cycle for unit-stride and constant-stride loads, and 4 elements per cycle for stores. Indexed offset accesses execute at one data element per cycle. The write interface datapath is shown in Figure 7. It is composed of a multiplexer to select data from vector lane store buffers, a data compress block, a selectable delay network to align data elements, and a write crossbar with MemMinWidth-bit granularity that is MemWidth-bits wide to connect with main memory. Figure 8 shows how the delay network and alignment crossbars are used to handle write offsets and data misalignment for a vector core with four lanes. The write address generator can generate a single write address to write several data elements to memory each cycle. A unit-stride vector store is shown in the figure, but the crossbar logic can handle any constant stride.
The alignment crossbar control logic contains the critical path of the system.
FPGA-Specific Vector Extensions
The VIPERS ISA contains several new instructions, listed in Table I, to take advantage of FPGA resources. Each of these is discussed next.
Fig. 8. Data alignment using delay network and crossbar for vector store [Asanovic 1998].
The distributed multiply-accumulate chain shown in Figure 5 utilizes the MAC mode of the Stratix III DSP blocks. The current implementation shares one DSP block among every four vector lanes. The vmac instruction triggers all of these DSP blocks to multiply the two specified vector registers and partially sum the results into a distributed accumulator located inside each DSP block. The vcczacc instruction performs final accumulation of these distributed accumulators and resets their contents to 0. This speeds up the otherwise inefficient vector reduction operation to a single instruction pair.
In some applications, it may be desirable to shorten the MAC chain so it does not span all vector lanes. In this case, the MACL parameter specifies the length of the chain, spanning a total of 4 × MACL vector lanes. A new chain is started after this, repeating as often as necessary until all vector lanes are included in a chain. Here, the vcczacc instruction produces a short vector of results, starting at element 0 for the first chain and ending at element NLane/(4 × MACL) − 1 for the last chain. This enables accumulation of multidimensional data, for example. The short vector can also be accumulated again, usually into a single result, by a second vmac/vcczacc instruction pair.
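The behavior of the vmac/vcczacc pair can be illustrated with a small functional model. This is only a sketch of the semantics described above, not the hardware: the class name MacChainModel is hypothetical, and the mapping of element i to lane i mod NLane is an assumption that the text does not state.

```python
class MacChainModel:
    """Functional model of distributed accumulators, one per group of 4 lanes."""

    def __init__(self, nlane, macl):
        if nlane % (4 * macl) != 0:
            raise ValueError("NLane must be a multiple of 4 * MACL")
        self.nlane = nlane
        self.macl = macl
        self.acc = [0] * (nlane // 4)   # one accumulator per DSP block

    def vmac(self, va, vb):
        """Multiply element-wise and add each product to its lane group's accumulator."""
        for i, (a, b) in enumerate(zip(va, vb)):
            lane = i % self.nlane       # assumed element-to-lane mapping
            self.acc[lane // 4] += a * b

    def vcczacc(self):
        """Sum the accumulators of each chain, then reset all accumulators to 0."""
        n_chains = len(self.acc) // self.macl
        out = [sum(self.acc[k * self.macl:(k + 1) * self.macl])
               for k in range(n_chains)]
        self.acc = [0] * len(self.acc)
        return out


# One chain spanning all 16 lanes: a full reduction to a single result.
full = MacChainModel(nlane=16, macl=4)
full.vmac(list(range(64)), [1] * 64)
full_result = full.vcczacc()            # one element: 0 + 1 + ... + 63

# Two chains of 4 lanes each: a short vector with NLane / (4 * MACL) results.
short = MacChainModel(nlane=8, macl=1)
short.vmac([1] * 8, [2] * 8)
short_result = short.vcczacc()
```

In this model the number of results returned by vcczacc is NLane/(4 × MACL), which matches the range of element indices given in the text.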
The vector lane local memory described in Section 3.2.5 is implemented using embedded memory blocks. The local memory can be read through the vldl instruction, and written using the vstl instruction. Data written to the local memory can be taken from a vector register, or a value from a scalar register can be broadcast to all local memories. Vector-strided versions of these instructions are also supported.
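A table lookup through the lane local memories can be modeled as follows. This is a minimal sketch of the shared mode (LMemShare On), in which all elements of a lane see the same table. The class and method names are hypothetical, and the mapping of element i to lane i mod NLane is an assumption.

```python
class LaneLocalMemoryModel:
    """Functional model: one private, noncoherent memory per vector lane."""

    def __init__(self, nlane, words):
        self.nlane = nlane
        self.mem = [[0] * words for _ in range(nlane)]

    def store_scalar_broadcast(self, addr, value):
        """Write one scalar value to the same address in every lane's memory."""
        for lane_mem in self.mem:
            lane_mem[addr] = value

    def vstl(self, vaddr, vdata):
        """Each element writes its data at its own address in its lane's memory."""
        for i, (addr, data) in enumerate(zip(vaddr, vdata)):
            self.mem[i % self.nlane][addr] = data

    def vldl(self, vaddr):
        """Each element reads its lane's memory at the address it supplies."""
        return [self.mem[i % self.nlane][addr] for i, addr in enumerate(vaddr)]


lm = LaneLocalMemoryModel(nlane=4, words=8)
for addr in range(8):
    lm.store_scalar_broadcast(addr, addr * addr)   # same table of squares in every lane
looked_up = lm.vldl([3, 1, 7, 0, 2, 2, 5, 6])

lm.vstl([0, 0, 0, 0], [10, 11, 12, 13])            # per-lane contents now differ
per_lane = lm.vldl([0] * 8)
```

After the broadcast stores, every lane holds identical table contents, so a lookup behaves like an ordinary indexed table read. After the vstl, address 0 holds a different value in each lane, which shows that the memories are private to their lanes.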
The adjacent-element shift chain shown in Figure 5 is accessed through the vupshift instruction. It allows fast, single-direction rotation of all data elements in a vector register from position i + 1 to position i. This is faster than the awkward sequence of VIRAM vector insert and extract instructions needed to implement this common operation. One final extension is the inclusion of an absolute difference instruction, vabsdiff, which is useful for data-fitting and motion estimation.
Configuration Parameters
The secondary parameters enable or disable optional features of the processor, such as MAC units, local memory, hardware multipliers, vector element shift chain, and logic for vector insert/extract instructions. For example, setting the MACL parameter to 0 disables the multiply-accumulate chain and saves DSP blocks.

BENCHMARKS
Three benchmark kernels representative of data-parallel embedded applications in video compression, image processing, and encryption were selected to run on the vector processor and C2H accelerator. This section describes the process of tuning the benchmarks for these target architectures. The tuning focuses on accelerating the kernel or main inner loops. The kernels were manually vectorized by placing inline assembly instructions in the benchmark C code, then compiling with Nios II GCC (nios2-elf-gcc 3.4.1) at optimization O3. A modified Nios II assembler translates the vector instructions. For the C2H compiler results, manual transformations such as loop unrolling were done to another version of the C code. 
Block Matching Motion Estimation
Block matching motion estimation is used in video compression algorithms. The motion estimation kernel calculates a SAD (Sum-of-Absolute-Difference) value for each position in a [−16, 15] search range and stores the values into an array. It makes no comparisons between the values. The SAD metric is defined as
$$\mathrm{SAD}[m, n] = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \bigl|\, c[i][j] - s[i+m][j+n] \,\bigr|$$
for an N × N block; here N = 16.
The Full Search Block Matching Algorithm (FSBMA) matches the current block c to all candidate blocks in the reference frame s within a search range [−16, 15]. It finds the motion vector <m, n> of the block with minimum SAD among the 32² search positions. Figure 9 shows C code for the motion estimation kernel.
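As an illustration of the full search, the following sketch computes the sum of absolute differences between the current block and the reference frame at every displacement in the search range, then picks the displacement with the smallest value. It is not the code of Figure 9; the function names, the frame layout (lists of rows), and the randomly generated test frame are all assumptions made for the example.

```python
import random


def sad(cur, ref, top, left, n=16):
    """Sum of absolute differences between cur and the n x n block of ref at (top, left)."""
    return sum(abs(cur[i][j] - ref[top + i][left + j])
               for i in range(n) for j in range(n))


def full_search(cur, ref, origin_row, origin_col, n=16, lo=-16, hi=15):
    """Return a dict mapping every displacement (m, k) in [lo, hi]^2 to its SAD value."""
    table = {}
    for m in range(lo, hi + 1):
        for k in range(lo, hi + 1):
            table[(m, k)] = sad(cur, ref, origin_row + m, origin_col + k, n)
    return table


def best_motion_vector(table):
    """Displacement with the minimum SAD."""
    return min(table, key=table.get)


# Hypothetical test data: a 48 x 48 reference frame of pseudo-random pixels,
# with the current block copied from displacement (3, -5) relative to (16, 16).
rng = random.Random(1)
ref = [[rng.randrange(256) for _ in range(48)] for _ in range(48)]
cur = [row[16 - 5:16 - 5 + 16] for row in ref[16 + 3:16 + 3 + 16]]

sad_table = full_search(cur, ref, origin_row=16, origin_col=16)
motion_vector = best_motion_vector(sad_table)
```

The kernel described in the text corresponds to full_search only, since it stores the SAD values without comparing them; best_motion_vector is included to show how the motion vector would be derived from the stored array.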
In a vector implementation of FSBMA, one of the dimensions is handled by vectorizing (removing) the innermost loop. This approach naturally supports a vector length of 16 due to the 16 × 16 pixel size of macroblocks in MPEG. To use a VL of 32, two copies of the current macroblock can be matched against the search area simultaneously. Figure 10 shows how this is accomplished using only unit-stride load memory instructions, which execute the fastest, in the inner loop. The two copies of the macroblock are offset vertically by one row. For each position within the search range, 17 rows are processed, with the calculations in the first and last row partially masked using a vector flag register. Figure 11 shows the vector code for the inner loop, plus code in the next outer loop to extract and accumulate results after processing the entire 16 × 16 window. Code to handle the first and last rows is not shown, but the instruction sequence is similar to the inner loop. The MAC chain is used to reduce partial results to one final result with the vcczacc instruction. This implementation requires 6 instructions in the innermost loop.
To enhance performance, a number of unrolling and rescheduling optimizations can be done. Although these will be done by hand here, they can be automated into a compiler system. For example, the code in Figure 11 exhibits a RAW dependence chain of 3 instructions caused by registers v3 and v4. Due to the lack of forwarding in the VIPERS pipeline, the processor stalls a dependent instruction to resolve the read-after-write hazard. In this example, the vadd instruction is stalled 2 cycles because vabsdiff only uses 2 cycles. To remove these stalls, the loop can be unrolled and rescheduled, as discussed in Section 3.2.1. The updated code is shown in Figure 12. To further improve performance, the number of memory accesses can be greatly reduced by unrolling the loop and reusing pixel data in the register file. The optimized kernel, shown in Figure 13, uses a single window with a vector length of 16. The code first loads all 16 rows of the reference frame into vector registers. Note how the calc_sad() macro avoids dependence stalls by grouping sets of four independent instructions together. When shifting the block vertically one position, it reuses 15 of the 16 rows of pixels from the reference frame. An updated SAD value for the new position is then calculated. This unrolled approach is up to five times faster, but it requires significantly more instruction memory and vector registers.
Image Median Filter
The median filter replaces each pixel with the median value of a surrounding 5 × 5 window. Figure 14 presents C code that performs a bubble sort on a 5 × 5 image region, stopping early after the top half is sorted to locate the median.
The median filter kernel vectorizes nicely by exploiting outer-loop parallelism. Figure 15 shows how this can be done. Each strip represents one row of MVL pixels, and each row is loaded into a separate vector register. The window of pixels that is being processed will then reside in the same data element over 25 vector registers. After initial setup, the same filtering algorithm as the scalar code can then be used. Figure 16 shows the inner-loop vector assembly. This implementation generates MVL results at a time. Thus, an 8-lane vector processor will generate 32 results at once.
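The outer-loop vectorization can be sketched as follows: 25 lists stand in for the 25 vector registers, and each compare-exchange step of the partial bubble sort becomes an element-wise minimum and maximum. This is an illustration under stated assumptions rather than the code of Figures 14 or 16; the helper names and the small synthetic image are hypothetical.

```python
def load_window_registers(img, row, col, vl, k=5):
    """Register (dy, dx) holds window position (dy, dx) for vl adjacent output pixels."""
    return [img[row + dy][col + dx:col + dx + vl]
            for dy in range(k) for dx in range(k)]


def median_of_registers(regs):
    """Partial bubble sort across registers; element e of the result is the
    median of element e over all registers."""
    regs = [list(r) for r in regs]
    n = len(regs)                      # 25 for a 5 x 5 window
    half = n // 2                      # index of the median after sorting
    for i in range(half + 1):          # stop once the top half is in place
        for j in range(n - 1 - i):
            lo = [min(a, b) for a, b in zip(regs[j], regs[j + 1])]
            hi = [max(a, b) for a, b in zip(regs[j], regs[j + 1])]
            regs[j], regs[j + 1] = lo, hi
    return regs[half]


# Hypothetical 5-row, 12-column image; 8 results are produced at once.
img = [[(r * 37 + c * 101 + r * c * 13) % 256 for c in range(12)]
       for r in range(5)]
vl = 8
medians = median_of_registers(load_window_registers(img, 0, 0, vl))
```

Each pass moves the largest remaining value of every window to the end of the unsorted range, so after 13 passes register 12 holds the median of every window. The results correspond to the pixels at the centers of the 8 windows, that is, row 2 and columns 2 through 9 of the synthetic image.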
To further improve performance, the median filter kernel was fully unrolled as shown in Figure 17. All 25 vector registers are loaded at the beginning to eliminate subsequent vector loads. All vector stores are eliminated except one at the very end. By eliminating the redundant loads and stores, this unrolled approach is about three times faster. However, it also requires significantly more instruction memory.
AES Encryption
The AES encryption kernel computes 10 rounds of encryption on 128 bits of data using a 128-bit key. Only performance results for one intermediate round are included in the kernel, as the final round is slightly different and is not within the main loop.
The AES encryption algorithm [National Institute of Standards and Technology 2001] used here is taken from the MiBench suite [Guthaus et al. 2001]. Each block of 128 bits can be arranged into a 4 × 4 matrix of bytes, termed the AES state. The implementation uses a 1KB (256 × 32-bit) lookup table described in Daemen and Rijmen [2002].
The AES encryption kernel can exploit outer-loop parallelism by loading multiple blocks to be encrypted into different vector lanes. Each 32-bit column of the 128-bit AES state fits into one element of a vector register when VPUW = 32. A vector processor with MVL of 32 can encrypt 32 blocks (4096 bits) of data at a time.
The vector assembly code to encrypt one of the four columns, forming just part of a single round transformation, is shown in Figure 18. Code to initialize the lookup tables is not shown. The plaintext AES state is loaded from memory into vector registers v1 to v4 using four stride-four load word instructions. Each AES block now resides within a single vector lane, across these four vector registers. Each round transformation requires executing 60 vector and 7 scalar instructions.
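The data layout produced by the four stride-four loads can be illustrated with a short model. It assumes that the blocks are stored contiguously in memory as four 32-bit words each, one word per state column; the helper name strided_load and the word values are hypothetical.

```python
def strided_load(mem, base, stride, vl):
    """Model of a strided vector load: element i comes from mem[base + i * stride]."""
    return [mem[base + i * stride] for i in range(vl)]


num_blocks = 8
# Word 4*b + k stands for column k of block b (the values are just word indices here).
mem = list(range(4 * num_blocks))

# Four stride-four loads, one per column, starting at word offsets 0 to 3.
v = [strided_load(mem, base=k, stride=4, vl=num_blocks) for k in range(4)]

# Element b of the four registers together holds the complete state of block b.
block_3 = [v[k][3] for k in range(4)]

bits_per_pass = 32 * 128   # MVL of 32 blocks times 128 bits per block
```

Because element b of every register belongs to block b, all the column operations of a round can proceed on each block independently, which is what allows the kernel to scale with the vector length.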
Benchmark Tuning for C2H Compiler
The C2H compiler generates a single hardware accelerator for a given C function. The first set of accelerators, termed "push-button," is generated by compiling the source with as little modification as possible.
Additional accelerators are generated by modifying the C code to vary the performance/resource trade-off. Manual transformations are necessary because C2H does not provide the ability to control the amount of loop unrolling or set performance/resource goals. Selecting and unrolling loops involves applying similar concepts to those learned from vectorizing the benchmark. It also requires additional temporary variables (registers) to calculate multiple results in parallel. Hardware knowledge is needed to understand how this creates additional parallelism through pipelining and spatial replication.
Manual unrolling is error prone, and the resulting C code is ugly and cumbersome to modify. Although unrolling the vector assembly code in this article is also done manually, we found unrolled C2H code more difficult to manage. In particular, the AES code, which is more complex than the other benchmarks, was the most challenging to modify. To improve performance, each of the four table-lookup operations needs a dedicated 256-entry, 32-bit memory. Local tables can be automatically created by the compiler, but these must be global to facilitate initialization. Hence, they were manually added to the Nios II memory system. The AES engine was then replicated to support multiple streams, but all streams share the same four lookup tables. 
AREA AND PERFORMANCE RESULTS
This section compares area and speedup results for the three benchmark applications between the Nios II soft processor, the VIPERS soft vector processor, and the Altera C2H compiler.
Resource Utilization
Performance and area scalability were among the primary design goals of VIPERS. This section gives VIPERS area utilization and clock frequency trends for different combinations of configuration parameters, and compares resource usage of VIPERS to Nios II and C2H accelerators. All compilation is performed using Altera's Quartus II version 7.2.
VIPERS Resource Utilization.
To illustrate the scalability of VIPERS, several configurations with different parameter settings are shown in Tables III and IV.  Table III illustrates the range in resource utilization when changing NLane from 16 down to 4. The VxF configurations consist of x vector lanes, the full feature set of the architecture, and support 8-, 16-, and 32-bit data access granularity. The flexible memory interface is the single largest individual component of the processor. It uses over 50% of the ALMs in V4F, and 35% of the ALMs in V16F. Setting MemMinWidth to 16 and 32, indicated by y in VxMy, changes the minimum data access granularity. This affects the size of the memory crossbars; a larger value saves more area. Table IV shows the effect of successively removing secondary processor features from a V8F configuration. The V8Cz configurations remove local memory, distributed accumulators and vector lane multipliers, and vector insert/extract and element shifting instructions, respectively. Area can be further reduced by adjusting the primary parameters: V8W16 reduces VPUW to 16, and V8W16M16 supports only 16-bit memory accesses. Overall, more than 3,700 ALMs can be saved if these features are not needed. This savings is enough to implement six additional Nios II/s processor cores.
These results demonstrate that resource usage can span a wide range, depending upon the performance and processor features needed. Some additional savings may be possible by fine-grained instruction subsetting, but VIPERS does not currently support this feature.
Resource Comparison to Nios II and C2H.
Application-specific configurations of the soft vector processor, C2H accelerators, and the Nios II/s processor were compiled to measure their resource usage. The Nios II/s (standard) version is used as the baseline for area comparisons. The Nios II/s comes with a 5-stage pipeline, static branch prediction, and a 3-cycle multiplier and shifter. It is further configured with a 1KB instruction cache, 32KB each of on-chip program and data memory, and no debug core. Compilation targets a Stratix III EP3SL340 device in the C3 speed grade.
[...] vector processing, so 16-bit ALUs are used. No local memories are instantiated for these two benchmarks, and MAC units were instantiated only for motion estimation. The AES encryption kernel requires only 32-bit word access, so MemMinWidth was set to 32 bits. The AES kernel also requires a 256-word vector lane local memory. These customizations save about 30% of ALMs required by a full 16-lane processor.
Performance Results
This section compares the performance of VIPERS to the Nios II processor and C2H accelerators. In addition, performance of VIPERS is compared to a hand-crafted RTL implementation.
Performance Models and Methodology.
Performance of Nios II, VIPERS, and C2H accelerators is compared using execution time of the three benchmark kernels. Execution time is the product of total clock cycles and the minimum clock period (1/Fmax). The three target architectures use three different methods for calculating the total clock cycles; these will be described shortly. The Altera Quartus II TimeQuest analyzer is used to determine Fmax.
The Nios II processor cycles form the baseline case for all speedups. They are determined by assuming an ideal pipelined processor. In this model, every Nios II instruction is executed in 1 clock cycle, including memory operations and branches; there are no stalls, mispredictions, cache misses, etc. This gives the best-possible Nios II performance. For area and Fmax results, a real Nios II/s processor is used.
The number of cycles required by each C2H accelerator is extracted from the C2H compiler performance report. Cycles are calculated for the main loop only, using the loop latency and cycles per loop iteration figures. Only the accelerator performance is considered; no Nios II instructions are included. It is worth noting that this measure of C2H performance is optimistic because usually Nios II caches must be flushed prior to invoking the accelerator. For area and Fmax results, the accelerators are compiled targeting a slower Stratix II architecture.
The VIPERS processor cycles are determined by running the benchmark inside the ModelSim Verilog simulator and counting the cycles. This produces a realistic, cycle-accurate result of the RTL implementation. This captures all pipeline and memory stalls as well as concurrent operation of vector memory, vector arithmetic, and scalar instructions. It is worth noting that the UT IIe scalar core used in VIPERS requires four cycles to execute each scalar instruction. This adds considerable overhead to the median filter benchmark, for example. For area and Fmax results, application-specific instances of VIPERS are used.
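The execution-time metric can be written out directly. The sketch below uses hypothetical cycle counts and clock frequencies chosen only to show that an accelerator with a lower clock frequency can still achieve a speedup when it needs far fewer cycles; none of the numbers are measurements from this article.

```python
def execution_time(cycles, fmax_hz):
    """Execution time in seconds: total cycles times the minimum clock period."""
    return cycles * (1.0 / fmax_hz)


def speedup(base_cycles, base_fmax_hz, accel_cycles, accel_fmax_hz):
    """Ratio of baseline execution time to accelerated execution time."""
    return (execution_time(base_cycles, base_fmax_hz)
            / execution_time(accel_cycles, accel_fmax_hz))


# Hypothetical example: the accelerator runs at half the baseline clock rate
# but needs 20 times fewer cycles, giving a net speedup of about 10.
example_speedup = speedup(base_cycles=1_000_000, base_fmax_hz=200e6,
                          accel_cycles=50_000, accel_fmax_hz=100e6)
```

The baseline cycle count in the article's methodology is the instruction count of the ideal Nios II model, so base_cycles would be an instruction count in practice.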
To help software programmers optimize their code, the performance model in Table VI summarizes the cycles to execute several types of instructions. These values are determined from our knowledge of the Verilog source and ModelSim simulation experience. Most scalar instructions take four cycles each, except memory operations, which take longer because of the memory crossbars. Vector instructions vary depending upon runtime conditions, but typical cases require 4 cycles to execute a vector length of 64 across 16 lanes. Vector loads and stores, however, need additional cycles due to load buffers, store buffers, strided access, and data alignment. Hence, the performance model gives the execution latency for several different situations. Each equation typically has three terms: a constant term for instruction issue and other overhead, a variable term for data transfer between the vector register file and the load/store buffer, and another variable term for data transfer and realignment through the crossbars between the load/store buffer and memory.
Table VI notes: one term depends on (NLane, MaxElem, MemWidth/VPUW); StrideLength = VL · min(Stride, NumElem1); VecDataWidth is the data width of the particular vector memory access; C is 4 for vector loads or stores, with 1 extra cycle added for unaligned addresses. Example values in the table assume NLane = 16, VPUW = 32, MemWidth = 128, VL = 64, and Stride = 1 or 2. One portion of the latency can be hidden by overlapping execution with non-memory instructions; for another portion, overlapped execution is possible but not yet implemented.
Due to stalls, concurrency, changes to the vector length, and other dynamic runtime situations, VIPERS instructions do not always take the same number of cycles to execute. The performance model in Table VI does not capture these dynamic events. For example, it ignores dependence stalls and the potential overlap of scalar, vector, and vector memory instructions. In particular, in some situations a significant part of the load or store latency can be hidden by the load/store buffers. Programmers will have to use their experience and intuition to determine the impact of these types of events when optimizing their code.
Vector and C2H Performance Comparison.
Table VII shows the performance of Nios II, C2H, and several VIPERS configurations on the three benchmarks. The C2H results are from the "push-button" compilation.
Despite having a lower clock frequency, all vector configurations show speedup over the baseline ideal Nios II/s. Some of the improvement comes from lower loop overhead, as vector instructions significantly reduce the number of increment/compare/branch instructions executed. In addition, greater performance is obtained when more vector lanes are used. The instruction count and clock cycles per result decrease for median filtering and AES encryption as more vector lanes are added because more results are computed in parallel. Note this sometimes results in fractional instruction counts per block after the total instructions are divided by the number of blocks encrypted in parallel. For the motion estimation kernel (not unrolled), two versions of the code are used: one for V4 with a vector length of 16, and another for V8 and V16 with a vector length of 32. In the second version, the assembly source handles two SAD window calculations in parallel using the code shown in Figure 18. The fully unrolled motion estimation is a third version of the code where all processor configurations run the same single-window code using a vector length of 16. Figure 19 plots speedup of the VIPERS processors and C2H accelerators in comparison to the execution time of the ideal Nios II single-cycle model. This is plotted against the number of ALMs used, normalized to the real Nios II/s core. The bold/dark lines show performance of VIPERS with 4, 8, and 16 vector lanes. The median filtering and AES results also include a 32-lane version. The filled/gray data points show performance of the "push-button" C2H accelerators, and the thin/gray lines show the performance improvement when the C2H accelerators are scaled and optimized as described in Section 4.4. Note the diagonal line shows one-to-one speedup versus area, representing speedup of a MIMD multiprocessor system that has no interconnect or communication overhead.
The 4-lane VIPERS configurations for all three benchmarks have a 64-bit memory interface, while configurations with more lanes have a 128-bit memory interface. The 64-bit interface requires fewer ALMs, pushing the V4 points farther left (lower area) in the graph. This often produces a visual "kink" at the V8 configuration. This is not as apparent in the V8 configuration for AES, which has a smaller memory interface due to support for only 32-bit words, or motion estimation where a different (slower) version of the code is used. This illustrates the large overhead of the vector memory interface. When the number of vector lanes increases to 16, AES quickly catches up in ALM usage because it uses VPUW = 32 instead of VPUW = 16 for the others.
Speedups of the "push-button" C2H accelerators are similar to those of the V4 processor configurations for motion estimation and AES. For the median filter, however, C2H is slower than Nios II, partly due to a drop in clock speed. Although performance increases by scaling the C2H accelerators, notice that it saturates in all cases. After being unrolled many times, simultaneous access to the same variables in main memory eventually becomes the bottleneck. Adding the four lookup tables for the AES kernel achieves a large speedup over the "push-button" result, indicated by the large jump of the second data point on the curve. Further replicating the AES engine up to four times creates contention at these lookup table memories and results in little additional speedup. In all three benchmarks, additional C2H performance is possible by manually changing the Nios II memory system.
In contrast, a soft vector processor automatically scales performance by scaling the number of vector lanes. This scales the vector register file bandwidth at the same time. Since most operations inside inner loops of kernels read and write vector registers, increasing the register file bandwidth allows more data elements to be accessed and processed in parallel each cycle, increasing performance. This avoids the need to rearchitect the memory system as required with C2H. The vector programming model is a powerful abstraction for this scalable architecture, providing the same unified memory system to the user across several different configurations with zero hardware design effort.
Vector and RTL Performance Comparison.
Motion estimation and AES are commonly implemented using hand-crafted RTL accelerators. In particular, changes to the motion estimation algorithm are common, so a flexible software implementation would be more desirable than a fixed hardware implementation. Hence, it is useful to compare performance of VIPERS to a fixed RTL implementation.
Motion estimation in VIPERS with 16 lanes achieves a throughput of 1,898 Full Block Searches per Second (FBSS). In comparison, a fixed FPGA-based RTL implementation described in Li and Leong [2008] achieves 155,239 FBSS, which is 82× faster. However, one small change to the VIPERS ISA (combining vabsdiff and vmac into one instruction) can nearly double its throughput. Also, a higher clock frequency with deeper pipelining can potentially double it again. These two improvements would shorten the RTL advantage to just 21× faster. Although it is not as fast as hardware, it is impressive to achieve this speed with a purely software approach.
The VIPERS and hand-crafted RTL approaches are similar in area. The hand-crafted RTL uses 5,789 Xilinx Virtex-II Pro slices, compared to 7,983 ALMs for VIPERS.
[...] Figure 19, produce significant performance improvements. In both cases, the main idea is to reuse data in the vector registers as much as possible to avoid redundant memory loads. Also, loop overhead instructions are eliminated. This produces an impressive 3 to 5 times speedup over the original vector code, and shows how important it is to pay attention to memory operations. In the vector programming model, costly vector load and store operations are explicit. This gives programmers an explicit target to optimize.
Programming in vector assembly and achieving this level of performance is not difficult. In three weeks, a third-year ECE undergraduate was able to study the VIPERS documentation and speed up both applications. The student had taken introductory courses in digital logic and assembly language programming, but had no background in computer architecture or VHDL/Verilog design.
CONCLUSIONS AND FUTURE WORK
As the performance requirements and complexity of embedded systems continue to increase, designers need a high-performance platform that reduces development effort and time-to-market. A soft vector processor provides the advantages of scalable performance and resource usage, being simple to use with no hardware design knowledge or effort, and early decoupling of the hardware (RTL) and software design flows. Decoupling the flows is extremely important because software changes will not introduce additional hardware recompiles, saving many lengthy place-and-route iterations and more quickly reaching timing closure for the entire system. It also means a single vector processor can serve as a general-purpose accelerator for multiple applications.
Embedded software programmers can easily understand and control performance and area through a configurable number of vector lanes and the vector ALU width. Also, programmers can easily optimize software performance by applying the vector memory model; reusing data in vector registers is highly preferred to rereading data from memory.
The soft vector processor architecture developed in this article, VIPERS, outperforms a Nios II processor and the Altera C2H compiler on the three benchmarks shown. For example, VIPERS with 16 vector lanes achieves a speedup of 25× with just 14× the area of a Nios II processor on the median filter. The VIPERS implementation includes a flexible, single-bank memory interface that can support different memory widths and data access granularities of 8, 16, or 32 bits. This vector memory interface is a significant fraction of the total area, and also contains the critical path of the system. By customizing the soft vector processor to the benchmarks, an area saving of 30% was achieved compared to a full-featured configuration with 16 lanes. This 30% is a large amount of silicon: enough logic to implement six more Nios II/s processors.
Performance of the current implementation is hindered by a slow scalar core. The ability to overlap execution of scalar instructions via traditional pipelining is a minimum first step. However, increased overlap of scalar and vector instructions is also needed through a dual-issue approach. The best way to implement this needs to be studied.
Future work should also consider architectural improvements to the vector core. One benefit of working with a soft vector architecture is that architectural proposals do not have to benefit all applications. This is because features can be selectively included at system generation time. For example, a new vfir instruction that uses the FIR mode of the DSP blocks would be useful for only some applications. Another benefit of working with FPGA-based implementations is that the cost of resources is very different from that of a full-custom approach. For example, the hardware multipliers are already built into FPGA devices; full-custom vector processors must consider the trade-off of whether to build a multiplier or more on-chip memory, but FPGA implementations can use a multiplier for free. This new cost framework makes it important to reevaluate past architectural ideas in addition to more recent ones such as CODE [Kozyrakis and Patterson 2003b] and the Vector-Thread (VT) architecture [Krashinsky et al. 2004].
