Abstract. In this paper, we examine the trade-offs in performance and area due to customizing the datapath and instruction set architecture of a soft VLIW processor implemented in a high-density FPGA. In addition to describing our processor, we describe a number of microarchitectural optimizations we used to reduce the area of the datapath. We also describe the tools we developed to customize, generate, and program our processor. Our experimental results show that datapath and instruction set customization achieve high levels of performance, and that using on-chip resources and implementing microarchitectural optimizations like selective data forwarding help keep FPGA resource utilization in check.
Introduction
The proliferation of fast, high-density, and feature-rich FPGAs has transformed these versatile devices into powerful embedded computational platforms. In addition to embedding hard microprocessor cores in FPGAs, vendors are providing soft processor cores that can be implemented in the logic fabrics of their FPGAs [1, 2] . Soft processors enable designers to configure their datapaths to meet the needs of target applications. In most cases, designers can also create new custom instructions to accelerate performance-critical operations. So far, commercial soft processors have remained simple, featuring datapaths and instruction set architectures that resemble early RISC processors. However, as the speeds and logic densities of FPGAs increase, and as more on-chip resources (e.g. hardware multipliers and block memories) become available, it is becoming possible to implement more complex processor datapaths and instruction set architectures.
In this paper we present the results of a study we conducted to assess the impact of datapath and ISA customization on the performance and area of soft VLIW processors. In Sect. 2 we discuss related work and compare it with our own. In Sect. 3 we describe the datapath organization and instruction set architecture of our soft VLIW processor. Next, in Sect. 4 , we describe a number of microarchitectural optimizations we used to reduce the area of our processors. In Sect. 5 we describe our development tools and work flow, and in Sect. 6 we present and discuss our results. Finally, in Sect. 7, we present our conclusions and describe future work.
Related Work
VLIW processors have long been used to implement application-specific instruction processors (ASIPs) for embedded applications [3]. In addition to their simple hardware implementations, VLIW processors achieve high levels of performance by exploiting ILP and supporting datapath and ISA customization [4, 5]. However, since most ASIPs are custom processors implemented as standalone components or embedded within ASICs, they are difficult and expensive to reconfigure or re-customize after they have been designed and manufactured.
To address this shortcoming, there has been increased commercial and academic interest in configurable, customizable, and reconfigurable processor architectures over the past few years. Companies like ARC and Tensilica have long been offering configurable and customizable processors along with supporting tool chains [6, 7]. Academic researchers have also been embedding FPGA-like reconfigurable functional units (RFUs) in processor datapaths to implement custom instructions [8, 9], and customizable co-processors to offload performance-critical computations from their host processors [10]. However, most customizable processors are also aimed at the ASIC market, making them difficult to re-customize once they have been manufactured; and RFUs are often constrained by fixed architectures that limit the range of custom operations they can support. Although soft-core versions of these processors are commonly used for functional validation, they are not designed to be implemented in FPGAs.
Soft processors are an alternative to ASIPs and customizable processors since they can easily and quickly be reconfigured and implemented in FPGAs. They can also leverage available FPGA resources to provide very efficient implementations. In [11, 12] , the authors describe the Soft Processor Rapid Exploration Environment (SPREE), which they used to create and evaluate the area, execution performance, and energy consumption of various soft processor implementations. The results of this study provide several insights into the trade-offs involved in designing soft processors that issue and execute single instructions. Our work builds upon these results and examines the effects of datapath and ISA customization on the performance and area of soft VLIW datapaths.
Our VLIW processor is not the first to be implemented in an FPGA. In [13, 14], the authors describe a soft VLIW processor consisting of four identical 32-bit ALUs and a customizable hardware block for accelerating performance-critical loop kernels. The ALUs and the hardware block operate in parallel and are interconnected through a single, multi-ported, 32 × 32-bit register file. Our processor differs from this processor in several ways, and in Sect. 3 we describe its organization and architecture in more detail.
Processor Architecture
In this section we describe the datapath, instruction set architecture, and pipeline organization of our processor.
Datapath Organization
Figure 1 shows the datapath of our soft VLIW processor. The datapath can be configured to include any number of 16, 32, or 64-bit general-purpose functional units. These include arithmetic and logic units (ALUs), multiply-accumulate units (MACs), address generation units (AGUs), and data memory units (DMUs). AGUs compute data memory addresses and DMUs execute memory-access operations. Every DMU is connected to a single-ported data memory bank that is implemented using an on-chip RAM block. The datapath may also include a number of custom computational units (CCUs), which execute user-defined machine operations that are used to customize and extend the basic ISA. Finally, the datapath includes a single control and branching unit (CTL) that performs data movement, branching, and other control-flow operations.
The datapath also includes up to three distributed register files, each with a configurable number of registers and access ports. Distributed register files use fewer read and write ports, and are smaller and faster than a unified, multiported register file of the same aggregate size. The distributed register files include a data register file (DRF), an address register file (ARF), and an optional custom register file (CRF). Data can be transferred between the different register files using special data movement operations executed by the CTL unit.
Finally, to provide basic functionality, each processor must be configured with a minimal mix of functional units that includes one ALU, one AGU, and one DMU. These, in turn, require the use of the DRF and ARF. By default, each functional unit is 32 bits wide and each register file consists of 32 registers with two read ports and one write port. For the remainder of this paper, we will refer to this configuration as our base processor (BP).
Instruction Set Architecture
Our processor executes a basic set of integer machine operations that resemble MIPS R2000 instructions [15]. Most operations, including data memory access operations, have a latency of one processor clock cycle. Custom machine operations, on the other hand, extend the basic instruction set and, depending on their complexity, may have multi-cycle latencies.
The length of an instruction depends on the underlying datapath configuration since every functional unit in the datapath has a corresponding operation field in the instruction word. Although such instruction formats result in poor code density, several techniques are available to tackle this problem [3]. Since we are only interested in studying the effect of customization on the performance and area of the datapath, we do not address the issue of code density in this paper. Accordingly, we do not account for the area or latency of instruction memory in our results.
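To make the dependence of instruction length on the datapath configuration concrete, the following Python sketch counts one operation field per functional unit. The class name DatapathConfig and the 32-bit field width OP_FIELD_BITS are assumptions made for illustration only; the actual field widths of our instruction format are not specified here.

```python
from dataclasses import dataclass

# Assumed width of one operation field, for illustration only.
OP_FIELD_BITS = 32


@dataclass
class DatapathConfig:
    """Hypothetical description of a datapath's functional-unit mix."""
    alus: int = 1
    macs: int = 0
    agus: int = 1
    dmus: int = 1
    ccus: int = 0
    ctl: int = 1  # the datapath includes a single CTL unit

    def functional_units(self) -> int:
        return (self.alus + self.macs + self.agus
                + self.dmus + self.ccus + self.ctl)

    def instruction_bits(self) -> int:
        # One operation field per functional unit in the datapath.
        return self.functional_units() * OP_FIELD_BITS


base = DatapathConfig()                             # ALU + AGU + DMU + CTL
fir = DatapathConfig(alus=1, macs=1, agus=2, dmus=2)  # 7 units in total
print(base.instruction_bits(), fir.instruction_bits())
```

Under these assumptions the instruction word grows linearly with the number of functional units, which is the source of the code density problem mentioned above.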
Custom Computational Units
Custom computational units (CCUs) are user-designed combinational logic blocks that realize custom machine operations in hardware. CCUs may be pipelined but do not have to be. Figure 2 shows three different ways for adding CCUs to the datapath. The first ( Fig. 2(a) ) extends the functionality of a regular functional unit by augmenting it with custom logic. Custom operations executing on such CCUs are constrained to only operate on a pair of operands, produce a single result, and have a latency of one processor clock cycle. Another way to add a CCU to the datapath is to cluster the CCU with other functional units that share a register file ( Fig. 2(b) ). This provides a tight coupling between the CCU and the other functional units, which enables the units to share and exchange data efficiently. However, this also requires additional register ports, which increase the size and latency of the register file. Finally, a CCU can be added to the datapath by connecting it to a dedicated CRF (Fig. 2(c) ). This approach provides a looser coupling between the CCU and other functional units in the datapath, and requires explicit machine operations to move data between the CRF and other register files.
Pipeline Structure
Our base processor is designed around a classical, four-stage pipeline that includes instruction fetch (IF), instruction decode (ID), execute (EX), and writeback (WB) stages [15]. During the ID stage, long instructions fetched from instruction memory are decoded and their corresponding machine operations are dispatched for execution on the appropriate functional units. During the EX stage, these operations are executed in parallel on available functional units. Although most operations complete execution in a single clock cycle, some custom machine operations may require several clock cycles to complete. During this time, other operations, from later instructions, can continue executing on other functional units. Although this leads to out-of-order operation completion, which may lead to contention for register write ports and WAW hazards, both can easily be avoided through careful instruction scheduling. Contention for register write ports can also be eliminated by increasing the number of register write ports. Finally, the results of various operations are written back to the corresponding register files during the WB stage. To minimize the effect of RAW data hazards, data forwarding is used to bypass results between the WB and EX stages.
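The scheduling checks implied by out-of-order completion can be sketched as follows. This is a simplified model, not our assembler's algorithm: the Op record, the assumption that an operation writes back at cycle issue + latency, and the assumption that all operations target the same register file are introduced here for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Op(NamedTuple):
    issue: int    # cycle in which the operation enters EX (assumed model)
    latency: int  # EX latency in clock cycles
    dest: str     # destination register name


def writeback_cycle(op: Op) -> int:
    return op.issue + op.latency


def port_conflicts(ops, write_ports):
    """Cycles in which more results arrive than there are write ports."""
    writes = Counter(writeback_cycle(op) for op in ops)
    return sorted(c for c, n in writes.items() if n > write_ports)


def waw_hazards(ops):
    """Pairs where a later-issued op writes the same register no later
    than an earlier-issued op, so the older result would win."""
    ordered = sorted(ops, key=lambda op: op.issue)
    hazards = []
    for i, early in enumerate(ordered):
        for late in ordered[i + 1:]:
            if (late.dest == early.dest
                    and late.issue > early.issue
                    and writeback_cycle(late) <= writeback_cycle(early)):
                hazards.append((early, late))
    return hazards


# An 8-cycle custom operation followed by two single-cycle operations.
ops = [Op(0, 8, "r1"), Op(3, 1, "r1"), Op(7, 1, "r2")]
print(port_conflicts(ops, write_ports=1))  # contention in cycle 8
print(waw_hazards(ops))                    # r1 written out of order
```

In this example, adding a second write port removes the contention in cycle 8, while the WAW hazard on r1 can only be removed by rescheduling or renaming the destination register.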
Microarchitecture Optimizations
In this section we describe the various microarchitectural optimizations we used to reduce the area of the datapath.
Hardware Multipliers
Contemporary FPGAs provide on-chip hardware multipliers that are faster and smaller alternatives to implementing multipliers using the logic resources of the FPGA. We used hardware multipliers to implement the MAC functional units in our datapath. To support 32 × 32 multiplications, each MAC unit uses four, 18 × 18, hardware multipliers.
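One way four narrow multipliers can form a 32 × 32 product is the schoolbook partial-product decomposition sketched below. This is an illustrative model for unsigned operands split into 16-bit halves; it is an assumption and not necessarily the arrangement used in our MAC units, and the helper name mul18 is hypothetical.

```python
MASK16 = 0xFFFF


def mul18(a: int, b: int) -> int:
    """Stand-in for one 18 x 18 hardware multiplier.

    Operands are restricted to 17 unsigned bits so that they would also
    fit a signed 18-bit input (an assumption of this sketch).
    """
    assert 0 <= a < (1 << 17) and 0 <= b < (1 << 17)
    return a * b


def mul32x32(a: int, b: int) -> int:
    """Unsigned 32 x 32 -> 64-bit product from four partial products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    low = mul18(a_lo, b_lo)
    mid = mul18(a_lo, b_hi) + mul18(a_hi, b_lo)
    high = mul18(a_hi, b_hi)
    # Align each partial product to the weight of its operand halves.
    return low + (mid << 16) + (high << 32)


print(hex(mul32x32(0xFFFFFFFF, 0xFFFFFFFF)))
```

The three shifted additions correspond to the adder logic that must still be built from FPGA slices around the hardware multipliers.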
Hardware multipliers can also be used to implement fast and area-efficient shifters since shifting a value by one bit position to the left is equivalent to multiplying it by two [16] . That is why we also used the hardware multipliers to implement arithmetic and logical shifters for the ALU functional unit. To support 32-bit shift operations, each shifter uses two, 18 × 18, hardware multipliers. 
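The equivalence between shifting and multiplying by a power of two can be checked with a short Python sketch. The function names are hypothetical, and the model works on 32-bit values with a 64-bit product; it illustrates the arithmetic only, not the way our shifter is mapped onto the 18 × 18 multipliers.

```python
MASK32 = 0xFFFFFFFF


def shl(x: int, k: int) -> int:
    """Left shift: low 32 bits of x * 2**k."""
    return (x * (1 << k)) & MASK32


def shr_logical(x: int, k: int) -> int:
    """Logical right shift: upper 32 bits of the 64-bit product
    x * 2**(32 - k), with x treated as unsigned."""
    return ((x & MASK32) * (1 << (32 - k))) >> 32


def shr_arith(x: int, k: int) -> int:
    """Arithmetic right shift: same product, with x treated as a
    signed two's complement value."""
    signed = x - (1 << 32) if x & 0x80000000 else x
    return ((signed * (1 << (32 - k))) >> 32) & MASK32


print(hex(shl(0xFFFFFFFF, 4)))
print(hex(shr_logical(0x80000000, 4)))
print(hex(shr_arith(0x80000000, 4)))
```

Right shifts fall out of the same multiplier by selecting the upper half of the product, which is why a single multiplier-based structure can serve both shift directions.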
RAM Blocks
RAM blocks are memory cells embedded within the FPGA fabric and used to implement a wide range of storage devices [17]. Memory banks and register files implemented using RAM blocks are faster and smaller than those implemented using the flip-flops found in FPGA logic blocks. The dual-ported RAM blocks embedded within the Xilinx Virtex-II FPGA we used for this study can each be configured to store 32- or 64-bit words. The two ports on each RAM block can also be configured to serve as read, write, or read/write ports.
We used RAM blocks in our processor to implement a distributed data memory bank. Every DMU in the datapath is connected to a dedicated RAM block, which enables multiple data words to be accessed simultaneously. Distributed data memory banks are commonly used in programmable DSPs and provide a fast, low-cost, and high-bandwidth alternative to a large, unified, multiported data memory bank.
We also used RAM blocks to implement multi-ported register files. Figure 3 shows how eight, dual-ported RAM blocks can be used to implement a register file with four read and two write ports. In general, a register file with M read ports and N write ports requires M ×N RAM blocks distributed across N banks. Each bank stores duplicate copies of a subset of the registers and provides parallel access to this subset.
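The M × N organization can be modeled behaviorally as shown below. The class BankedRegisterFile, the modulo mapping of registers to banks, and the rule that write port n serves bank n are assumptions of this sketch; the text above only states that each bank holds duplicate copies of a subset of the registers.

```python
class BankedRegisterFile:
    """Behavioral model of a register file built from dual-ported RAM
    blocks, each used with one write port and one read port."""

    def __init__(self, num_regs: int, read_ports: int, write_ports: int):
        self.read_ports = read_ports
        self.write_ports = write_ports
        # blocks[bank][copy] models one RAM block. For simplicity every
        # block is indexed by the full register number.
        self.blocks = [[[0] * num_regs for _ in range(read_ports)]
                       for _ in range(write_ports)]

    def ram_blocks(self) -> int:
        return self.read_ports * self.write_ports

    def bank_of(self, reg: int) -> int:
        # Assumed mapping: registers are interleaved across the banks.
        return reg % self.write_ports

    def write(self, port: int, reg: int, value: int) -> None:
        bank = self.bank_of(reg)
        if bank != port:
            raise ValueError("write port %d cannot reach bank %d" % (port, bank))
        # Update every duplicate copy so all read ports see the value.
        for copy in self.blocks[bank]:
            copy[reg] = value

    def read(self, port: int, reg: int) -> int:
        # Read port m uses the m-th copy in whichever bank holds reg.
        return self.blocks[self.bank_of(reg)][port][reg]


rf = BankedRegisterFile(num_regs=32, read_ports=4, write_ports=2)
rf.write(port=0, reg=4, value=99)
rf.write(port=1, reg=5, value=7)
print(rf.ram_blocks(), [rf.read(p, 4) for p in range(4)])
```

In this model, two writes can proceed in the same cycle only if they target registers in different banks, a constraint the instruction scheduler would have to respect.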
Selective Data Forwarding
Data forwarding is a well known technique for eliminating RAW data hazards in pipelined processors. In a VLIW datapath, the number of forwarding paths and the complexity of the forwarding logic grow in proportion to the number of functional units used in the datapath. When implemented in FPGAs, the area occupied by forwarding paths, multiplexers, and control logic can grow significantly. To reduce this area, we use selective data forwarding where only the forwarding paths actually needed to support performance-critical code blocks are maintained. Although this makes forwarding difficult in code blocks that have other bypassing needs, dependent operations in these blocks can be scheduled in a way that eliminates RAW hazards. If dependencies cannot be eliminated through code scheduling, NOPs can be inserted between dependent operations, which, of course, degrades performance. Another solution exploits the field programmability of FPGAs to load processor configurations on a per-application basis to benefit from selective data forwarding without affecting execution performance. In this case, the only overhead is that involved in reprogramming the FPGA.
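The effect of removing forwarding paths on NOP overhead can be estimated with a simple model. In the sketch below, a forwarding path is a (producer unit, consumer unit) pair, all operations have single-cycle latency, and a dependence without a forwarding path costs one extra instruction slot (the parameter no_fwd_gap). These are assumptions for illustration and do not reproduce the exact timing of our pipeline.

```python
def count_nops(instructions, paths, no_fwd_gap=1):
    """Count NOP instructions needed to resolve RAW hazards.

    instructions: list of long instructions; each is a list of
                  (unit, dest, srcs) operations issued together.
    paths:        set of (producer_unit, consumer_unit) forwarding paths.
    no_fwd_gap:   assumed extra slots needed when no path exists.
    """
    last_write = {}  # register -> (producer unit, issue cycle)
    cycle = 0
    nops = 0
    for instr in instructions:
        earliest = cycle
        for unit, _dest, srcs in instr:
            for src in srcs:
                if src in last_write:
                    prod_unit, prod_cycle = last_write[src]
                    forwarded = (prod_unit, unit) in paths
                    delay = 1 if forwarded else 1 + no_fwd_gap
                    earliest = max(earliest, prod_cycle + delay)
        nops += earliest - cycle
        cycle = earliest
        for unit, dest, _srcs in instr:
            if dest is not None:
                last_write[dest] = (unit, cycle)
        cycle += 1
    return nops


code = [
    [("ALU", "r1", ["r2", "r3"])],
    [("MAC", "r4", ["r1", "r5"])],
    [("ALU", "r6", ["r4", "r1"])],
]
full = {("ALU", "ALU"), ("ALU", "MAC"), ("MAC", "ALU"), ("MAC", "MAC")}
selective = {("ALU", "MAC")}
print(count_nops(code, full), count_nops(code, selective), count_nops(code, set()))
```

For this three-instruction fragment, keeping only the ALU-to-MAC path costs one NOP, while removing all paths costs two, which mirrors the trade-off between forwarding area and cycle count described above.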
Development Tools
Configurable processors require a flexible tool chain that enables designers to quickly configure, generate, implement, and program new processor designs. Figure 4 shows the various tools we developed for our VLIW processor and the way they interact with each other. A key component in our tool chain is the processor configuration file (PCF), which captures the architectural and organizational parameters of a processor and is used to retarget both the datapath generator and the assembler. The PCF also describes the CCUs in the datapath and includes the VHDL implementations of their custom operations. PCFs are automatically generated by our processor configuration file generator (PCFGen), which provides a graphical user interface for specifying the architectural parameters of a target processor. The datapath generator (DPGen) extracts relevant information from the PCF and generates VHDL code and a testbench for the corresponding processor datapath. The retargetable assembler also extracts relevant information about the processor's ISA from the PCF and uses the information to generate long machine instructions in a binary format that matches the architecture of the target processor. To validate the functional behavior and measure the execution cycle counts of our processor, we used Mentor Graphics ModelSim 6.0a. We also used Xilinx ISE 7.1 to synthesize and implement the corresponding datapaths in a 190 MHz Xilinx XCV2000-F896 Virtex-II FPGA. Finally, we used the various reports generated by the Xilinx tools to measure processor clock frequencies and FPGA resource utilization.
Table 1 (partial). Kernel benchmarks: … Implements 32-bit reverse. SCB: Implements a bit scrambling function [18]. PUNC: Implements a bit puncturing function [18]. SAD16: Implements the 16 × 1 sum of absolute differences function [19].
Results
In this section we present the outcomes of different experiments that illustrate the performance and area trade-offs due to datapath and ISA customization of various soft VLIW datapaths. Since we still do not have a working high-level C compiler, we developed short assembly-language routines to assess the impact of specific architectural configurations on performance and area. To ensure our results reflect typical design trade-offs, we chose our kernel benchmarks from a range of embedded application domains. Table 1 shows the different kernel benchmarks we developed for this study.
In our results, we report the execution performance of a given benchmark as its wall clock execution time. This is computed as the product of the dynamic cycle count for the benchmark, obtained from the ModelSim behavioral simulation, and the processor clock cycle time, obtained from the Xilinx XST synthesis report. We also report the area for a given datapath implementation in terms of the number of FPGA slices, RAM blocks, and hardware multipliers used. These numbers are also obtained from the Xilinx XST synthesis report.
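The computation of execution time and relative performance can be written out as a short Python sketch. The function names and the numeric values are hypothetical and are chosen only to illustrate the arithmetic; they are not measurements from our tables.

```python
def execution_time_us(cycles: int, clock_mhz: float) -> float:
    """Wall clock time = dynamic cycle count x clock cycle time.

    With the clock in MHz, the cycle time is 1 / clock_mhz microseconds.
    """
    return cycles / clock_mhz


def speedup(baseline_us: float, candidate_us: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


# Hypothetical example: a customized datapath lowers the cycle count
# but also lowers the achievable clock frequency.
t_base = execution_time_us(cycles=1000, clock_mhz=100.0)
t_custom = execution_time_us(cycles=600, clock_mhz=80.0)
print(t_base, t_custom, round(speedup(t_base, t_custom), 2))
```

The example shows why both quantities must be reported together: a 40% reduction in cycle count yields only a 1.33 times improvement once the slower clock is taken into account.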
Base Processor vs. Xilinx MicroBlaze
In our first experiment, we compared the performance and area of our base processor with the Xilinx MicroBlaze soft processor. Tables 2 and 3 compare the performance and area of different implementations of our base processor (BP and BP+MAC) with those of the MicroBlaze (MB and MB+MUL) when executing two general-purpose computational kernels. To avoid giving our VLIW processor an unfair performance advantage, we made sure that neither kernel exploited any parallelism. Our results show that our base processor is 1.42 to 2.23 times faster than the MicroBlaze when executing the FIB100 benchmark and 2.11 to 3.23 times faster when executing the FACT100 benchmark. The higher performance of our processor is mainly due to lower cycle counts, which are achieved because of low-overhead looping operations. These operations are part of the base instruction set and are used to eliminate test-and-branch operations in conditional loops. The higher performance is also due to faster processor clock frequencies, particularly when selective and no data forwarding are used (cf. columns labeled SEL and NO, respectively), which are due to the deeper pipeline of our processor. When full forwarding is used (cf. columns labeled FULL), longer delays in the datapath cause our processor to achieve clock frequencies that are comparable to the MicroBlaze. When no data forwarding is used, the cycle counts in our processors increase significantly due to additional NOP operations that must be inserted in the code to eliminate RAW data hazards.
Our results also show that our base processor uses 0.88 to 1.47 times the slices used by the MicroBlaze. When we add a multiplier unit to both processors, our processor uses 0.97 to 1.95 times the slices used by the MicroBlaze. When full forwarding is used, our processors use more slices due to additional forwarding logic and data paths. On the other hand, when selective or no data forwarding is used, our processors use fewer slices. However, since the MicroBlaze uses additional logic to implement bus interfaces and support interrupts, these results are slightly skewed in our favor. Finally, since our processors use distributed register files with a large number of read ports, they use significantly more RAM blocks than the MicroBlaze.
Customizing the Datapath to Exploit ILP
In our second experiment, we studied the effects of customizing the datapath to support instruction-level parallelism. For this experiment, we used the FIR256 kernel because it exhibits high levels of parallelism, and we examined three processor implementations with progressively increasing support for parallelism: the base processor augmented with a MAC unit (BP+MAC); BP+MAC with a two-write-port DRF (BP+MAC+DRF2W); and a custom datapath that includes 1 ALU, 1 MAC, 2 AGUs, 2 DMUs, a three-write-port DRF, and a two-write-port ARF (FIR+DRF3W+ARF2W). Table 4 summarizes our results.
Our results show that as support for parallelism increases in the datapath, performance increases accordingly. For example, with full forwarding, BP+MAC+DRF2W and FIR+DRF3W+ARF2W are 1.49 and 2.55 times faster than BP+MAC, respectively. This is mainly due to lower cycle counts, which are a direct consequence of exploiting higher levels of parallelism. On the other hand, supporting higher levels of parallelism increases datapath complexity, which reduces processor clock frequency. Still, the result is a net improvement in performance. Performance can be further improved by using selective forwarding. For example, using selective forwarding, BP+MAC, BP+MAC+DRF2W, and FIR+DRF3W+ARF2W are 1.29, 1.98, and 2.94 times faster, respectively, than BP+MAC with full forwarding. This is mainly due to the faster processor clock frequencies that can be derived by reducing the complexity of the datapath. Finally, when no data forwarding is used, and despite higher clock frequencies, performance is consistently poorer than BP+MAC with full forwarding. This is mainly due to the excessive overhead introduced by NOP operations used to eliminate RAW data hazards.
Our results also show that supporting parallelism increases the number of FPGA resources used to implement the datapath. For example, when full forwarding is used, BP+MAC+DRF2W and FIR+DRF3W+ARF2W use 1.12 and 2.16 times as many slices as BP+MAC, respectively. Although the number of slices can be reduced significantly by using selective or no data forwarding, the savings must be weighed against the resulting levels of performance. When no data forwarding is used, the savings are achieved at the expense of degraded performance. Finally, the number of RAM blocks used in BP+MAC+DRF2W and FIR+DRF3W+ARF2W is 1.57 and 3.21 times the number used in BP+MAC, respectively, due to the increased number of read and write ports in both register files.
Adding a Custom Operation to the AGU
In our third experiment, we studied the effect of adding a bit-reverse operation to the base instruction set by augmenting the AGU with custom logic. We used the REV32 benchmark to compare the performance and area of our base processor (BP) with another implementation of BP that uses an augmented AGU (BP+REVAGU). Table 5 summarizes our results, which show that the custom operation in BP+REVAGU is 50.00 to 76.57 times faster than a software implementation executing on BP when full and no data forwarding are used, respectively. The higher performance is due to the significantly lower cycle count achieved by using the custom operation, and, in the case of no data forwarding, the higher processor clock frequency. Our results also show that the impact of adding the custom operation on FPGA resources is negligible. This is due to the simple implementation of the bit-reverse operation, which requires a few additional slices only. It is also due to the microarchitectural constraints on adding custom logic to a functional unit, which do not affect the number of register ports and keep the number of RAM blocks in the datapath constant.
Adding Multi-Cycle CCUs to the Datapath
In our fourth experiment, we studied the effects of customizing the instruction set by adding three, different, non-pipelined CCUs to the datapath. The SCB CCU implements a bit scrambling function used in WLAN OFDM modems [18] . It uses two 32-bit inputs, produces a single 32-bit output, and has a latency of eight clock cycles. The PUNC CCU implements a bit puncturing function, which is also used in the convolutional encoders and interleavers of WLAN OFDM modems [18] . It uses four 32-bit inputs, produces one 32-bit output, and has a latency of 16 clock cycles. Finally, the SAD16 CCU implements the 16 × 1 sum-of-absolute-differences function, which is used for motion estimation in MPEG4 video encoders [19] . It uses eight 32-bit inputs, produces one 32-bit output, and has a latency of four clock cycles. For each of these CCUs we created two processor implementations: one that connects the CCU to the DRF (BP+DRFSCB, BP+DRFPUNC, and BP+DRFSAD16) and another that connects the CCU to a dedicated 32 × 32-bit CRF (BP+CRFSCB, BP+CRFPUNC, and BP+CRFSAD16). Moreover, for each implementation, we considered the cases where full and no data forwarding were used, respectively. We then compared these processors to our base processor when executing the SCB, PUNC, and SAD16 kernels, which are software implementations of the corresponding CCU functions. Table 6 summarizes our results.
Our results show that, when connected to the DRF, the CCUs increase the performance of the BP by factors of 8.28 to 23.61 when full forwarding is used, and 15.92 to 43.06 when no forwarding is used. This is mainly due to the significantly lower cycle counts achieved when implementing the CCU functions in hardware. However, it is worth noting that in most cases the additional complexity resulting from introducing CCUs to the datapath also decreases the processor clock frequency. Still, the net effect is an improvement in performance. Our results also show that when full forwarding is used, the number of FPGA slices used by the CCU-enhanced datapaths is 1.37 to 3.38 times greater than the number used by the BP, and that they use more RAM blocks. This is due to the additional logic required to implement the CCUs, additional forwarding paths and logic, and support for more DRF read ports. When no data forwarding is used, the CCU-enhanced datapaths use significantly fewer FPGA slices. However, these still correspond to 1.31 to 4.45 times the number of slices used by the BP. On the other hand, the number of RAM blocks used by each processor remains the same.
Our results also show that, when connected to a dedicated CRF, the CCUs increase the performance of the BP by factors of 5.98 to 10.89 when full forwarding is used, and 10.87 to 13.15 when no forwarding is used. This, again, is due to the significantly lower cycle counts achieved by implementing the CCU functions in hardware. However, these generally lower levels of performance are due to the overhead of transferring data and results between the DRF and the
