Abstract
Introduction
Current technology demands have spurred exceptional development in the way computers are designed. Advanced computer architectures take advantage of out-of-order superscalar execution, aggressive speculation techniques, high-bandwidth caches and distributed processor architectures. Improvements in the integration density of components on a chip using VLSI techniques, and the corresponding lower costs, have made it possible to integrate complete processors with some memory on a single chip for improved performance.
Embedded systems and special-purpose architectures can use techniques similar to those used by general-purpose computers to enhance performance. However, they are also constrained by area and power, so their design decisions can be vastly different. Current trends show that technologically advanced products are moving towards multi-functional systems; classic examples include cellphones and smart devices. It is therefore desirable to fit greater functionality into embedded systems while decreasing area and power requirements and improving performance. When applications have vastly different characteristics, designing systems under such constraints can be daunting. One such design decision is the number of functional units of each type that should be incorporated in the arithmetic and logic unit (ALU). To cater to the varied nature of applications, we propose a dynamic scheme called block slicing that leads to efficient utilization of available resources and offers enhanced performance while introducing minimal additional hardware. This is a general scheme that can be applied to any processor so that applications that need fewer resources than are available gain in performance and power.
Relation to prior work
Superscalar architectures were built to extract parallelism from data and instructions. Multiple instructions and data are fetched simultaneously, and out-of-order execution is enabled to reduce stalls. The main architectural challenge is to issue multiple instructions per cycle and to do so efficiently. Instruction-level parallelism (ILP) can be achieved through efficient scheduling, a larger number of execution units and high scheduling bandwidth. SIMD processors such as vector processors process multiple sets of data in parallel to achieve high throughput. They rely on the nature of the task to achieve parallel performance. Intel MMX is a SIMD instruction set that processes multiple data elements within a single execution cycle. When the processor encounters an MMX instruction, it interprets the data registers as a collection of data and performs the same operation on all operands. This increases the throughput of the processor for tasks that operate on small bit sizes.
Other special architectures have been proposed that add a certain degree of flexibility to a processor. Wirthlin et al. [6] proposed the Dynamic Instruction Set Computer (DISC) architecture, which supports dynamic modification of its instruction set based on the demand of the incoming instruction. The DISC architecture had two significant features: partial FPGA reconfiguration, which provides the ability to reconfigure a subsection of an FPGA while the remaining logic operates unaffected; and relocatable hardware, which gives the flexibility to relocate partial configurations, or to make placement decisions for them, at run time. The work in [2] describes an instruction coding technique that reduces the width of instructions through compiler-driven dynamic instruction coding. The compiler selected the instruction set and parallel hardware functions based on the number of bits required for the instructions. Chandra Shekhar et al. tried to realize the benefits of both software-based general-purpose architectures and dedicated hardware architectures through Application Specific Instruction Set Processor (ASIP) [5] architectures. ASIPs are suitable for embedded applications because they permit the hardware-software boundary to be altered to meet the speed and energy constraints of a specific application.
All processors described above have static datapaths. The hardware is incapable of adapting to input tasks at run time. Hardware is usually designed with sufficient resources for all types of applications expected to run on it. However, tasks that require minimal resources and tasks that require maximum resources pass through the same datapath, which reduces the overall utilization of the hardware. This work addresses that issue. The processor can operate on smaller data sets independently, without the need for any special instructions. It identifies, and tries to schedule, instructions in any task that would otherwise stall due to lack of resources.
A small amount of hardware detects instructions that do not need the full word size of the execution units and schedules them on only a part of the unit, leaving the remaining part free to operate on other instructions. Thus the ALU can perform different operations simultaneously. Table 2 [3] lists the average MIPS dynamic instruction mix of five SPECint2000 programs (gap, gcc, gzip, mcf, perl) and five SPECfp2000 programs (applu, art, equake, lucas, swim).
Sliced Processor Architecture
In a superscalar processor, execution units are provided to service all types of instructions in the instruction set that need a computation unit. A generic instruction set consists of four basic types of instructions: ALU, Branch, Load and Store. The number of units allotted to each type of instruction affects the space requirements, the power usage and the additional logic necessary for these units to function smoothly in parallel. The optimum number of execution units of each type is typically based on the applications served by the processor and the type of tasks expected to run on it. The instruction mix in the SPECint2000 programs consists of a large percentage of integer operations: add, subtract, compare, shift, and, or and exclusive-or. To cater to the high percentage of ALU instructions, it may be necessary to include more than one ALU unit. The number of ALU instructions can vary widely from one benchmark suite to another. An unchecked addition of ALU units can result in idle units in the execution stage. Also, in ALU-intensive tasks, the ALU reservation units get flooded while others remain idle. Hence, this work proposes a flexible scheme based on the observation that the operands of ALU operations are not always as large as the word length of the machine allows. In such cases, only a part of the execution unit does useful computation, while the rest operates on zero bits, so the unit is not fully utilized. This work adds run-time flexibility to hardware modules in order to accommodate as many instructions as possible in the execution unit. The exact extra hardware and logic required to do this is designed and implemented, while the general concepts associated with the addition of flexibility are described below. These concepts can be applied in any form to any application.
Block Slicing
'Block slicing' is the process of splitting a block into multiple modules. A logic circuit that operates on 1-bit operands can be called a unit. N units are interconnected to form an N-bit module that operates on N-bit operands. In conventional implementations, N is known or preset, and the interconnection network between the units that form a module is static. When operands of varying lengths are encountered, N needs to be dynamic. To allow N to be determined at run time, the interconnection network must be made flexible. The network could be built to be completely flexible, but it is impractical to reprogram it before the execution of every instruction. Instead, a limited degree of flexibility can be given to it: m units are connected together statically to form m-bit functional units. We refer to each m-bit unit as a slice, capable of operating on m-bit operands. If a contemporary processor has N-bit functional modules, the sliced architecture has N/m slices. The interconnection network between slices can be designed to be completely flexible, so that each slice can operate independently or connect itself to more slices and operate together with them. When two m-bit slices operate independently, they can execute two instructions simultaneously, provided the operands are m-bit. When two slices connect together, they form a 2m-bit unit.
In an execution unit sliced into m-bit slices, slices are allocated to each instruction based on the resource requirements of the ready instructions. Two functions are associated with allocation before the instructions are ready to be executed: directing the operands into the correct operand register slices, and directing the result correctly into an N-bit output register. These functions can be performed by decoders at the input and output of the execution unit.
A truth table for the decoder can be easily developed and implemented as the internal circuitry for the decoder. Different execution units need different decoding functions as can be seen from the architecture explained in the next section.
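The effect of connecting or separating slices can be illustrated with a small software model of an adder. This is only a sketch under stated assumptions: the text does not specify how slices are linked, so the model assumes that a slice boundary is controlled by gating the carry, and the names `sliced_add` and `link` are hypothetical.

```python
SLICE_BITS = 8   # m: width of one slice (8-bit slices, as in the integer unit)
NUM_SLICES = 4   # N/m: number of slices in a 32-bit unit
SLICE_MASK = (1 << SLICE_BITS) - 1


def sliced_add(a, b, link):
    """Add two 32-bit words slice by slice.

    link[i] is True when slice i and slice i + 1 are connected, which is
    modeled here (an assumption) as letting the carry cross that boundary.
    """
    result = 0
    carry = 0
    for i in range(NUM_SLICES):
        a_i = (a >> (i * SLICE_BITS)) & SLICE_MASK
        b_i = (b >> (i * SLICE_BITS)) & SLICE_MASK
        total = a_i + b_i + carry
        result |= (total & SLICE_MASK) << (i * SLICE_BITS)
        carry_out = total >> SLICE_BITS
        # The carry is passed on only across a linked slice boundary.
        carry = carry_out if (i < NUM_SLICES - 1 and link[i]) else 0
    return result


# One 32-bit addition: all slices linked.
print(hex(sliced_add(0x000000FF, 0x00000001, [True, True, True])))
# Two independent 16-bit additions: the middle boundary is open.
print(hex(sliced_add(0x000100FF, 0x00010001, [True, False, True])))
```

With all boundaries open, the same hardware model performs four independent 8-bit additions, which is the case the decoders at the input and output must route operands and results for.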
Sliced ALU Implementation
The pipeline stages in a sliced ALU are shown in Fig. 1(a). Resource mapping is done by a unit called the Resource Mapper. Its latency is equivalent to a few logic gates, so it can be included in the dispatch pipeline stage instead of forming a separate stage.
Resource Mapper
The Resource Mapper determines the number of slices required by an incoming instruction and allocates slices for all incoming instructions. To determine the number of slices required by an instruction, the Resource Mapper performs a function called 'zero-checking'. It determines the number of significant bits in both operands and returns the larger of the two, expressed in slices, as the number of slices required by the instruction. This can be achieved simply by using AND gates. The zero-checking function is slightly different for the shift operation, which requires not only the number of significant bits of the first operand but also the value of the second operand. Using these values and a simple logic circuit, the number of slices required by a shift instruction can be determined.
Each reservation unit has an associated register called the Resource Allocation Vector (RAV), which keeps track of the slices allotted to the instruction stored in that reservation unit. In addition, the Resource Mapper uses a global register called the Resource Vector (RV). If there are N/m slices in the execution unit, then the RAV and RV are N/m bits wide. Each bit in the RAV and RV indicates whether the corresponding slice of the execution unit is allocated or not allocated. When a slice is allotted to an instruction, the bit at the position of that slice is set to 1. When an instruction finishes using the slice, the bit is reset to 0. The Resource Mapper can issue an instruction to one or more slices of a functional unit, setting one or more bits at a time in both the RAV of the instruction and the global RV.
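A minimal software sketch of zero-checking is given here, assuming unsigned operands and 8-bit slices in a 32-bit unit. The function names are hypothetical, a possible carry out of the top allocated slice is not modeled, and the rule used for a left shift (significant bits of the first operand plus the shift amount) is an assumption, since the exact shift circuit is not described.

```python
SLICE_BITS = 8
NUM_SLICES = 4


def slices_for_value(value):
    """Slices spanned by the significant bits of an unsigned value (at least 1)."""
    return max(1, -(-value.bit_length() // SLICE_BITS))  # ceiling division


def slices_needed(op_a, op_b):
    """Zero-checking: the wider of the two operands decides the slice count."""
    return max(slices_for_value(op_a), slices_for_value(op_b))


def slices_needed_shift_left(value, amount):
    """Assumed rule for a left shift: the result occupies the significant
    bits of the first operand plus the shift amount, capped at the unit width."""
    if value == 0:
        return 1
    bits = value.bit_length() + amount
    return min(NUM_SLICES, max(1, -(-bits // SLICE_BITS)))


print(slices_needed(0x12, 0x34))          # both operands fit in one slice
print(slices_needed(0x1234, 0x01))        # first operand spans two slices
print(slices_needed_shift_left(0xFF, 1))  # shifted value spills into a second slice
```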
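The bookkeeping with the RAV and RV can be sketched as bit-mask operations. The class below is a hypothetical model, not the hardware design: it assumes four slices, identifies reservation units by arbitrary keys, and assumes that an instruction receives a contiguous run of free slices.

```python
NUM_SLICES = 4


class ResourceMapperModel:
    """Software model of RAV/RV bookkeeping (bit i set = slice i allocated)."""

    def __init__(self):
        self.rv = 0    # global Resource Vector
        self.rav = {}  # reservation unit id -> Resource Allocation Vector

    def allocate(self, unit_id, count):
        """Allot `count` contiguous free slices; return the RAV or None."""
        run = (1 << count) - 1
        for start in range(NUM_SLICES - count + 1):
            mask = run << start
            if self.rv & mask == 0:      # all slices in the run are free
                self.rv |= mask          # set the bits in the global RV
                self.rav[unit_id] = mask # and in the instruction's RAV
                return mask
        return None

    def release(self, unit_id):
        """Reset the bits of the slices an instruction has finished using."""
        mask = self.rav.pop(unit_id, 0)
        self.rv &= ~mask


mapper = ResourceMapperModel()
mapper.allocate("ru0", 1)
mapper.allocate("ru1", 2)
print(format(mapper.rv, "04b"))
mapper.release("ru0")
print(format(mapper.rv, "04b"))
```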
If the execution units are all known to finish the execution of an instruction in one clock cycle, then a global Resource Vector can be assumed to be an all-zero number at the beginning of every clock cycle, and is redundant. In this case, allocation is done by examining all ready instructions waiting for a resource and determining the number of slices required by each. In the situation where the ready instructions need more slices than available, the instructions can be prioritized based on instruction count and other instructions can be stalled. De-allocation is not necessary here. The Resource Vector will only be needed if some instructions take longer than a clock cycle to finish. Though unused in this work, the use of Resource Vector has been proposed in view of future work, one instance of which is when integer slices are rearranged into a floating point pipeline, with a latency of more than one clock cycle. Fig. 2 shows the block diagram of a sliced ALU, while the flowchart in Fig. 1(b) shows the basic steps in which a sliced ALU functions. The Enable signals in Fig. 2 are fed to D-flip-flops so that only the appropriate part of the ALU functions, while the other parts retain their values. This leads to lower power consumption. The architecture is explained in detail in the next section.
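For the single-cycle case, where the Resource Vector starts every clock cycle at zero, allocation reduces to handing out slices to the ready instructions in priority order. The sketch below assumes that priority means program order and that an instruction that does not fit stalls together with all later instructions; both are assumptions, and `allocate_cycle` is a hypothetical name.

```python
def allocate_cycle(ready, num_slices=4):
    """Allocate slices for one clock cycle.

    ready: list of (name, slices_needed) in program order.
    Returns (issued, stalled): issued maps name -> slice bit mask,
    stalled lists the instructions that wait for the next cycle.
    """
    issued, stalled = {}, []
    next_free = 0  # every slice is free at the start of the cycle
    for name, need in ready:
        if not stalled and next_free + need <= num_slices:
            issued[name] = ((1 << need) - 1) << next_free
            next_free += need
        else:
            # Not enough slices left, or an older instruction already stalled.
            stalled.append(name)
    return issued, stalled


issued, stalled = allocate_cycle([("A", 3), ("B", 2), ("C", 1)])
print(issued, stalled)
```

No de-allocation step appears in the model, which mirrors the observation that it is unnecessary when every instruction completes within the cycle.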
Architecture of integer execution units
The architecture of a sliced integer unit with 8-bit slices is proposed in this section. The integer unit comprises an adder/subtracter unit, a shifter, a logical unit and a comparison unit. The decision to use 8-bit slices in this architecture was based on the trade-off between inter-slice circuitry overhead and performance gain. Experimentation with different slice sizes may be performed before design. The design includes four slices of the adder/subtracter unit, each capable of operating on two 8-bit operands, resulting in a 32-bit sliced ALU. The block diagram of this flexible adder/subtracter unit is shown in Fig. 4.
Adder/Subtracter Unit
As explained before, two functions are associated with slice allocation: directing the operands into the correct operand register slices, and directing the result correctly into an N-bit output register.
The input operands are initially present in N-bit operand registers. If an ALU instruction with two input operand reg- The adder/subtracter units along with input and output decoders constitute the complete flexible adder/subtracter. Area analysis for this module is made in section 3.4.
Compare Unit
The compare operation must be performed on both signed and unsigned operands, and requires slightly different treatment for each. The comparator can be designed for minimal delay or for minimal area, depending on the constraints imposed by the system. Fig. 5 shows the use of such comparison units in a sliced comparator design. The 1-bit control signal takes the value 0 for unsigned comparison and 1 for signed comparison.
Once the sliced comparison is performed, the final result of the compare operation is determined by separate logic circuitry that takes into account the outputs of each compare slice. The control signals for compare operations and the resource allocation vector for each instruction are made available to this circuitry. The final one-bit output of the compare unit is concatenated with (N-1) leading zeros and returned as the output of the comparator unit.
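How per-slice outputs can be combined into a single compare result is sketched here for a less-than test. This is an illustration under assumptions rather than the circuit of Fig. 5: each slice is assumed to report a (less-than, equal) pair, only the most significant allocated slice is assumed to honor the signed control signal, and all names are hypothetical.

```python
SLICE_BITS = 8
SLICE_MASK = (1 << SLICE_BITS) - 1


def compare_slice(x, y, signed):
    """One 8-bit compare slice: returns (x < y, x == y)."""
    if signed:
        # Interpret the slice contents as two's complement values.
        x = x - 256 if x & 0x80 else x
        y = y - 256 if y & 0x80 else y
    return x < y, x == y


def sliced_less_than(x, y, num_slices, control):
    """Combine slice outputs; control is 0 for unsigned, 1 for signed.

    Returns 0 or 1, i.e. the one-bit result with leading zeros.
    """
    outputs = []
    for i in range(num_slices):
        x_i = (x >> (i * SLICE_BITS)) & SLICE_MASK
        y_i = (y >> (i * SLICE_BITS)) & SLICE_MASK
        is_top = i == num_slices - 1
        outputs.append(compare_slice(x_i, y_i, bool(control) and is_top))
    # The most significant slice that is not equal decides the result.
    for less, equal in reversed(outputs):
        if not equal:
            return int(less)
    return 0


print(sliced_less_than(0xFF, 0x01, 1, 0))  # unsigned: 255 < 1 is false
print(sliced_less_than(0xFF, 0x01, 1, 1))  # signed: -1 < 1 is true
```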
Area analysis
The sliced ALU design requires additional hardware for decoders, multiplexers and added signals. The 2:1 multiplexers used extensively in the design can be implemented with transmission gates (pass transistor logic). These are built from an NMOS and a PMOS transistor in a configuration that results in no static power consumption. The pass transistors add three NMOS and three PMOS gates to the hardware. To estimate the hardware used by the decoders that direct input and output signals into the correct register slices, the average cost of the decoders was computed in logic gate equivalents. Table 2 lists the additional hardware used by the various units in a sliced ALU. The additional hardware introduced to implement slicing is minimal. A delay analysis shows that the maximum delay path of the decoders is equivalent to three gate propagation delays. Thus each decoder adds only a nominal delay to the execution datapath.
Architecture Implementation
In order to evaluate the block slicing concept in a processor, it was implemented in a DLX pipeline using VHDL (VHSIC Hardware Description Language). The DLX architecture was designed by Hennessy and Patterson as a representative architecture of practical processors. This section explains the architectural design of the DLX machine and the architectural implementation of the scheme.
DLX Architecture
The DLX is a simple 32-bit load-store architecture described in [3]. The operations supported by the DLX are classified into four major types: ALU, branch, load-store and floating point operations. The control instructions are jumps and branches; branches are conditional, and the condition must be evaluated before the branch is resolved. The floating point unit of the DLX handles all floating point operations as well as integer multiply and divide. The scalar, pipelined implementation of the DLX consists of five stages: instruction fetch, instruction decode and register fetch, execute and effective address calculation, memory access, and write-back. It can be extended to a superscalar pipelined version using general superscalar concepts. The number of pipeline stages and their functions remain similar.
Superscalar, Pipelined DLX implementation in VHDL
We implemented the superscalar VHDL version of DLX as a two-width, five-stage pipelined, 32-bit architecture. It is capable of executing integer arithmetic and logical operations, compare, shift, jump and branch instructions. It does not contain a floating point unit. The architecture uses an instruction cache to store instructions loaded from memory. Fig. 6(a) shows the pipeline stages in this implementation.
Each stage can process two instructions simultaneously. Once valid operands are fetched in the dispatch stage and an instruction is ready to begin execution, the number of slices required for the instruction is computed from the values of the operands. This is done by a zero-checking unit. The Resource Mapper then allocates execution unit slices to the instruction. In addition, the Resource Mapper sets the control signals that slice an execution unit appropriately. The resource vector is a bit vector that indicates the slices allocated to an instruction. For example, if instruction A is allocated slice number 1, then its 4-bit resource vector is 0001. For instruction B with allocated slice numbers 2 and 3, the resource vector is 0110. Thus, the global resource vector during that clock cycle is 0111, indicating that only three slices of the execution units will operate, and the fourth slice will consume only idle power.
Data is loaded into the operand registers at the rising edge of the clock. Due to block slicing, the resource mapping control signals slice the execution unit, and the ALU produces at most two outputs (ALU Output A and ALU Output B) by executing two instructions simultaneously. These results are stored in their respective reorder buffer entries and forwarded if necessary for the next clock cycle. The fetch stage is set up to fetch the next instruction when an instruction is issued to an execution unit. Thus, when a program contains parallelism that slicing can exploit, the fetch stage is also sped up and the total execution time of the program decreases. In the absence of any additional instruction-level parallelism, the execution time of the program remains the same as on a non-sliced processor, since instructions are prioritized by program order.
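The resource vector example for instructions A and B can be reproduced with a few bit operations. The helper names are hypothetical; slice numbers are taken as 1-based, with slice 1 in the least significant bit, which is an assumption consistent with the vectors 0001 and 0110 quoted in the text.

```python
NUM_SLICES = 4


def resource_vector(slice_numbers):
    """Bit vector with bit (n - 1) set for each 1-based slice number n."""
    vector = 0
    for n in slice_numbers:
        vector |= 1 << (n - 1)
    return vector


def idle_slices(global_vector):
    """1-based numbers of the slices that no instruction uses this cycle."""
    return [n for n in range(1, NUM_SLICES + 1)
            if not (global_vector >> (n - 1)) & 1]


rv_a = resource_vector([1])     # instruction A: slice 1
rv_b = resource_vector([2, 3])  # instruction B: slices 2 and 3
global_rv = rv_a | rv_b         # slices busy during this clock cycle

print(format(rv_a, "04b"), format(rv_b, "04b"), format(global_rv, "04b"))
print(idle_slices(global_rv))
```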
Evaluation
The usage of an integer ALU unit was studied by running several benchmarks on a VHDL implementation of the DLX superscalar processor. Table 3 shows the results obtained.
The VHDL program takes a text file containing machine codes as input. It can be simulated using Active-HDL 7.1. Benchmark programs are usually available as assembly-level programs, and such programs for the DLX cannot be used directly as input to the VHDL program. Fig. 6(b) shows the data flow when using the VHDL DLX processor emulator code. Benchmark programs with extension .asm are first converted to a text file with extension .out using a freely available DLX assembler called dlxasm [1]. The dlxasm assembler converts DLX instructions into the corresponding DLX machine codes. Each machine code is indexed by the 32-bit memory address at which the instruction would be stored in a true hardware system. The format of the .asm file and of the converted .out file is given in the Appendix. The .out file is used as input to the simulator engine that contains the VHDL code. The simulation engine produces waveforms for the signals that propagate in the processor. These are stored in a Value Change Dump (.VCD) file and can be viewed with a waveform viewer.
Table 3. Initial usage of ALU units in benchmarks
Results
The performance criteria used for evaluating the concept of slicing are speed-up, throughput, utilization and power. These criteria are widely used for comparing different architectures. To evaluate the performance of the block slicing concept with respect to these factors, hardware code for the DLX processor was developed in VHDL and tested with benchmark programs. The benchmark programs, obtained from various internet sources, were assembly-level programs written for the DLX machine. The .asm files containing the benchmark programs were converted to .out files using the package dlxasm [1] and then run on the VHDL code of the sliced processor. Instead of developing the code from scratch, the freely available VHDL package dlx-vhdl [4] was used as base code and suitably modified for the proposed architecture. Throughput is the number of instructions completed per unit time. It can also be expressed as Instructions Per Cycle (IPC), where the unit of time is a clock cycle. Considering that a new instruction is fetched every clock cycle, the number of fetch cycles indicates the input stream to the architecture, and the number of instructions committed per fetch cycle indicates the output stream of the processor. The throughput is then given as:
Throughput = (number of instructions committed) / (number of fetch cycles)    (1)
The speed-up is computed with respect to the DLX architecture without the processor modifications for block slicing. Thus, speed-up is given as:
Speed-up = (execution time on the non-sliced processor) / (execution time on the sliced processor)    (2)
Resource utilization at the bit level is the percentage of the resource used during execution. It can be expressed as the ratio of the number of times the resource slices were completely used to the total number of times the resource was accessed. Power consumed during execution of two sequential operations is evaluated using the Xilinx XPower tool that is included with Xilinx ISE. The power-delay product is then used to compare the non-sliced and the sliced architectures.
Table 5. Efficiency
Table 4 presents the speed-up obtained for the benchmark programs by listing the execution time of each benchmark on a non-sliced and a sliced processor and using Equation 2. Table 5 presents the efficiency of use of ALU slices. In a non-sliced implementation, each time the ALU is accessed, both potential slices are accessed. In a sliced ALU, each time two instructions are executed in parallel, they are assumed to use two slices each, so that the entire length of the ALU is used. Table 6 shows the throughput of both implementations in terms of instructions per fetch cycle.
For estimation of power consumption, the Xilinx XPower tool was used with synthesizable designs of the sliced ALU and the non-sliced ALU. The ALU is capable of performing addition/subtraction, shift, compare and logical operations. Every combination of two different operations was selected and simulated with worst-case 16-bit operands. Addition and comparison were found to consume the most power. The ALU designs were then analyzed for power consumption while executing addition and comparison of 16-bit operands sequentially on a non-sliced ALU and in parallel on a sliced ALU. Table 7 shows the power-delay product (PDP) obtained in this analysis.
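The four evaluation criteria can be written as small helper functions. This is a sketch that assumes throughput is the number of instructions committed per fetch cycle, speed-up is the ratio of non-sliced to sliced execution time, utilization is the share of accesses that used all slices, and PDP is the product of power and delay. The function names are hypothetical and the numbers in the usage lines are placeholders, not measured results.

```python
def throughput(instructions_committed, fetch_cycles):
    """Instructions committed per fetch cycle."""
    return instructions_committed / fetch_cycles


def speed_up(time_non_sliced, time_sliced):
    """Execution time on the non-sliced design relative to the sliced one."""
    return time_non_sliced / time_sliced


def utilization_percent(full_use_accesses, total_accesses):
    """Percentage of accesses in which all resource slices were used."""
    return 100.0 * full_use_accesses / total_accesses


def power_delay_product(power, delay):
    """PDP: power multiplied by the time taken for the operations."""
    return power * delay


# Placeholder values for illustration only.
print(throughput(150, 100))
print(speed_up(200, 160))
print(utilization_percent(3, 4))
print(power_delay_product(2.0, 3.0))
```

A lower PDP for the sliced ALU would indicate that running the two operations in parallel saves more in delay than it costs in power.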
Conclusion
The concept of resource slicing was implemented in the DLX processor using VHDL. Sliced resources process a greater number of instructions without the need to add extra hardware resources. The sliced resource implementation was evaluated with respect to speed-up, throughput, power and utilization of the integer unit. The results show that by adding one low-latency stage, the Resource Mapper, and minimal hardware, it is possible to obtain a speed-up and a higher efficiency of execution. The number of functional units that need to be pipelined in a superscalar pipeline can also be reduced if the task running on the processor allows it. For a generic processor that runs a variety of applications, each requiring a different number of functional units, this provides a flexible scheme for efficient execution. The performance enhancement still needs to be evaluated at varying superscalar widths and on more benchmarks than were used here. This will help determine the optimal number of slices required for different applications, which can then be used to design sliced processors for maximum efficiency. Block slicing is a general concept that can be applied in a variety of forms to modules other than functional units, such as registers and caches. This requires suitable hardware to address, identify and access sliced data stored in sliced registers and caches. A complete sliced processor will be obtained once these modules have also been sliced.
