Most applications of general purpose VLSI processors are developed using high level languages. In these languages, information is generally handled in a structured form. Compilers generate a considerable amount of code to navigate through the data structures, and considerable processing time is spent performing the address calculations required to access them. In this paper, an alternative to software address generation is presented: a hardware memory reconfiguring unit, or address generation coprocessor.
INTRODUCTION
Modern general purpose VLSI processors are designed with the objective of a high average performance for a wide range of applications. However, the efficient solution of some tasks demands the enhancement of the processing capabilities of the processor (e.g. floating point computations). In spite of the advances in microelectronic technology, the addition of functional units to the VLSI processor is restricted by the limited amount of real estate available. A viable alternative for increasing computer performance is based on the use of special purpose units which take some of the computation burden from the main processor. Floating-point processors, CRT controllers, disk controllers, and DMA controllers are but a few examples of currently available devices. In some cases, these units are designed to perform certain restricted data processing tasks in a very efficient manner. In other cases, they help in the management of the computer system resources. For example, memory management units (MMUs) assist the main processor in controlling the elements of the memory hierarchy of a computer system, which consists of a cache, primary storage, and secondary storage.
Most present-day computer applications are developed using structured high level languages (HLLs) such as Pascal and Ada. In an HLL, information is generally handled in a structured form, e.g. as procedures and data structures. In general, HLL compilers generate a considerable amount of code just to navigate through the data structures, which aggravates the problem of addressing overhead. Efficient address generation for data structures is therefore a primary goal in order to obtain high performance in a computer system. This is particularly true in a Reduced Instruction Set Computer (RISC) environment, where address calculations are almost exclusively performed in software since the addressing modes available in a RISC machine are limited. Another area where address computation is important is digital signal processing (DSP). Significant effort has been focused on the design of special purpose processors [1, 3]. More recently, Nwachukwu [12] has proposed an array indexing unit to be used for address generation in an array processor, producing a system which is more versatile than conventional FFT array processors. Although these machines each have their own advantages, each is designed for a particular set of tasks. The success of current special purpose machines is strongly determined by the intended task domains. Hence the machines tend to be well suited for one particular application, but are usually inflexible.
Instead of building special purpose architectures, some designers have concentrated their efforts on using general purpose processors to achieve versatility at the expense of having to generate addresses in software. One example is the fast Fourier transform algorithm and its many variations [2, 4]. Although these algorithms improve the efficiency of the various DSP operations, a significant amount of time is wasted in address calculations since a general purpose processor is used.
Objective
In this paper we first describe a hardware unit that is tailored to function more efficiently with data structures associated with DSP applications. This device can in fact be viewed as a memory reconfiguring unit (MRU) or a coprocessor for accomplishing address transformations.
The objective is to design an MRU which is easy to interface between the host processor and memory and which does not require any modifications to either the host processor or the memory. No special opcodes are introduced into the instruction repertoire, nor does the design presume any modifications to the existing operating system. The function of a unit such as the MRU is to provide the CPU with a set of specialized addressing modes, as defined by algorithms which are frequently used in several major applications.
Then the details of the VLSI implementation of the MRU are presented. The Octtools tool suite [13] is used for the design. The feasibility of a VLSI implementation is thus demonstrated.
Finally, the performance of the MRU is evaluated using popular signal processing algorithms such as convolution, correlation, and the FFT. These algorithms are selected because they utilize different address sequences. A computer system with the MRU and one without it are compared to illustrate the speed-up contributed by the MRU.
PRIMITIVE ADDRESS SEQUENCES
The approach used here is to decompose a desired address sequence into a composition of "primitive" address sequences. A different address sequence may then be generated by specifying a different composition of these primitives. The selection of the primitive address sequences is based on a survey of frequently used algorithms for the solution of real-time DSP problems. They are summarized in Table 1. The MRU, or address generation unit, that is designed and implemented in this paper is capable of generating these addressing sequences in hardware.
Figure 1 illustrates the manner in which the proposed MRU is interfaced to a processor. Once the MRU is introduced, all signal flow between the processor and memory passes through it, which dictates that the MRU be transparent to normal signal flow. No appreciable delay should be introduced when instructions or unmapped data are fetched from memory. The system functions in one of three modes, depending on the address sent out by the CPU. (The term mapped data refers to data sequences which need to be accessed in specified addressing patterns, while unmapped data refers to data which need not.)
DESCRIPTION OF THE MRU SYSTEM
The MRU is a memory mapped device in which each user accessible register is assigned a unique memory address. When the CPU writes to any of these MRU registers, the MRU is placed in the "INIT" mode for one instruction cycle. The CPU thus initializes the MRU registers to set up the desired primitive address sequences.
If the host processor accesses data which is not to be mapped, or if it fetches an instruction, the MRU is in the "PASS" mode and lets the address, data, and control signals pass through to the memory or processor unaltered. Only a decode delay is introduced by the MRU, in order to determine whether the memory space being addressed is mapped or unmapped. If a computer system with a Harvard architecture is employed, the instruction memory can be connected directly to the processor, with the MRU introduced only in the data path. In such a case, even the decode delay will not affect the fetching of instructions.
When the host wants to access mapped data, it sends out a base address which is recognized by the MRU as one of the base addresses specified during initialization. The MRU enters the "MAP" mode for one instruction cycle. Each time a particular primitive address sequence is referenced, the CPU sends out the same base address. The base address thus indicates the address sequence to be generated by the MRU and corresponds to the location at which the first piece of mapped data is stored for this pattern.
The name memory reconfiguring unit, or MRU, is used because the unit makes the memory appear reconfigurable: an addressing pattern is specified in a short initialization routine (with the MRU in the INIT mode), and a base address is specified whenever that pattern is to be used (with the MRU in the MAP mode). The MRU thus provides a set of specialized hardware addressing modes so that different addressing sequences may be generated.
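The three operating modes described above can be summarized in a small behavioral model. The following Python sketch is illustrative only: the names `Mode` and `decode_mode`, the example addresses, and the treatment of reads from MRU register addresses (which the text does not specify and which fall through to PASS here) are assumptions made for the example, not details of the hardware decode logic.

```python
from enum import Enum


class Mode(Enum):
    PASS = "PASS"
    INIT = "INIT"
    MAP = "MAP"


def decode_mode(address, is_write, register_addresses, base_addresses):
    """Classify one bus access into the mode the MRU would enter.

    register_addresses: memory-mapped addresses of the MRU registers.
    base_addresses: base addresses stored during initialization.
    """
    # A write to a memory-mapped MRU register places the unit in INIT mode.
    if is_write and address in register_addresses:
        return Mode.INIT
    # An access to a previously stored base address selects MAP mode.
    if address in base_addresses:
        return Mode.MAP
    # Instruction fetches and unmapped data pass through unaltered.
    return Mode.PASS
```

For example, with hypothetical register addresses `{0xF0, 0xF1}` and a single base address `0x40`, a write to `0xF0` is classified as INIT, an access to `0x40` as MAP, and an access to `0x10` as PASS.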
HARDWARE DESCRIPTION
A functional representation of the address generation unit is shown in Figure 2. Primitive Generation Units (PGUs) are the basic components of the MRU and are used to generate the specific address pattern required. A PGU is combined with additional circuitry that controls it to yield an Offset Generation Unit (OGU). If several data structures are referenced simultaneously in the same application, each of them needs an OGU. Several OGUs can be integrated to yield the Offset Generation Module (OGM). Sometimes several arrays or matrices may be stored in contiguous memory and accessed using just one OGU, depending on the programming technique. Such techniques are not applicable in all cases, however, and many applications require several OGUs. A Decode and Control Logic Unit (DCLU) determines the system mode and is also used to trap and regenerate signals to and from memory.
Decode and Control Logic Unit (DCLU)
The DCLU determines the MRU operating mode and controls the signal flow between the host processor and the memory. The mode selection and data switching portion of the DCLU, shown in Figure 3, determines the system mode by decoding the address lines sent by the host processor. Figure 4 is the functional diagram of the Signal Flow Control (SFC) unit, which is responsible for coordinating all the signal flow. This unit basically performs a multiplexing function. For example, the addresses to memory come from either the CPU or the OGU, depending on whether or not address mapping is to be done. The operation of the DCLU can be described as follows [8], [14].
PASS MODE: When an address associated with either an instruction or unmapped data is sent by the host processor, the DCLU decodes the address in order to determine that the MRU is in the PASS mode. Consequently the SFC allows the address, data and control signals to pass through with only a decode delay.
INIT MODE: If the address sent by the host processor indicates the INIT mode, the DCLU sends the subsequent data to the OGU in order to initialize the appropriate registers and counters. Typically the data includes the length of the sequence, the type of sequence, the constant offset, and the other control words. During the INIT mode, the base address registers are loaded with base addresses by the SFC. The address indicating the INIT mode is sent out repeatedly until all data required for initializing the MRU have been written.
MAP MODE: A mappable base address is decoded by the DCLU, which compares the incoming address with the previously stored base addresses. If a match is found, the SFC intercepts the R/W signal and holds the data bus until the mapped address is generated. After the mapped address generated by the OGU is sent to the SFC, the SFC enables the R/W signal and the data flow. Every time the base address is received, the DCLU activates the OGU and the appropriate segment of the PGU to generate the requested addressing sequence.
Offset Generation Module (OGM)
The Offset Generation Module (OGM) consists of several Offset Generation Units (OGUs) (three in Fig. 2). Each Offset Generation Unit consists of a Primitive Generation Unit (PGU), a control word register, an adder, and a register. The control word register contains the control information necessary to activate each PGU.
The control word registers are loaded by the host processor while the MRU is in the INIT mode. In the adder, the base address from the base address register is added to the current element of the requested addressing sequence from the PGU to provide the final mapped address. The final mapped address is then routed back to the SFC in the DCLU, which generates the R/W signal, puts the address on the address bus, and passes the contents of the data bus to or from the memory.
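The adder stage described above can be modeled in a few lines. This is a minimal behavioral sketch, not the hardware design: the class name `OffsetGenerationUnit`, the method `next_address`, and the choice to supply the PGU output as a plain Python iterable are assumptions made for illustration.

```python
class OffsetGenerationUnit:
    """Behavioral model of one OGU: mapped address = base address + PGU offset."""

    def __init__(self, base_address, offsets):
        # base_address models the base address register loaded in INIT mode.
        self.base_address = base_address
        # offsets stands in for the PGU output (one offset per MAP access).
        self._pgu = iter(offsets)

    def next_address(self):
        # Each MAP-mode access consumes the next element of the sequence
        # and adds it to the base address to form the final mapped address.
        return self.base_address + next(self._pgu)


# Hypothetical example: base address 0x40 with a 4-element offset pattern.
ogu = OffsetGenerationUnit(0x40, [0, 2, 1, 3])
mapped = [ogu.next_address() for _ in range(4)]
```

In this example the host would issue the same base address `0x40` four times, and memory would see the addresses `0x40`, `0x42`, `0x41`, and `0x43` in turn.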
Primitive Generation Unit (PGU)
The PGU is the most essential part of the OGU. It consists of four different hardware units which are responsible for generating the different primitive addressing sequences outlined in Section 2.
(1) Sequential addressing sequences can be generated by a simple parallel-input, parallel-output up/down counter.
(2) Sequential-with-offset ($k$) addressing sequences can be generated by an up/down counter and an adder/subtractor unit. One of the two operands is obtained from the sequential counter and the other is the constant offset $k$, stored in a register which was loaded while the MRU was in the INIT mode.
(3) Shuffled and bit-reversed addressing sequences can be generated by a counter-multiplexer combination unit, as shown in Figure 5. To generate shuffled addressing for a sequence of length $N = 2^n$, the output of an $n$-bit counter is input to $n$ $n$-to-1 multiplexers. Each multiplexer is controlled by $\log_2 n$ control bits corresponding to the base of the required shuffled addressing sequence, and thus selects one of the $n$ inputs for generating an $n$-bit word. For the bit-reversed addressing sequence, the control bits are simply set to select the counter output bits in reversed order, i.e. counter outputs $C_{n-1}, C_{n-2}, \ldots, C_1, C_0$ are selected in the order $C_0, C_1, \ldots, C_{n-2}, C_{n-1}$.
(4) Reflected addressing sequences may be generated by a hardware unit as shown in Figure 6. In the reflected address generation unit, the counter is incremented every other clock cycle. For a counter value $i$, the $i$-th element $A_i$ is fetched, and on the next cycle $i$ is subtracted from the constant $(N - 1)$ to fetch the element $A_{N-1-i}$.
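The four primitive sequences can be illustrated with a short software model. The Python sketch below is a behavioral approximation under stated assumptions, not the circuit itself: all function names are hypothetical, the multiplexer bank is modeled as a list `select` in which `select[j]` names the counter bit routed to output bit `j`, and the shuffle is assumed to be a one-bit rotation of the counter bits (one common form of perfect shuffle; the paper's Table 1 may define other bases).

```python
def permute_bits(count, select):
    """Model of the multiplexer bank: output bit j takes counter bit select[j]."""
    out = 0
    for j, src in enumerate(select):
        out |= ((count >> src) & 1) << j
    return out


def sequential(length, down=False):
    """(1) Up/down counter output."""
    counts = list(range(length))
    return counts[::-1] if down else counts


def sequential_with_offset(length, k):
    """(2) Counter output plus a constant offset k."""
    return [i + k for i in range(length)]


def bit_reversed(n):
    """(3) Counter bits selected in reversed order, for N = 2**n."""
    select = [n - 1 - j for j in range(n)]
    return [permute_bits(c, select) for c in range(1 << n)]


def shuffled(n, rotate=1):
    """(3) Assumed shuffle: counter bits rotated left by `rotate` positions."""
    select = [(j - rotate) % n for j in range(n)]
    return [permute_bits(c, select) for c in range(1 << n)]


def reflected(N):
    """(4) Counter i advances every other cycle; odd cycles fetch N - 1 - i."""
    return [t // 2 if t % 2 == 0 else N - 1 - t // 2 for t in range(N)]
```

With $n = 3$ (so $N = 8$), `bit_reversed(3)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`, `shuffled(3)` gives `[0, 2, 4, 6, 1, 3, 5, 7]`, and `reflected(8)` gives `[0, 7, 1, 6, 2, 5, 3, 4]`. Modeling the multiplexers as a bit-selection list mirrors the description in item (3): changing the control bits changes only `select`, not the counter.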
VLSI IMPLEMENTATION
We implemented a scaled down version of the address generation unit described above. This version has 8-bit address and data lines. Our purpose is to implement the 8-bit version and then analyze the scalability of the design to find out how much more complex a 32-bit version would be. We can thus study the VLSI feasibility of our design.
VLSI Tools
The chip was developed using VEM and the OCT tool suite [13]. VEM is an interactive graphics editor for designs represented using the OCT data manager. VEM supports physical and symbolic editing of IC designs, as well as schematic capture. The OCT tools consist of a complete set of tools to facilitate the design process.
In OCT, the basic design unit is the cell. Each cell is an arbitrarily complex portion of the design and may include other cells in a hierarchical fashion. Cells can initially be described in bds, a Pascal-like high level language. The bds functional description is then translated to logic equations using bdsyn. The logic is optimized and mapped into standard cells by misII, a multiple-level logic synthesizer. Then wolfe randomly assigns locations for all terminals and passes the information to TimberwolfSC, a standard cell place and route program. Minimizing the number of vias in a design improves the yield, and mizer performs this task.
BDSYN is a tool for quickly describing and implementing combinational logic. It can be used for describing combinational logic blocks only (blocks which do not use signal latching of any kind). Describing a finite state machine requires the logic designer to add external latches to the described combinational logic blocks. If the designer wants to specify the logic at the gate level, logic gates and flip-flops can be interconnected with bdnet. Thus bdnet yields the exact hardware that is in the designer's mind. During hierarchical design, bdnet is also the tool used to integrate the sub-modules to yield larger modules and finally the chip.
Mosaico is a complete set of tools for floorplanning, placement, and routing of macro-cells. Macro-cells are functional blocks that are created by module generators, manual layout, or any number of other means. Mosaico consists of five basic steps executed in sequence as a pipeline: floorplanning and placement, channel definition, global routing, detailed routing, and compacting. All these tools can be used to yield area efficient IC designs easily.
Implementation Details
The elements of the DCLU such as the Decode unit and the SFC section were realized using gates in the standard cell library. The standard cell based designs are easy to modify. The SFC unit is constructed in a bit-slice fashion to take advantage of the inherent regularity. After the SFC elements for one bit are constructed, the instances are just replicated to yield the SFC for the whole bus.
To construct the Offset Generation Module, first the four circuits to generate the different address sequences were independently realized. They were integrated together with a multiplexer to yield the Primitive Generation Unit (PGU). The various registers in the OGU, including the control word registers, were wired using flip-flops from the cmos library, and they were integrated with the PGU and the adder to yield an Offset Generation Unit. The OGUs can then be replicated, resulting in the Offset Generation Module, and the OGM can be integrated with the DCLU to obtain the final chip. During the integration, the OGUs should be placed in such a way that the wires from the DCLU to each OGU can be routed with minimum complexity.
The implementation of the MRU with one OGU is shown in Figure 8. One OGU consumes less than 2 mm × 2 mm of chip area if fabricated using 1 micron technology. More OGUs may be integrated into the unit in Figure 8 to yield a full capability MRU. A layout with three OGUs, according to the floor plan in Fig. 9, is shown in Figure 10. This version with three OGUs can be fabricated on a 4 mm × 4 mm die (assuming 1 micron technology).
VLSI Aspects
There are several important aspects to bear in mind when designing an architecture for VLSI implementation; the major ones include regularity, modularity, scalability, and chip area.
VLSI designs should be regular structures, since the time and effort required to produce a design depend more on the number of different elementary cells required than on the total number of transistors used. For instance, it is quicker to produce a large register bank by replicating a properly designed single register bit than to generate a random logic function which uses far fewer transistors. Random logic can also be implemented in regular structures such as a PLA or ROM, but an implementation based on gates has higher speed. The intermediate approach is to use a cell library that consists of standard gates, latches, flip-flops, etc. The designer picks appropriate cells and wires them together. This might not result in the smallest possible layout, but the speed can be good.
Modular designs help in easy debugging and facilitate modifications and future expansions.
Efforts should be made to reduce the area of a chip. The cost of manufacturing an integrated circuit depends on chip area (A). More important is the fact that die defects make costs increase more rapidly than O(A), and chips above a certain area cannot be practically manufactured. The pin count of a chip is another important determinant of chip cost. Every wire that connects a pin on the package to a bond pad on the chip is a potential source of a defect, and the yield of packages with large numbers of pins is usually small. So chips with large numbers of pins are generally expensive.
The various elements of the MRU were designed keeping these aspects in mind. We exploited regularity wherever possible in building up the chip. The SFC unit was constructed in a bit-slice fashion to take advantage of the regularity: the SFC elements for a single bit were first designed and then replicated to yield the whole SFC unit. The Offset Generation Module (OGM) likewise contains several copies of the OGU and associated hardware. One OGU was carefully custom designed; this OGU may be replicated to create the OGM.
The design was done in a highly modular fashion. This helped in developing the chip module by module, with sufficient testing at the module level before integration, so that bugs were removed at the sub-module level itself. Another factor that had to be taken care of was the I/O requirement of each sub-module. We designed the modules in such a way that the area for interconnection could be minimized.
Regarding area/time trade-offs, reducing time was the main emphasis. Although area had to be kept small, we traded area for speed wherever the choice had to be made. Since the device is to be introduced between the CPU and the memory, speed is of primary concern.
Another aspect in a VLSI implementation is scalability. It is particularly important here, since the original design was for a 32-bit processor, but only an 8-bit version was implemented. From the 8-bit implementation, we can speculate about the complexity of the 32-bit chip. The 8-bit DCLU can be expanded to a 32-bit version with a linear increase in complexity. The 32-bit OGU can also be realized with O(n) complexity. Hence we may conclude that the chip can be constructed with a linear increase in complexity. Another factor in scalability is the complexity with an increasing number of Offset Generation Units (OGUs). Figure 10 incorporates three OGUs. More OGUs can be incorporated with a linear increase in complexity.
Designing circuits with testability aspects in mind is also important, since testing costs can be a significant portion of the total device costs. The modular and regular structure of the chip makes fault tolerant design with redundant OGUs very easy. This chip is also suitable for wafer-scale integration. The number of OGUs to be incorporated in the OGM can be decided at the wafer-scale integration stage.
PERFORMANCE EVALUATION
The usefulness of the MRU device may be illustrated with several frequently used digital signal processing algorithms such as convolution, correlation, fast Fourier transform, and matrix multiplication. To illustrate how the MRU can be used for these algorithms, a detailed analysis of one of them (convolution) is provided.
Algorithm Analysis

Performance Analysis
The four signal processing algorithms were implemented on an HP/UNIX system using assembly language. The time required for execution of the programs was calculated based on the assembly language program. The convolution and correlation programs deal with fixed-point arithmetic operations. The FFT and matrix multiplication programs operate on floating-point data. The MC68020 microprocessor is used as the host processor, and the MC68881 floating point coprocessor is employed when needed, in both the case with the MRU and the case without it. Calculation of the execution time is based on the MC68020 and MC68881 instruction execution times in terms of external clock cycles. There are three values given for each instruction [10], [11].
(1) Best Case: The best case reflects the time when the instruction is in the cache and benefits from the maximum overlap due to other instructions.
(2) Cache Case: When the instruction is in the cache but has no overlap, the value of the cache-only case is employed.
(3) Worst Case: The worst case reflects the time when the instruction is not in the cache, or the cache is disabled, and there is no instruction overlap.
Table 2 gives equations which specify the execution time for the four algorithms studied. $T_{MRU}$ and $T_{NMRU}$ represent the execution time in clock cycles for the system with the MRU and without the MRU, respectively. The equations in Table 2 consist of $N^3$, $MN^2$, $MN$, $N \log_2 N$, $N$, $\log_2 N$, and constant terms. The $N^3$, $MN^2$, $N^2$, $MN$, and $N \log_2 N$ terms are due to nested loops with different boundary values. The $N$ and $\log_2 N$ terms result from single loops. The length of the data sequences in the convolution, correlation, and FFT programs is $N$. The dimension of the matrix in the matrix program is $M \times N$, with $M$ not necessarily equal to $N$. The programs included in this study are not the only way to implement the algorithms, so the execution times obtained are not absolute.
The execution times in systems with and without the MRU may be used to calculate the speed-up contributed by the MRU. We define a speed-up factor, $T_{NMRU}$ divided by $T_{MRU}$, to indicate the improvement contributed by the MRU. Figure 11 illustrates the variation in the speed-up factor as the length of the data vector $N$ varies. As shown in Table 3, for large values of $N$, the speed-up varies between approximately 1.5 and 2.4 for the four algorithms.
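The speed-up factor can be written out explicitly. The short derivation below is a sketch under an assumption that is not taken from Table 2: that, for a given algorithm, both execution times are dominated by the same highest-order term $g(N)$ (for example $N^2$ or $N \log_2 N$), with hypothetical positive leading coefficients $a$ and $b$. Under that assumption the speed-up approaches a constant for large $N$.

```latex
S(N) = \frac{T_{NMRU}(N)}{T_{MRU}(N)}, \qquad
T_{NMRU}(N) = a\,g(N) + o\bigl(g(N)\bigr), \quad
T_{MRU}(N) = b\,g(N) + o\bigl(g(N)\bigr)
\;\Longrightarrow\;
\lim_{N \to \infty} S(N) = \frac{a}{b}.
```

In other words, the lower-order terms and constants matter only for small $N$; for long data vectors the speed-up would be governed by the ratio of the cycle counts in the innermost loops.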
CONCLUSION
We have presented the design, VLSI implementation, and performance evaluation of an MRU, a hardware unit that performs the address generation task required during access of data structures, especially in signal processing applications. The operation of the device, the hardware modules involved, and the design of the various functional units are described. Various addressing sequences involved in signal processing algorithms are analyzed, and the hardware necessary to generate them is designed. The whole design is targeted for a VLSI implementation, and the various trade-offs involved in the design process are explained.
Details of a VLSI implementation of the MRU are then presented. A scaled down version of the design is implemented using the Octtools tool suite. The performance of the device is then evaluated with four frequently used algorithms from digital signal processing: convolution, correlation, FFT, and matrix multiplication. A computer system employing the MRU can speed up these algorithms by factors of as much as 2.4 compared with a system without the MRU. The high performance results from the reduction in the overhead associated with repetitive calculation of addresses in software.
The MRU system can be modified in hardware to adapt to the unique environments found in digital signal processing applications and also to the more general demands of general purpose computing.
