Abstract-In many applications, especially signal processing and matrix computations, algorithms are in a highly regular iterative form; access patterns for most variables are highly regular and uniform. Instead of always storing the values of variables back to and retrieving them from memory or register files, it will be much more efficient and cost effective to let those variables intelligently "stay" or "flow" in the data path for future use. In this paper, low cost and simple structured sequencers which are best exemplified by hardware stacks and queues are introduced in the data path for efficiently implementing such a novel concept. Various algorithms are developed to map variables to sequencers and to integrate sequencers into conventional high-level synthesis procedures. Experimental results show very encouraging improvement in the performance of designs as well as significant reduction in hardware cost.
I. INTRODUCTION
The high-level synthesis of digital systems from behavioral descriptions has gained a lot of attention from researchers in the CAD community during the last few years [1] [2] [3] [4] [5]. The two main steps in the synthesis process are the scheduling of operations in the graphical representation (e.g., a data flow graph (DFG) or signal flow graph (SFG)) to control steps (c-steps) and the allocation of hardware that implements the schedule [1].
The allocation task is usually divided into three subtasks, namely functional unit (FU) allocation, interconnection allocation, and storage allocation [1, 2]. The problem of allocation has been addressed by many researchers, including [3, 4, 5]. Most of these researchers have assumed a hardware model that includes a set of FUs, a set of interconnection buses, and a set of registers or register files. Based on this model, storage allocation maps variables to registers or register files [2] [3] [4]. Assigning variables to memory locations in register files offers the designer the advantage of random access, which allows a memory location to be shared by more than one variable if their lifetimes do not overlap. Unfortunately, this flexibility comes at a price. If the number of variables is large, the register files become relatively large, which not only adds address generation and decoding hardware but also lengthens the access delay because of the decoding circuitry and the long data driving lines. Furthermore, the controller becomes larger, since more address lines are needed to address the individual locations inside the register files. These problems motivate the search for alternative methods of storage allocation.
In this paper we introduce the use of what we refer to as sequencers (see footnote 1) as an alternative to register files. Sequencers, which are best exemplified by queues and stacks, depend on the sequence in which variables are written and read to guarantee correct data retrieval. Furthermore, sequencers do not include decoders and hence do not suffer from the disadvantages of random access memories. What motivates the use of sequencers is that after scheduling and FU allocation, all operands to all FUs and all of the sequencing information about these operands (e.g., the control step at which an operand becomes available and the control step at which it is needed by an FU) are already specified. It is therefore possible to arrange that the operands stay floating in the data path and move in a pipeline-like fashion between functional units and sequencers, without the need to store them in a random access memory element. Moreover, we have found that most algorithms of ASIC applications in signal processing and matrix computations exhibit a very high degree of regularity in the way variables are created and needed. Thus when, how, and in what sequence the variables will be needed are usually highly predictable. Conventional approaches to high-level synthesis tend to ignore the fact that the retrieval patterns of many variables are regular or very predictable, and always store them back in random-access register files. In contrast, the approach proposed in this paper tries to take maximal advantage of the regularity and lets those "regular" variables stay in the data path by using sequencers, so that variables are automatically available to FUs whenever they are needed. The number of memory accesses can thus be significantly reduced, resulting in better performance along with a major reduction in the size of the register files.
Such a concept is similar to the way the CRAY-1 supercomputer was designed, which uses pipeline chaining to allow data to stay as long as possible in the data path before being stored back to memory. Although some types of sequencers have been used by digital systems designers in an ad-hoc manner for a long time [15], to our knowledge this is the first time a general and formal approach has been proposed to use them for the memory allocation step in high-level synthesis.
To give the reader an introductory idea about the proposed approach, which will be discussed in detail later, consider the code sequence shown in Fig. 1 (a) that describes a very simple two-input, one-output digital device. The operations are allocated to a multifunctional ALU that can perform both addition and multiplication. The lifetime intervals of variables are shown in Fig. 1 (b). A data path implementation which uses a queue is shown in Fig. 1 (c).
Footnote 1: The term sequencer has been used in the past in a different context to refer to the part of a control unit that drives a digital machine through the specified events that implement the execution of an instruction [7].
31st ACM/IEEE Design Automation Conference. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying it is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1994 ACM 0-89791-653-0/94/0006 3.50
In this paper we will restrict our discussion of sequencers to queues and stacks. We present formal definitions and solutions to the problem of allocating variables to queues and the problem of allocating variables to stacks. We also give a general strategy for applying the proposed allocation procedures to integrate with any conventional storage allocation scheme. The rest of the paper is organized as follows. In the next section the basic hardware concepts and model are presented. The allocation procedure is given in Section III. Examples and experimental results are given in Section IV and conclusions are presented in the last section.
II. PRELIMINARIES
We start by defining some of the terms used in this paper.
Definition 1: A sequencer is a collection of registers connected in a nearest-neighbor manner that enables data to move through the registers in a pipeline fashion. Individual registers are not randomly accessible; instead, data can enter and exit the sequencer only through one or more of the "end" registers.
By virtue of their simple and regular structure, sequencers have the following interesting properties:
a. The data transfer time (i.e., access delay) of the sequencer is almost the same as that of its individual registers and is independent of the size (i.e., the number of registers) of the sequencer.
b. The number of control lines of the sequencer (e.g., push and pop signals in Fig. 2 (b)) is relatively small and is independent of the size of the sequencer.
Several examples of sequencers are shown in Fig. 2. Figures 2 (a) and 2 (b) show a queue and a stack, respectively. Fig. 2 (c) shows a bidirectional queue, which can enter and output data at both ends and thus can function as a queue or as a stack. These three types are discussed further later in this section. Fig. 2 (d) shows a bidirectional ring that enters and outputs data through one of the registers. The ring can be designed to perform multiple shifts in both directions to facilitate accessing data at different times. The designer can modify the designs of these sequencers or even come up with other types of sequencers that are tailored to the nature of the application(s) to be implemented. As mentioned earlier, we will only consider queues and stacks. The remainder of this section gives a detailed description of queues and stacks and their usage in our proposed hardware model.
The proposed hardware model is organized as follows. A set of FUs executes the operations. Inputs and outputs of functional units can be stored in sequencers or in register files. The FUs and memory elements communicate through buses. If more than one bus needs to be connected to the input of an FU, the buses are connected through multiplexers. Outputs of FUs are connected to buses through tri-state drivers. The system uses a two-phase clock.
Our objective is to eliminate register files or at least make them as small in size as possible, by attempting to assign as many variables as possible to sequencers so that the whole data path will have an architecture that resembles a large pipeline consisting of the FUs and the sequencers. It should be emphasized that our approach can be applied to more sophisticated hardware models (e.g., models that support pipelined functional units, or functional pipelining, etc.). This simple model has been chosen for the purpose of illustrating the proposed approach in a clear and abstract manner.
Definition 2: A queue is a column of registers arranged in a shift register fashion so that it can function as a delay element of size n, as shown in Fig. 2 (a).
If a variable v is allocated to a queue Q, then v is stored in Q (i.e., added at the head of Q) at v's write step and is dequeued (i.e., removed from the tail of Q) at v's read step.
Definition 3: A stack is a collection of registers connected together in a bidirectional shift register fashion, such that push and pop are performed by left and right shifts (or vice versa), as shown in Fig. 2 (b).
If a variable v is allocated to a stack, then it is pushed onto the top of the stack at v's write step and is popped from the top of the stack at v's read step.
Definition 4: A bidirectional queue is a collection of registers connected together in a bidirectional shift register fashion such that it can enter and output data at both ends and hence can function as a queue or as a stack, as shown in Fig. 2 (c).
III. PROBLEM FORMULATION
Given a set of variables S, where S = {v_1, ..., v_n}, the objective is to find a mapping from S to a number of sequencers (i.e., queues or stacks) that optimizes a cost function (e.g., number of sequencers or total number of registers). We assume that scheduling and FU allocation have already been done. Therefore every variable v that belongs to S is described by the following information: (Label, Write Step, Read Step, Source, Destination). The label is an integer identifier, the write step is the c-step at which v is defined (i.e., written by the source), and the read step is the c-step at which v is read by the destination. The source and destination are integer values that represent the FU or I/O port that produces and uses variable v, respectively. If a variable v is written or read more than once, then variable splitting [4] is used to convert it into single-write, single-read variables.
Definition 5: A control sequence of a sequencer is the sequence of values assigned to its control lines during all the c-steps of the schedule.
Definition 6: A control sequence CS of a sequencer satisfies the write/read requirements of a variable v if applying CS on the sequencer guarantees that v is written in the sequencer at v's write step and is fetched from the sequencer at v's read step.
In general, two types of constraints have to be dealt with in allocating variables to sequencers. First, we have to make sure that there is no access conflict between variables allocated to the same sequencer. Second, we have to make sure that there is no control sequence conflict between variables allocated to the same sequencer (i.e., there is a control sequence that satisfies the write/read requirements of all variables allocated to a sequencer).
In what follows, we will address the problem of allocating variables to queues and the problem of allocating variables to stacks. The primary objective in both cases is to minimize the number of queues and stacks, which also helps to reduce the total number of registers and interconnect required. Detailed descriptions of algorithms and proofs of lemmas are omitted due to space limitations.
A. Mapping Variables to Queues
In allocating variables to queues, a variable v is added to the head of a queue Q by shifting it into Q, and is dequeued from Q by shifting it out of Q. Under this scheme the allocation procedure is somewhat complicated, because the shift-register-like structure imposes constraints on the mapping. If a variable is mapped to a queue of size n, then the number of c-steps between its write step and read step (i.e., its lifetime interval length, read step - write step) must be greater than or equal to n, because at least n c-steps are needed to shift the variable through the registers of the queue. It should be stressed that the lifetime lengths of variables allocated to a queue need not be equal. For example, v_1 and v_2 in Fig. 3 can be mapped to the same queue of size 3 because the control sequence shown in the figure satisfies the write and read timing requirements of both of them. The necessary and sufficient condition for grouping a set of variables into a queue is as follows:
Lemma 1: A set of variables V = {v_1, v_2, ..., v_k} can be allocated to the same queue of size l iff:
1. [...] and the write step of v_i ≠ the write step of v_j;
2. every variable v_i in V has a lifetime interval that is greater than or equal to l; and
3. there exists a control sequence that satisfies the write and read requirements of all variables in V.
Unfortunately, mapping variables to queues is computationally costly if variables of different lifetime interval lengths are examined one by one to see whether they can be mapped to the same queue, because the search space becomes very large, especially when the number of variables is large. It is therefore crucial to reduce the size of the search space in order to find a practical allocation method. The remedy comes from the fact that we are dealing mostly with applications that exhibit a certain degree of computational regularity. In such applications variables fall into classes or "clusters", where the variables of each cluster have the same lifetime interval length and have distinct yet consecutive write and read steps. Our approach takes advantage of this observation to reduce the search space and simplify the allocation process by dealing with clusters of variables instead of individual variables.
The allocation procedure is divided into two phases. In the first phase, variables are grouped into clusters according to their lifetime interval lengths, such that all variables in a cluster have the same lifetime interval length but distinct write steps. In this way, all variables in a cluster need to stay floating (or to be "delayed") for an equal number of clock cycles. Moreover, since they have distinct read and write steps, they can share a queue that delays them the required number of clock cycles. Then, in the second phase, clusters of variables that do not have conflicting control sequences are merged and allocated to queues.
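The first phase can be illustrated with a minimal Python sketch. The `Var` record mirrors the (Label, Write Step, Read Step, Source, Destination) tuple from the problem formulation; the names `Var` and `cluster_by_lifetime` and the first-fit placement order are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict, namedtuple

# Illustrative record; field names follow the tuple described in the text.
Var = namedtuple("Var", "label write read source dest")

def cluster_by_lifetime(variables):
    """Group variables so that each cluster has one lifetime length
    and pairwise distinct write steps (first-fit placement)."""
    by_length = defaultdict(list)  # lifetime length -> list of clusters
    for v in sorted(variables, key=lambda v: v.write):
        length = v.read - v.write
        for cluster in by_length[length]:
            # Equal length plus distinct write steps implies distinct read steps.
            if all(u.write != v.write for u in cluster):
                cluster.append(v)
                break
        else:
            by_length[length].append([v])  # no cluster fits: open a new one
    return [c for clusters in by_length.values() for c in clusters]
```

Two variables with the same lifetime length and the same write step end up in different clusters, since they could not enter one queue at the same c-step.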
We have adopted a simple sufficient condition for merging a group of clusters into a queue of size k, which is as follows:
1. No two variables in two different clusters have overlapping lifetimes.
2. k is greater than or equal to the maximum lifetime density (i.e., the maximum number of live variables at any c-step) of all clusters and is less than or equal to the minimum lifetime interval of all variables in all clusters.
It is easy to see that the first condition guarantees that there will be no conflict in control sequences. The second condition guarantees that the queue is large enough to hold all the variables assigned to it and small enough to flush all variables out at the required time. In this way a queue can be viewed as a k-cycle delay element (or, more precisely, a delay element of at least k cycles).
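The sufficient condition can be checked mechanically, as in the following hedged sketch. Variables are reduced to (write step, read step) pairs, and a variable is assumed to be live on the half-open interval [write, read); that convention and the function names are assumptions made for illustration.

```python
def lifetime_density(variables):
    """Maximum number of variables live at one c-step (live on [write, read))."""
    steps = range(min(w for w, _ in variables), max(r for _, r in variables))
    return max(sum(w <= t < r for w, r in variables) for t in steps)

def overlaps(a, b):
    """True if the half-open lifetimes of a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]

def queue_size_range(clusters):
    """Return (k_min, k_max) allowed by the sufficient condition, or None."""
    for i, ci in enumerate(clusters):
        for cj in clusters[i + 1:]:
            if any(overlaps(a, b) for a in ci for b in cj):
                return None  # condition 1 violated
    k_min = max(lifetime_density(c) for c in clusters)
    k_max = min(r - w for c in clusters for w, r in c)
    return (k_min, k_max) if k_min <= k_max else None  # condition 2
```

Any k in the returned range satisfies both conditions; an empty range means the clusters cannot be merged under this test, although Lemma 1 might still admit a shared queue.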
In most cases, the size of a queue can be further reduced (post optimized) after performing the initial allocation. The idea is to detect whether there are c-steps during which no variable is shifted into or out of a queue. An equivalent number of registers is then removed from the queue. The control sequence compensates for the missing registers (delays) by freezing (i.e., not shifting) the new queue for an equivalent number of c-steps.
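The trade-off between queue size and freezing can be explored with a small brute-force sketch. It assumes a specific timing model that the text does not spell out: a variable enters the queue by a shift at its write step and must have been shifted exactly `size` times during the steps from its write step up to, but not including, its read step. The function names are hypothetical, and the exhaustive search is only practical for short schedules.

```python
from itertools import product

def find_control_sequence(variables, size, n_steps):
    """Search for shift bits (1 = shift, 0 = freeze) for c-steps 0..n_steps-1.

    variables: list of (write_step, read_step) pairs.
    Assumed model: shift at the write step, and exactly `size` shifts
    during [write, read), so the value sits at the tail when it is read.
    """
    for bits in product((0, 1), repeat=n_steps):
        if all(bits[w] == 1 and sum(bits[w:r]) == size for w, r in variables):
            return bits
    return None

def smallest_queue(variables):
    """Smallest queue size with a valid control sequence, or None."""
    writes = [w for w, _ in variables]
    reads = [r for _, r in variables]
    if len(set(writes)) < len(writes) or len(set(reads)) < len(reads):
        return None  # two variables would enter or leave at the same c-step
    n_steps = max(reads)
    for size in range(1, min(r - w for w, r in variables) + 1):
        bits = find_control_sequence(variables, size, n_steps)
        if bits is not None:
            return size, bits
    return None
```

Under this model, two variables with lifetimes (0, 3) and (1, 4) fit a queue of size 3 that shifts at every step, but also a queue of size 2 that is frozen for one c-step, which is the effect the post optimization exploits.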
B. Mapping Variables to Stacks
The stack allocation approach is based on the notion of stack compatibility. A compatibility relation is established between variables that can share the same stack. In the following paragraphs we will introduce this concept and show how it is used in the allocation approach.
The problem can therefore be represented graphically by a graph G, as illustrated in Fig. 4. Every vertex in G represents a variable v_i in S. There is an edge between v_i and v_j in G iff either condition 1 or 2 holds (i.e., v_i and v_j are stack compatible). Hence the stack allocation problem reduces to the graph clique partitioning problem, which is NP-complete [8]. We have solved this problem using two approaches. In the first approach, which is intended for small problems, the solution is based on an integer linear programming (ILP) model. In the second approach, the problem is solved heuristically to produce suboptimal results in polynomial time [15]. In both cases, the size of a stack in the final design equals the maximum lifetime density of the variables of the stack. The two approaches are presented next.
1) ILP Formulation for Mapping Variables to Stacks
The allocation of variables to stacks can be formulated as an ILP problem. The formulation is based on the following observations:
1. If a variable v_i is not stack-compatible with another variable v_k, then they cannot be allocated to the same stack.
2. All variables should be allocated to stacks.
3. The number of variables allocated to a stack should not exceed the maximum allowed limit of each stack.
The 0-1 Integer Linear Programming Formulation:
The notation and terminology used in our formulation are as follows. Consider n variables that are to be allocated to m stacks. The number of stacks obtained heuristically can be used as a value for m. The maximum allowed size of a stack is M. The variables used in the formulation are the following:
y_j is a 0-1 integer variable associated with stack STK_j such that y_j = 1 if STK_j is required, otherwise y_j = 0 (1 ≤ j ≤ m).
x_{i,j} is a 0-1 integer variable such that x_{i,j} = 1 if variable v_i is allocated to stack STK_j, otherwise x_{i,j} = 0 (1 ≤ i ≤ n, 1 ≤ j ≤ m).
The problem can be formulated as:
Constraint 1 ensures that no incompatible variables will be mapped to the same stack. Constraint 2 ensures that every variable will be mapped to a stack. Constraint 3 states that the number of variables mapped to a stack should be less than or equal to the maximum allowed limit for stacks.
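One 0-1 program that is consistent with the objective and the three constraints described here can be written as follows. This is a reconstruction under assumptions, not necessarily the authors' exact formulation: x_{i,j} = 1 is taken to mean that v_i is placed on STK_j (as the worked solution suggests), and the coupling of x and y through M in the third constraint is one common modeling choice.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{m} y_j \\
\text{subject to}\quad
& x_{i,j} + x_{k,j} \le 1
  && \text{for every stack-incompatible pair } (v_i, v_k),\ 1 \le j \le m, \\
& \sum_{j=1}^{m} x_{i,j} = 1
  && 1 \le i \le n, \\
& \sum_{i=1}^{n} x_{i,j} \le M\, y_j
  && 1 \le j \le m, \\
& x_{i,j},\ y_j \in \{0, 1\}.
\end{aligned}
```

With this coupling, y_j is forced to 1 whenever any variable is assigned to STK_j, so minimizing the sum of the y_j minimizes the number of stacks in use.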
The formulation can be illustrated by considering the variables shown in Fig. 4 as a simple example. There are three variables; thus the value of n is three. By applying the above formulation on the example assuming m=3, and M=3, we have the following:
Minimize y_1 + y_2 + y_3 subject to:
The solution obtained by this formulation is optimal when x_{1,1}, x_{2,3}, x_{3,3}, y_1, and y_3 are set to 1, which results in allocating v_2 and v_3 to one stack and v_1 to another.
2) Heuristic Solution for Mapping Variables to Stacks
In this heuristic we try to minimize the interconnection cost by grouping variables that have the same source or destination (i.e., variables that are created or used by the same FU) and assigning them to the same stack, which maximizes the sharing of interconnection between stacks and FUs. This strategy also helps to guide the search process, since compatible variables usually have a common source or destination. The algorithm builds compatibility classes one by one. It examines variables one at a time to check whether the examined variable is compatible with all elements of the current compatibility class. Variables to be examined are ordered according to their sources and destinations to give preference to variables that can share interconnection. Finally, each compatibility class is assigned a stack.
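The greedy construction of compatibility classes can be sketched as below. The two compatibility conditions are not reproduced in this text, so `stack_compatible` uses an assumed LIFO-style rule (lifetimes strictly disjoint or strictly nested); that rule, the `Var` record, and the sort key are illustrative assumptions rather than the authors' algorithm.

```python
from collections import namedtuple

# Illustrative record; field names follow the tuple described in the text.
Var = namedtuple("Var", "label write read source dest")

def stack_compatible(a, b):
    """ASSUMED rule (not from the text): lifetimes are disjoint or nested."""
    disjoint = a.read < b.write or b.read < a.write
    nested = ((a.write < b.write and b.read < a.read) or
              (b.write < a.write and a.read < b.read))
    return disjoint or nested

def lifetime_density(group):
    """Maximum number of variables live at one c-step (live on [write, read))."""
    steps = range(min(v.write for v in group), max(v.read for v in group))
    return max(sum(v.write <= t < v.read for v in group) for t in steps)

def greedy_stack_classes(variables):
    """Build classes one at a time, visiting variables ordered by source/destination."""
    remaining = sorted(variables, key=lambda v: (v.source, v.dest, v.write))
    classes = []
    while remaining:
        current, rest = [remaining[0]], []
        for v in remaining[1:]:
            # v joins only if it is compatible with every member so far.
            if all(stack_compatible(v, u) for u in current):
                current.append(v)
            else:
                rest.append(v)
        classes.append(current)
        remaining = rest
    return classes
```

Each resulting class would be given one stack whose size is `lifetime_density` of the class, following the statement that the stack size equals the maximum lifetime density of its variables.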
C. A General Memory Allocation Methodology
Since the adequacy of the sequencer allocation procedures is problem dependent, because they try to detect and exploit regularity in access patterns, we suggest a general interactive procedure that assists the designer in applying the most suitable allocation scheme. Our global allocation strategy has three stages. In the first stage we try to allocate all variables to queues, since they are the least costly choice. Then, based on a rejection criterion (e.g., register utilization or size of queue [15]), we reject the queues that do not meet the minimum allowed utilization. The variables that were allocated to rejected queues are used as input to the next stage, in which we try to allocate variables to stacks. Similarly, stacks that do not meet the minimum allowed utilization are rejected. Next, compatible stacks and queues are merged into bidirectional queues [15]. The remaining variables can then be used as input to any conventional allocator to be allocated to register files. The main steps of the procedure are shown below.
Step 1. S = Set of input variables.
Step 2. Map S to queues.
Step 3. Perform post optimizations.
Step 4. S = Set of rejected variables.
Step 5. Map S to stacks.
Step 6. Merge compatible stacks and queues into bidir. queues.
Step 7. S = Set of rejected variables.
Step 8. Map S to register files.
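The control flow of these steps can be sketched as a small driver. The allocators are passed in as callables because their details are not given here; each is assumed to return a list of (size, variables) pairs, and rejection is assumed to be by sequencer size. Post optimization (Step 3) and the merge into bidirectional queues (Step 6) are left out of this sketch.

```python
def allocate_storage(variables, map_to_queues, map_to_stacks,
                     min_queue_size=3, min_stack_size=1):
    """Hypothetical driver: queues first, then stacks, then register files."""
    def split(groups, min_size):
        # Keep sequencers that meet the threshold; release the rest's variables.
        kept = [(size, g) for size, g in groups if size >= min_size]
        rejected = [v for size, g in groups if size < min_size for v in g]
        return kept, rejected

    queues, rest = split(map_to_queues(variables), min_queue_size)  # Steps 1-4
    stacks = []
    if rest:
        stacks, rest = split(map_to_stacks(rest), min_stack_size)   # Steps 5, 7
    # Whatever is left goes to a conventional register-file allocator (Step 8).
    return {"queues": queues, "stacks": stacks, "register_file": rest}
```

The thresholds are parameters because the rejection criterion is a design choice left to the designer.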
IV. EXAMPLES AND EXPERIMENTAL RESULTS
The proposed algorithms have been implemented on a SUN SPARCstation I running SUN OS, and the ILP formulation has been solved on the same machine using the LINDO [6] package. To demonstrate the advantages of the proposed approach, various examples were used in the experiments. Benchmark examples of the 1988 workshop on high-level synthesis [13] were not included because they are mostly irregular or given in an irregular form, and thus do not suit our approach, which exploits regularity. Nevertheless, we found a large number of applications that give excellent results under our approach, two of which are presented in this paper. The CPU execution time for all of them is less than one second.
The first example is an IIR filter borrowed from [9] . The IIR filter is described by:
After scheduling the algorithm for the case N = M = 3 on a multiplier and an adder, we get the set of variables shown in Fig. 5 (a) along with their lifetime intervals. We start by trying to allocate these variables to queues, which results in the following: r1, r2, r3, and r4 are allocated to a queue of size 7, which is post optimized to size 5, and v1, v2, and v3 are allocated to a queue of size 6, which is post optimized to size 3. The remaining variables are allocated to small queues, which are rejected (assuming that queues of size less than 3 are rejected). In the next step, the remaining variables are allocated to stacks. The results of this step are as follows: s2, s3, and s4 are grouped into a stack of size 3, and x1, x2, and x3 are allocated to a stack of size 3. The remaining variable s1 is assigned a stack of size one (i.e., a single register). The final design is shown in Fig. 5 (b).
Table I summarizes the results of applying our approach to this example and compares them with the results of another allocation scheme that allocates to register files. In our comparison we have assumed that the access delay of a register file equals the decoding delay plus the delay associated with a register transfer operation, which is denoted as c. A decoder with n outputs can be realized by (n-1) 1-to-2 decoders. The delay of a 1-to-2 decoder is denoted as d. The first two entries in the table represent address generation and decoding cost, and the last entry represents the access delay of memory elements. It is clear that our approach eliminates address generation and decoding cost and improves access delay.
The second example is an FIR filter used in [9] and [10] . The description of the FIR filter is given by:
A special case of the FIR filter (the 16-point FIR) is a well known example in the high-level synthesis literature [4, 12, 16]. To illustrate the advantages of our approach, we applied our method to the FIR example with several values of N, and we used the more regular FIR representation given in [9]. The graphical representation of the FIR filter with N=4 is shown in Fig. 6 (a). Fig. 6 (b) shows the lifetime intervals of variables when scheduled with a multiplier and an adder. The results of this example for the case N=4 and for the general case N=k are shown in Table II and are compared with those obtained by using register files. Fig. 7 shows the improvement in access delay obtained by using sequencers over using register files, versus N, assuming that the ratio c/d=4. The figure shows that our approach gives a significant speedup that increases as N gets larger.
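The trend in access delay can be made concrete with the cost model stated for the register-file comparison. The sketch below assumes that the (n-1) 1-to-2 decoders of an n-location register file are arranged as a balanced binary tree, so that the decoding delay grows with the tree depth; this arrangement and the symbols T_RF and T_seq are assumptions introduced here, and the entries of Tables I and II may use a different expression.

```latex
T_{\mathrm{RF}}(n) = c + d\,\lceil \log_2 n \rceil,
\qquad
T_{\mathrm{seq}} \approx c,
\qquad
\frac{T_{\mathrm{RF}}(n)}{T_{\mathrm{seq}}}
  \approx 1 + \frac{d}{c}\,\lceil \log_2 n \rceil
  = 1 + \tfrac{1}{4}\,\lceil \log_2 n \rceil
  \quad \text{for } c/d = 4 .
```

Under these assumptions a 16-location register file would be about twice as slow to access as a sequencer, and the gap widens logarithmically as the number of locations grows.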
V. CONCLUSIONS
A new method for storage allocation in high-level synthesis has been proposed. A more sophisticated hardware model that contains what we term sequencers has been applied to the synthesis process, with two main objectives. The first is to improve the data transfer delay of storage elements: sequencers, which are best exemplified by queues and stacks, are mainly implemented as unidirectional or bidirectional shift registers, and hence do not suffer from decoding delays that grow with the size of a register file. The second is to eliminate the cost of memory address generation and decoding. Furthermore, algorithms and procedures have been developed for allocating variables to stacks and queues and for integrating the proposed techniques into conventional high-level synthesis procedures. Experimental results for a number of DSP applications show very encouraging improvement in performance as well as significant reduction in hardware cost. We believe that this approach opens up a new unexplored frontier for the synthesis of high performance ASICs that have a high degree of regularity and require short clock cycles.
