In this paper, we present a new design-space exploration algorithm, the architecture explorer (AE), for analyzing performance/cost tradeoffs in memory-intensive applications. AE evaluates FU, bus, and memory cost for a series of performance constraints to produce a performance/cost tradeoff curve. Unlike previous approaches, AE handles both hierarchical and non-hierarchical memory architectures with various speeds of memory.
Introduction
High-level synthesis (HLS), the process of automatically producing a register-transfer (RT) level design from a behavioral description, is currently an important area of research in design automation. Many HLS methodologies have been proposed, most of which advocate a three-step approach to the synthesis problem [5]. The first step is allocation, in which RT-level components such as functional units (FUs), buses, and memories are selected to implement the design. Next, scheduling assigns operations to control steps, and finally, binding maps operations to specific RT-level component instances. The output of allocation, scheduling, and binding is an RT-level netlist for the datapath and a control unit specification.
The HLS process can be complicated and time-consuming due to the conflicting goals of synthesis (minimum area, minimum execution time, minimum power, etc.); therefore, designers using HLS may explore few alternatives before settling on a final design. To alleviate this problem, researchers developed design-space exploration (DSE) tools to quickly suggest many different alternatives to the designer and help him/her select an initial design which is "close" to satisfying the requirements. This initial design can then be modified slightly or refined, either manually or automatically, to generate the final design.
In general, DSE tools view the design space as a plane with delay on the x-axis and area on the y-axis [8, 10, 14, 15]. A design is defined as a point in the plane with a delay value and an area value. Tools explore the design space by performing a series of area (allocation) estimations for a fixed delay and/or a series of delay estimations for a fixed area (allocation). The important goal in these approaches is to estimate area by optimizing the number of FUs, registers, and interconnect units. Unfortunately, these tools do not take into account the required memory or memory hierarchy frequently employed by designers to reduce cost when large, high-speed memories are needed.
We present a new DSE algorithm, Architecture EXplorer (AE), for memory-intensive descriptions. Unlike the previous approaches, AE estimates memory cost for both hierarchical and non-hierarchical memory architectures in addition to FU and bus cost. AE can therefore be used to determine the "best" memory hierarchy for the design.
AE's main contributions are as follows.
Memory Hierarchy - AE handles a memory hierarchy in which direct communication is permitted only between adjacent levels of memory.
Different Speeds of Memory - In AE, different memories may have different access times, and these access times may take any number (one or more) of clock cycles.
Simultaneous Estimation of All Resources - AE estimates FU, memory, and bus cost together. This allows the designer to see the tradeoffs between FUs, memories, and buses so he/she can determine which one dominates the design cost.
User-Controllable Design-Space Search - In AE, the designer controls the range and granularity of the search.
Previous Work
Many design-space exploration techniques have been proposed in the literature [8, 10, 14, 15]. These approaches estimate area by optimizing the number of FUs, registers, and interconnect units. Although they have been very successful for small problems, they cannot be applied directly to memory-intensive descriptions, which are characterized by large array variables and complex control flow, since it is inefficient to map all array elements into registers.
Several memory synthesis tools have also been proposed; however, they are not directly applicable to memory-intensive applications [1, 2, 7, 11, 13, 16]. For instance, the techniques in [1, 2, 11] are designed to reduce wiring area by using register files or n-port memories instead of distributed registers; however, only scalar variables are permitted in the input descriptions, and memory hierarchy is not allowed. In [13], array variables are permitted, but the memory model is still not hierarchical. The algorithms from [16] synthesize Silage descriptions, where each variable denotes an infinite stream of data. The goal is to optimize the size of storage elements according to the data dependencies in the description. In [7], memory-intensive applications are scheduled onto a fixed target architecture with a datapath, an off-chip memory, and an I/O buffer. The goal in this work is to minimize the size of the I/O buffer.
Architectural Model
Communication is permitted only between adjacent levels of memory. For instance, data cannot be transferred directly from the level 3 memory to the level 1 memory. Instead, we must move the data from level 3 to level 2 and then from level 2 to level 1. Based on this model, an architecture allocation consists of the number $L$ of memory levels, and (1) the number, type, bitwidth, delay, and number of stages (if pipelined) of FUs, (2) the number, bitwidth, and delay of buses, and (3) the number of words, bitwidth, number of ports, and delay of memory for each level $l$, $0 \le l \le L$. Note that the type of an FU refers to the functions it can perform. For example, an ALU which performs ADD and SUB operations is one type of FU, while an adder which performs only ADD operations is another.
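The cost of an architecture allocation, $C_{aa}$, is defined as the FU cost plus the memory and bus cost at every level:

$$C_{aa} = \sum_{u} n_u \cdot \mathit{fu\_cost}(u) + \sum_{l=0}^{L} \left( C^{l}_{mem} + C^{l}_{bus} \right)$$

where $u$ ranges over the unit types in the FU library, $n_u$ is the number of units of type $u$, and fu_cost(u) is the cost of a unit of type $u$. The cost of the memory at each level $l$ is given by

$$C^{l}_{mem} = p_l \cdot b_l \cdot w_l \cdot \mathit{mem\_cost}(l)$$

where $p_l$ is the number of ports on the memory, $b_l$ is the bitwidth of the memory, $w_l$ is the number of words in the memory, and mem_cost(l) is the cost per cell, specified by the user, for the memory at level $l$. The bus cost at each level $l$ is given by

$$C^{l}_{bus} = n_l \cdot \mathit{bus\_cost}(l)$$

where $n_l$ is the number of buses at level $l$ and bus_cost(l) is the cost, specified by the user, for buses at level $l$.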
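The cost model above can be illustrated with a short Python sketch. It assumes that the total allocation cost is simply the FU cost plus the per-level memory and bus costs; the class and function names (`MemoryLevel`, `allocation_cost`, and so on) are hypothetical and are not taken from the AE implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MemoryLevel:
    """One level of the memory hierarchy (hypothetical container)."""
    ports: int            # p_l: number of ports on the memory
    bitwidth: int         # b_l: bitwidth of the memory
    words: int            # w_l: number of words in the memory
    cost_per_cell: float  # mem_cost(l), specified by the user
    buses: int            # n_l: number of buses at this level
    cost_per_bus: float   # bus_cost(l), specified by the user


def memory_cost(level: MemoryLevel) -> float:
    # Memory cost at one level: ports * bitwidth * words * cost per cell.
    return level.ports * level.bitwidth * level.words * level.cost_per_cell


def bus_cost(level: MemoryLevel) -> float:
    # Bus cost at one level: number of buses * cost per bus.
    return level.buses * level.cost_per_bus


def allocation_cost(fu_counts: Dict[str, int],
                    fu_costs: Dict[str, float],
                    levels: List[MemoryLevel]) -> float:
    # Assumed total: FU cost plus memory and bus cost summed over all levels.
    fu_total = sum(n * fu_costs[u] for u, n in fu_counts.items())
    return fu_total + sum(memory_cost(l) + bus_cost(l) for l in levels)
```

For example, two adders of cost 100 each, together with a single-port 16-word by 8-bit memory at 0.5 per cell and one bus of cost 1, give a total of 200 + 64 + 1 = 265.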
Algorithm Outline
The goal of AE is to explore the design space from the fastest, most-expensive design to the slowest, minimum-cost design. This is accomplished by performing a series of cost estimations for different delays.
The inputs to AE are as follows.
Behavioral VHDL description - The description may contain complex control flow such as nested conditionals, case statements, and loops (bounded or unbounded), as well as both array and scalar variables.
Clock period - The designer must specify the clock period in nanoseconds.
The output of AE is a series of architecture allocations for various execution times. Figure 3 shows a flow chart of the AE algorithm. The idea is to estimate the minimum-cost architecture allocation for each execution time in the range. Due to the hierarchical memory model, estimation requires two steps. The first step, which maps the variables in the description to the "most appropriate" level of memory, is explained in Section 4.1. The second step, which estimates an architecture allocation using the variable-to-level mapping, is explained in Section 4.2.
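The outer exploration loop described above might be sketched in Python as follows. This is only an illustration: the function `explore` and its two callback parameters are hypothetical stand-ins for the two estimation steps, and the sketch assumes that execution times are integer clock-cycle counts stepped by the search granularity.

```python
from typing import Callable, Dict, List, Tuple


def explore(fastest: int,
            slowest: int,
            granularity: int,
            map_variables: Callable[[int], Dict[str, int]],
            estimate_cost: Callable[[int, Dict[str, int]], float],
            ) -> List[Tuple[int, float]]:
    """Return (execution time, estimated cost) points of a tradeoff curve."""
    curve = []
    # Walk from the fastest design to the slowest in steps of the granularity.
    for exec_time in range(fastest, slowest + 1, granularity):
        # Step 1: map each variable to a memory level for this constraint.
        mapping = map_variables(exec_time)
        # Step 2: estimate the cost of an allocation under that mapping.
        curve.append((exec_time, estimate_cost(exec_time, mapping)))
    return curve
```

Passing the two steps in as callbacks keeps the loop independent of how the mapping and the resource estimation are actually carried out.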
Variable-to-Level Mapping
The algorithm pseudo-code for variable-to-level mapping is shown below. The basic idea is to map large, infrequently accessed variables to the slow levels of memory and small, frequently accessed variables to the fast levels. The access frequency of a variable is the number of times it is referenced during the execution of the description. The size of a scalar variable is 1 × the bitwidth of the variable, while the size of an array variable is the number of elements in the array × the bitwidth of a single array element.
In line 2 of the pseudo-code, we estimate the access frequency of each variable statically using 50% branching probability for conditionals, (100/n)% branching probability for n-way case statements, and 90% branching probability for loops. Line 3 computes the priority of each variable, which is defined as its access frequency divided by its size. In line 5, we sort the variables in order from minimum to maximum priority, and finally, in lines 6-15 we map each variable (in sorted order) to the slowest level possible such that the execution time constraint is not violated. The complexity of this algorithm is O(V(V + E)), where V is the number of nodes in the CDFG and E is the number of edges. Step 9, testing whether the execution time constraint is satisfied, takes O(V + E) time, and we must add on a factor of O(V) since step 9 is in a loop.
     ...
     Else
(14) End if
(15) End for
End Algorithm
During the level mapping process, all input and output variables are mapped, by default, to the slowest level of memory; however, the user can change this specification if he/she chooses. Note that the variable-to-level mapping completely determines the data-transfer-to-level mapping as well. After variable-to-level mapping, new nodes are added to the CDFG to represent the additional memory accesses and data transfers needed to preserve the memory hierarchy.
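The mapping strategy of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the AE implementation: levels are numbered 1 (fastest) to `num_levels` (slowest), every variable starts at the fastest level, and `meets_constraint` is a hypothetical callback standing in for the execution-time test. Access frequencies are taken as given rather than estimated from the CDFG.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Variable:
    name: str
    elements: int            # 1 for a scalar, array length otherwise
    bitwidth: int
    access_frequency: float  # estimated number of references

    @property
    def size(self) -> int:
        # Size = number of elements * bitwidth of one element.
        return self.elements * self.bitwidth

    @property
    def priority(self) -> float:
        # Priority = access frequency divided by size.
        return self.access_frequency / self.size


def map_variables_to_levels(variables: List[Variable],
                            num_levels: int,
                            meets_constraint: Callable[[Dict[str, int]], bool],
                            ) -> Dict[str, int]:
    # Assumption: unmapped variables are treated as being at level 1.
    mapping = {v.name: 1 for v in variables}
    # Visit variables from minimum to maximum priority.
    for var in sorted(variables, key=lambda v: v.priority):
        # Try the slowest level first, then progressively faster ones.
        for level in range(num_levels, 0, -1):
            mapping[var.name] = level
            if meets_constraint(mapping):
                break
        # If no level satisfies the constraint, the variable stays at level 1.
    return mapping
```

With a toy timing model in which each access costs 1 cycle at level 1 and 4 cycles at level 2, a large, rarely accessed array is pushed to level 2 while a small, heavily accessed scalar stays at level 1 whenever the time budget is tight.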
Resource Estimation
The goal of resource estimation is to determine the minimum number of resources needed to implement the design within the given execution time constraint. In AE, resource estimation is done separately for each basic block of the initial behavioral description, and the results for basic blocks are combined into a solution for the entire description. In this paper, we only explain resource estimation for basic blocks since the method of combining basic block solutions appears in [8].
The inputs for resource estimation are listed in Section 4 (behavioral design description, FU library, etc.). In addition, an execution time constraint E and a variable-to-level mapping are given. Since the variable-to-level mapping is known and the FU library is simple, each data flow node is bound to a specific resource type. For instance, a memory read node may be bound to the memory at level 2, a data-transfer node may be bound to a bus at level 1, or an operation node may be bound to an adder/subtractor. However, we still need to determine the required number of resources of each type.
The algorithm for estimating the required number of resources, Estimate-Resource, must be executed once for each resource type R.
Estimate-Resource performs a binary search to determine a lower bound on the number of resources of type R. We know that the minimum possible number of resources is 1 and the maximum is N, where N is the total number of nodes in the data flow graph of the basic block. We can therefore perform binary search on the sequence of numbers 1 ... N to determine the required number of resources. In order to do binary search, we need the procedure Feasible-Schedule(m), which tells us whether or not there is a schedule for the data flow graph using at most m resources and E clock cycles, where E is the execution time constraint.
Note that the problem solved by Feasible-Schedule(m) was originally defined in [15] and solved using a linear programming approach. Our approach is faster since the complexity of linear programming is dependent on many factors, such as the precision of the solutions. Finally, the worst-case complexity of our algorithm, Estimate-Resource, is lower than the complexity of any previous DSE algorithm [8, 10, 14, 15], since most of these approaches use ILP solvers, which take exponential time, or force-directed scheduling, which has complexity O(N^3).
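The binary search over 1 ... N can be sketched in Python as follows. The sketch assumes that feasibility is monotone (if m resources suffice, so do m + 1), and it takes the feasibility test as a hypothetical callback `feasible_schedule`; it does not reproduce the Feasible-Schedule procedure itself.

```python
from typing import Callable, Optional


def estimate_resource(num_nodes: int,
                      feasible_schedule: Callable[[int], bool],
                      ) -> Optional[int]:
    """Smallest m in 1..num_nodes for which feasible_schedule(m) holds.

    Returns None when even num_nodes resources cannot meet the constraint.
    """
    if num_nodes < 1 or not feasible_schedule(num_nodes):
        return None
    low, high = 1, num_nodes
    # Invariant: feasible_schedule(high) is True, and any m < low is infeasible.
    while low < high:
        mid = (low + high) // 2
        if feasible_schedule(mid):
            high = mid      # mid resources suffice; try fewer
        else:
            low = mid + 1   # mid resources are too few
    return low
```

As a toy usage example, for 10 independent single-cycle operations and a budget of E = 4 cycles, the test `lambda m: -(-10 // m) <= 4` (ceiling division) yields 3 resources. The search calls the feasibility test O(log N) times.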
Experimental Results
AE has been implemented in C on a SUN SPARC 2 workstation. The following experiments show how AE can determine the best memory hierarchy for memory-intensive descriptions. The examples used in the experiments include (1) the centroid computation (CENTROID) from an industrial fuzzy logic controller design [6], (2) the inverse discrete cosine transform (IDCT) from [4], and (3) the Kalman filter (KALMAN) from the high-level synthesis benchmark suite [3]. These examples are memory-intensive since their behavioral VHDL descriptions contain both complex control flow and frequent array-variable accesses.
For each of the examples, we varied the memory hierarchy and memory delay to observe the tradeoffs in cost. Table 1 lists areas, delays, and bitwidths (BW) for the different FUs in the 3 μm CMOS technology, as determined by the estimator from [12], and Table 2 gives the estimated cost per cell for various speeds of memory. Bus delays are assumed to be 2 ns and bus cost is 1. The search range is unrestricted, beginning with the fastest, most expensive design and ending with the slowest, minimum-cost design, and the search granularity is 1, 5, and 10 for the CENTROID, IDCT, and KALMAN examples, respectively.
Tables 3, 4, and 5 list the minimum-cost memory configurations for the CENTROID, IDCT, and KALMAN examples at different execution times. The first column of each table lists execution time in clock cycles. The second column lists the configuration which minimizes total memory cost, while the third column shows the configuration which minimizes total (FU, memory, and bus) design cost.
The complete architecture explorations for the CENTROID, IDCT, and KALMAN examples (as opposed to the abstracted results listed above) appear in [9].
Conclusions
Since memory cost often dominates total design cost, finding the minimum-cost memory hierarchy for a design is an important problem. In fact, our experimental results show that the least-cost design usually employs a memory hierarchy because most of the variables are not accessed on every clock cycle. Determining the "correct" speed of the memory at each level is also important, since some memory speeds (for example, 50 ns versus 100 ns) are more cost-efficient than others.
In summary, memory hierarchy is a cost-efficient design alternative to large high-speed memories, but there are many tradeoffs to evaluate (number of levels, speeds of memories, etc.). Therefore, automatic exploration of different memory hierarchies is among the most important problems to study in the CAD field.
