Abstract: This paper describes the use of Unified System Construction tools under development at the University of Southern California. The goal of the project is to automate the construction of heterogeneous, application-specific systems. Key elements of the USC system include multiprocessor synthesis, multi-chip datapath synthesis, memory-intensive synthesis, and multi-chip partitioning. The tools were applied to the design of an image compression chip set, and the results are reported here. Our results are comparable to manual designs reported in the literature.
Introduction
Communications, entertainment, and other electronic systems are in widespread use. These systems are generally multi-chip, heterogeneous, and application-specific. Chip-level synthesis tools are invaluable for the rapid production of such systems, and such tools are becoming available for general use. System-level tools can also significantly increase a designer's ability to meet a schedule along with a set of performance and cost constraints, but only a few of these tools have been available in the past.
The Unified System Construction (USC) project at the University of Southern California is producing an integrated set of system-level tools for synthesizing multi-chip, heterogeneous, application-specific systems that meet cost, performance, and power constraints. This paper presents the use of these system-level tools in a multi-chip design exercise, a JPEG image compression chip set. The focus of the USC project is on real-time systems, such as entertainment and communication technologies, but it does not exclude other applications requiring specialized system design. A block diagram of the system is shown in Figure 1. The design styles currently available are:
- non-pipelined processors in a ring,
- non-pipelined processors connected by a bus,
- pipelined processors with point-to-point connections,
- multiple custom VLSI chips, communicating asynchronously,
- multiple custom VLSI chips, communicating synchronously, with a common clock, and
- memory-intensive modules consisting of a custom VLSI chip and a separate memory chip.
Many other styles of systems are currently under development. Once a style is selected, specialized tools are invoked to complete the design process. Ultimately, any custom VLSI chips which must be synthesized are processed by the ADAM high-level synthesis system, which produces a cell netlist. This netlist is input to the Cascade Design Automation ChipCrafter silicon compiler, and a chip layout is produced.
The following sections give an overview of each major style of design, in the order each was applied to the compression example. The remaining sections describe the image compression system to be designed and detail various design activities conducted using the USC tools.
Synthesis of Memory-Intensive Systems
A subset of the USC tools performs automatic synthesis of memory-intensive application-specific systems, with emphasis on hierarchical storage architecture design. The storage architecture is closely connected to the datapath of the system, and isolating its synthesis from datapath synthesis may not result in an efficient solution. Therefore, the designs of the datapath and the storage architecture are coordinated in the USC tool set.
Figure 1: Block diagram of the Unified System Construction (USC) Project
SMASH [9] (Figure 2), the tool set for memory-intensive design, is used for systems designed for specific applications, where the memory-access pattern is not only relatively fixed but also known beforehand. This mostly deterministic access characteristic allows the designs to be more specific, and hence more efficient. The original MIMOLA system was the first system to make tradeoffs in the use of multiport memories [14]. Lippens et al. [13] describe techniques to perform automatic memory allocation and address allocation for high-speed applications. They synthesize memory after the design of arithmetic units, datapath scheduling, and allocation. IMEC's CATHEDRAL-II compiles multi-dimensional data structures into distributed dual-port register files and single-port SRAMs. It uses a polyhedral-based model for high-level memory management for linear, piecewise linear, and data-dependent signal indexing [7]. SMASH, however, combines the tradeoffs in the datapath as well as the storage architecture, as explained below.
SMASH performs high-level synthesis of an integrated system consisting of a datapath and a two-level memory hierarchy, from a given behavioral specification and with constraints on cost and performance. The two memory levels are:
1. On-chip foreground memory. This consists of two subparts: datapath memory to store the intermediate or temporary variables in the datapath, and I/O buffers to temporarily store the inputs and outputs to the datapath chip, allowing fast access and off-chip storage of the data.
2. Off-chip background memory. In our model this is the bulk storage required for the inputs and outputs.
The synthesis is performed in two steps. First, datapath synthesis with operation scheduling is performed, combined with scheduling of local data transfers to and from memory. As a result of this scheduling, constraints are placed on the memory structure. During the second step, the storage hierarchy design is completed, which includes determining the data transfers between different levels of the memory hierarchy and completing synthesis of the storage structures. We ensure that each step of the stepwise construction of the system takes into account the next step by looking ahead, so that the next step is not overly constrained. Global design parameters, like the memory bandwidth and timing constraints, are considered when constructing the partial design in each step, tying the whole synthesis process together.
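To illustrate how a datapath schedule can place constraints on the memory structure, the following minimal Python sketch derives two such constraints from a list of scheduled transfers: the peak number of buffer accesses in any control step (a bandwidth or port requirement) and the peak number of simultaneously live values (a buffer-size requirement). The function name `memory_constraints`, the input format, and the sample schedule are illustrative assumptions, not the SMASH algorithm or its data.

```python
from collections import Counter

def memory_constraints(transfers):
    """Derive buffer constraints from scheduled transfers (hypothetical format).

    transfers: list of (write_step, last_read_step) pairs, one per buffered value.
    Returns (peak accesses in any control step, peak number of live values).
    """
    accesses = Counter()  # control step -> number of buffer reads/writes
    live = Counter()      # control step -> number of values held in the buffer
    for write_step, last_read_step in transfers:
        accesses[write_step] += 1
        accesses[last_read_step] += 1
        for step in range(write_step, last_read_step + 1):
            live[step] += 1
    peak_accesses = max(accesses.values(), default=0)
    peak_live = max(live.values(), default=0)
    return peak_accesses, peak_live

# Invented schedule: four values, each written once and read once.
schedule = [(0, 2), (0, 3), (1, 3), (2, 4)]
ports, words = memory_constraints(schedule)
print(ports, words)  # 2 accesses per step at most, 4 words live at most
```

A scheduler that looks ahead in the way described above could use such figures to reject schedules whose port or buffer requirements exceed what the second step can implement.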

Synthesis of Asynchronous Multi-chip Systems
In practice, we find that many DSP and other ASIC designs consist of multiple concurrent and interacting processes. Though high-level synthesis has received enormous attention over the years, most approaches were concentrated on synthesizing single process (one thread of control) designs. Synthesizing a design with multiple concurrent processes poses many new challenges. For example, since the processes interact with each other, the synthesis tool has to solve all the timing constraints imposed by one process on another concurrently [18] . Furthermore, resource allocation for each process on a chip cannot be done without taking into account the area versus performance characteristics of each process since the total resources taken by all processes on a chip are limited by the chip package. The goals of this research are to provide an integrated system for synthesizing multi-chip designs with multiple concurrent processes as well as to speed up the redesign of multi-chip systems. Figure 3 shows a flow chart which illustrates the approach used in this synthesis system. First, the multi-process specification is translated from VHDL to a synthesizable representation called the Design Data Structure (DDS) [3] . The next step is to perform a number of process transformations in order to trade off among hardware sharing, control complexity, communication overhead and cost. A process-level chip partitioner, ProPart [4] , is then used to find new cost-effective chip boundaries according to the up-to-date packaging library. In addition, the partitioner will distribute chip resources to the processes according to their performance-versus-area characteristics and determine the interconnection structure as well. Next, a concurrent approach for multiple-process synthesis is used to synthesize each process into its own datapath and control path. The objective is to meet the timing, area and performance constraints as well as to synchronize the communication among the processes. 
Finally, we use a hybrid symbolic/numeric simulation to verify the functional and timing correctness of the RTL implementation [5]. The RTL implementation is submitted to the ADAM system to obtain the final chip layout.
ProPart was used in the JPEG image compression experiment described later in this paper. Unlike most previous behavioral partitioning approaches, which focus on partitioning design behaviors at the operation level into a number of synchronized chips, ProPart partitions a set of sequential and/or concurrent behaviors into custom chips. There are several advantages to process-level partitioning. For example, there are far fewer objects at the process level than at the operation level, which allows us to use much more comprehensive techniques such as mixed integer-linear programming, and at the same time to take into account more partitioning issues, such as chip package selection and chip resource distribution.
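The small number of objects at the process level is what makes an exact search practical. As a minimal sketch of process-level partitioning with package selection, the Python code below enumerates every assignment of processes to chips and picks, for each chip, the cheapest package whose area and pin capacities fit. Exhaustive enumeration is swapped in here for the mixed integer-linear programming that ProPart uses, and the function names, process areas, communication widths, and package library are all invented for illustration.

```python
from itertools import product

def cheapest_package(area, pins, packages):
    """Cost of the cheapest (area_cap, pin_cap, cost) package that fits, or None."""
    fits = [cost for area_cap, pin_cap, cost in packages
            if area <= area_cap and pins <= pin_cap]
    return min(fits) if fits else None

def partition(areas, edges, packages, max_chips):
    """Exhaustively assign processes to chips, minimizing total package cost.

    areas: {process: area}; edges: (process_a, process_b, wire_count) triples.
    Pins of a chip are the wires that cross its boundary.
    """
    names = sorted(areas)
    best_cost, best_assign = None, None
    for labels in product(range(max_chips), repeat=len(names)):
        assign = dict(zip(names, labels))
        total, feasible = 0, True
        for chip in set(labels):
            area = sum(areas[p] for p in names if assign[p] == chip)
            pins = sum(w for a, b, w in edges
                       if (assign[a] == chip) != (assign[b] == chip))
            cost = cheapest_package(area, pins, packages)
            if cost is None:
                feasible = False
                break
            total += cost
        if feasible and (best_cost is None or total < best_cost):
            best_cost, best_assign = total, assign
    return best_cost, best_assign

# Invented data: three processes and a three-entry package library.
areas = {"dct": 60, "quant": 20, "huff": 25}
edges = [("dct", "quant", 16), ("quant", "huff", 16)]
packages = [(50, 40, 10), (70, 64, 18), (100, 64, 30)]
cost, assign = partition(areas, edges, packages, max_chips=2)
print(cost, assign)
```

With these invented numbers the cheapest solution places the largest process alone in the mid-sized package and groups the two small processes in the smallest package. The search grows exponentially with the number of processes, which is why it is only plausible at the process level.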
SpecPart [19] is the first system-level behavioral partitioning work which elevates the objects to be partitioned to a higher level of abstraction (such as processes and procedures), and uses a group migration technique similar to the Kernighan-Lin algorithm for partitioning. A comprehensive survey of other behavioral partitioning approaches at the operation level has been done by Vahid [20].
Multiprocessor Synthesis
Multiprocessors of various styles can be synthesized using the SOS (Synthesis of Systems) [17] set of tools. The input to SOS is a specification in the form of a task flow graph. SOS decides on the number and types of processors to be used, the interconnections between them, and the schedule of execution of tasks onto processors.
Related work on system synthesis includes graph-based theoretical approaches [1], an analytical modeling approach [10], and a mathematical programming formulation [6]. The approach used by SOS considers a more general case than the ones studied previously, that is, scheduling and allocation of tasks related by a precedence
SOS tools use mixed integer-linear programming (MILP) to model the synthesis problem. Some of the tools can generate the MILP constraints automatically, and others currently rely on constraints written by the user. The tools rely on a branch-and-bound solver called BOZO [11]. Figure 4 shows the overall operation of the SOS tools.
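To give a feel for the kind of search a branch-and-bound solver performs on this problem, the following Python sketch assigns tasks to candidate processors so that the total cost of the processors actually used is minimized while every processor finishes its tasks by a deadline. It is a simplified stand-in, not the SOS MILP formulation or BOZO: tasks are treated as independent (precedence and communication are ignored), execution times are assumed to be positive integers, and the name `synthesize` and all numbers are hypothetical.

```python
def synthesize(task_times, proc_costs, deadline):
    """Branch-and-bound over task-to-processor assignments.

    task_times[t][p]: execution time (positive integer) of task t on processor p.
    proc_costs[p]: cost paid once if processor p runs at least one task.
    Returns (minimum cost, assignment list), or (inf, None) if infeasible.
    """
    n_tasks, n_procs = len(task_times), len(proc_costs)
    best = [float("inf"), None]

    def branch(t, loads, assign, cost):
        if cost >= best[0]:
            return  # bound: this partial solution cannot beat the incumbent
        if t == n_tasks:
            best[0], best[1] = cost, list(assign)
            return
        for p in range(n_procs):
            time = task_times[t][p]
            if loads[p] + time > deadline:
                continue  # processor p would miss the deadline
            extra = proc_costs[p] if loads[p] == 0 else 0
            loads[p] += time
            assign.append(p)
            branch(t + 1, loads, assign, cost + extra)
            assign.pop()
            loads[p] -= time

    branch(0, [0] * n_procs, [], 0)
    return best[0], best[1]

# Invented example: one fast, expensive processor and two slow, cheap ones.
costs = [5, 3, 3]
times = [[2, 4, 4], [2, 4, 4], [2, 4, 4]]
print(synthesize(times, costs, deadline=6))  # (5, [0, 0, 0])
print(synthesize(times, costs, deadline=4))  # (8, [0, 0, 1])
```

Tightening the deadline from 6 to 4 forces a second processor into the solution, which is the cost/performance behavior that a synthesis tool of this kind explores.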
JPEG Image Compression Design
Due to the bandwidth constraints imposed by both still and video image transmission, data compression is a key function. As a result, we chose to focus our design activity on a standard for still image compression, JPEG [21] , and to eventually expand the example to cover MPEG standards as well. We began with the synthesis of the DCT (Discrete Cosine Transform) function. The 2D-DCT was decomposed into repeated row-column 1D-DCTs prior to the application of the system-level tools. The 1D-DCT macro was synthesized first and used to construct a 2D-DCT, clearly a bottom-up step. SMASH was used to generate five schedules for a 1D-DCT macro from a behavioral VHDL description of the DCT described in the referenced article [8] . The module library used is shown in Table 1 . These datapath schedules with varying cost and performance are shown in Table 2 . SMASH also determined buffer-size and bandwidth requirement as shown in Table 2 .
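The row-column decomposition mentioned above can be checked numerically. The Python sketch below implements an orthonormal 1D DCT-II, builds a 2D-DCT from repeated 1D-DCTs on rows and then columns, and compares the result with a direct evaluation of the 2D double sum. The orthonormal scaling and the function names are assumptions chosen for illustration; the VHDL description used in the experiment may scale its coefficients differently.

```python
import math

def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence x."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)))
    return out

def dct_2d_row_column(block):
    """2D DCT built from 1D DCTs: first along every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [u][v] order

def dct_2d_direct(block):
    """Reference 2D DCT evaluated directly from the double sum."""
    n = len(block)
    def c(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    return [[c(u) * c(v) * sum(
                block[i][j]
                * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                * math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]

# Arbitrary 8 x 8 test block.
block = [[(i * 8 + j) % 7 for j in range(8)] for i in range(8)]
fast, ref = dct_2d_row_column(block), dct_2d_direct(block)
print(max(abs(fast[u][v] - ref[u][v]) for u in range(8) for v in range(8)))
```

For an 8 × 8 block the row-column form needs sixteen 8-point 1D-DCTs, which is why a single 1D-DCT macro can serve as the building block for the 2D-DCT.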
The 1D-DCT schedules were then processed by the ADAM tool MABAL [12] to generate the RTL datapath netlists. These netlists were analyzed to obtain the area characteristics of the datapath, as shown in Table 3. The area for functional units, multiplexers, and registers was determined from the netlists, and wiring area was estimated manually using a rule of thumb which we observed in our earlier experiments [16]. Figure 5 shows the input JPEG specification, the design flow used for this experiment, and the output of our system. It is important to note that the design flow has some bottom-up portions, which represent the flow between the application of each tool, each of which operates in essentially a top-down fashion. Thus the design flow is both top-down and bottom-up.
Figure 5: Design flow for still image compression system example (panel labels: Layout Generation using ChipCrafter; Three 2D-DCT Multiprocessor Architectures; 2D-DCT Layout (Chip 1))
datapath delay was used to compute the performance for each implementation with a two-phase non-overlapping clock. The quantizer performance and silicon area were also estimated, and the parameters used are comparable to those reported in the literature [8]. We used parameters of an existing chip for the Huffman coding [15]. After estimating the performance and silicon area of all the parts in the compression system, it was partitioned by ProPart. The data used for each function in the system is shown in Table 4a. The package library used for ProPart is derived from a commercial ASIC library and is shown in Table 4b.
Table notes:
2. Delay is in ns.
3. Cost is a function of area and pin capacities.
the second 1D-DCT design produced by SMASH. Note that ProPart placed the DCT and IDCT on separate dies, and lumped the remaining functions on a single die. Finally, we generated the layouts of the 1D-DCT macro and 2D-DCT chip using Cascade Design Automation's ChipCrafter (Figures 6 and 7), and analyzed the area distribution (Table 5).
in Table 6. Since we obtained the Huffman coding chip parameters from another source, they are only compared here to show that the parameters we are using are comparable to those in the literature. The die we did design, the DCT, has a somewhat larger die size than the industrial chips, but the performance was comparable. The technologies used by the industrial chips were not mentioned in [2], so we were not able to determine whether our
Table 6: Chip-set parameters
Architecture tradeoff study
To search the design space for a wider range of implementations, we also applied the SOS multiprocessor synthesis tool to the first stage of the 2D-DCT task flow graph shown in Figure 8. In this graph, tasks T1 ... T8 are 1D-DCTs which operate row-wise on the 8 × 8 array of pixels. Task T9 is a join-distribute operator, which indicates that the second set of 1D-DCTs cannot start until the first set is completed. Tasks T10 ... T17 operate column-wise on the results of tasks T1 ... T8. We assumed a macro-pipeline execution between the sets of tasks T1 ... T8 and T10 ... T17: while tasks T1 ... T8 are being performed for frame i + 1, tasks T10 ... T17 are being performed on the previous result of tasks T1 ... T8. The design space was searched for various performance constraints with the objective of minimizing the cost for the target architecture shown in Figure 9. Cost/performance parameters predicted from the RTL netlists for the 1D-DCT implementations (Table 3) were input to SOS, so that it could choose from all three 1D-DCT implementations.
Figure 9: Target architecture for SOS
We considered three possible execution-time constraints, where the execution time is defined as the time to compute T1 ... T8. All processors send results to the buffer memory over a common 8-pixel-wide bus. The sets of processors found by SOS for the various timing constraints are shown in Table 7. These results clearly show the cost/performance tradeoff at the architectural level.
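The shape of this tradeoff can be reproduced with a small search. The Python sketch below picks the cheapest multiset of 1D-DCT processors that can complete eight identical row tasks within a given execution-time constraint, and repeats the search for three constraints. It is only an illustration under strong assumptions: bus contention and buffer-memory transfers are ignored, the search is plain enumeration rather than the MILP model used by SOS, and the `(cost, delay)` pairs are invented rather than taken from Table 3 or Table 7.

```python
from itertools import product

def cheapest_processor_set(impls, n_tasks, deadline):
    """Cheapest count of each processor type that finishes n_tasks in time.

    impls: list of (cost, delay) per implementation; a processor with delay d
    completes deadline // d identical tasks back to back before the deadline.
    Returns (cost, counts per implementation), or None if infeasible.
    """
    best = None
    for counts in product(range(n_tasks + 1), repeat=len(impls)):
        capacity = sum(k * (deadline // delay)
                       for k, (_, delay) in zip(counts, impls))
        if capacity < n_tasks:
            continue  # this set cannot finish all tasks by the deadline
        cost = sum(k * c for k, (c, _) in zip(counts, impls))
        if best is None or cost < best[0]:
            best = (cost, counts)
    return best

# Invented (cost, delay) pairs for slow, medium, and fast implementations.
impls = [(10, 100), (16, 50), (30, 25)]
for deadline in (800, 200, 100):
    print(deadline, cheapest_processor_set(impls, n_tasks=8, deadline=deadline))
```

With these invented numbers the loosest constraint is met by a single slow processor, while the tightest one requires two fast processors at six times the cost, mirroring the kind of architectural tradeoff reported in Table 7.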
Conclusions
The entire design exercise took less than a week of time for three graduate students familiar with all the tools, and able to write VHDL descriptions rapidly. We found during the course of our tool usage that the tools tended to be used in a bottom-up fashion. We also found out that our software was not designed to take advantage of already designed macros like the 1D-DCT, so coding changes were
