We detail a method for partitioning a single application specified in synchronous dataflow (SDF) into multiple independently-synthesizable, communicating VHDL hardware modules. Either synchronous or asynchronous communication is allowed, and the clock timing and control are automatically generated. We show that this method guarantees the preservation of correct functional behavior as specified in the original SDF graph, and that many choices of partitioning into multiple hardware modules are possible. The ability to break up a larger application into smaller synthesizable hardware modules can lead to efficiencies in hardware synthesis, which is faster when performed on smaller VHDL specifications. We illustrate this new method with some practical example applications that have been constructed in Ptolemy.
INTRODUCTION
In this paper we describe a method for partitioning a single application specified in synchronous dataflow [1] into a multimodule hardware implementation. This method is one way of bridging the gap between algorithm design and implementation design in the domain of digital signal processing (DSP) applications. It has the advantage of providing the designer with a method for breaking up larger designs into smaller, more tractable units, while still enforcing communication rules that preserve the functional correctness of the intended algorithm. This method has been implemented and tested in the Ptolemy [2] environment.
In our work, the algorithm is specified in a form of dataflow graph known as synchronous dataflow (SDF). The semantics of SDF graphs are more restricted than other, more general dataflow semantics, but SDF is still able to represent a very broad range of applications. The semantics of SDF graphs are particularly useful for representing multirate DSP applications [3], a class of applications for which hardware synthesis is especially difficult. In addition, the fully static nature of the SDF model allows us to analyze the application in advance to determine the partial precedence ordering of all computations and communications. This results in faster execution at run time since there are no data-dependent decisions to be made. We will describe the SDF model in greater detail in Section 2.
The method discussed in this paper draws on techniques developed for software synthesis for DSP applications [4]. In particular, methods for parallel scheduling of SDF graphs on multiple processors [5][6] and techniques for interfacing heterogeneous code generation subsystems [7] have provided the main groundwork for the methods in this paper. The parallel scheduling techniques have been aimed at generating software for parallel execution on multiple communicating DSP processors. The emphasis of the code generation mechanism has been on combining multiple, heterogeneous subsystems into a single parallel architecture. In this paper, we start with the same algorithm representation, a dataflow graph, but rather than partitioning and scheduling that graph for execution on a homogeneous or heterogeneous parallel target architecture, we are actually generating the parallel implementation architecture that will execute our application. In this respect this technique is similar to hardware/software cosynthesis, where both the processor/instruction set and the software to be executed on the synthesized processor are generated [8][9]. Unlike that class of methods, we do not explicitly synthesize an instruction set, but rather a sequence of clock and control signals. We also do not synthesize any explicit block of memory, but instead individual registers with input, output, and clock signals. Further design analysis following the application of our technique could be performed to map these registers into reusable blocks of memory, but we do not attempt to deal with memory management in the current implementation of our method.
Previous work for synthesizing parallel hardware from dataflow specifications [10][11] has relied heavily on library-based instantiation of existing design units. While these methods exploit design re-use to some advantage, they are more restrictive in the partitioning of applications into parallel hardware and they do not offer strong guarantees that the parallel hardware generated will be functionally correct and deadlock-free. Our method, while relying on postprocessing through hardware synthesis, guarantees that the generated hardware description preserves all the data precedences implied by the initial dataflow specification, and it also uses the static analysis afforded by the SDF model to ensure that the parallel hardware modules will not deadlock with one another at any time during execution.
The goal of this method is to output a description of a parallel hardware structure for correctly executing the application specified by the initial dataflow graph. In our implementation this output representation is a description in VHDL [12] code. VHDL is a widely-used language for representing digital hardware and is popular as an input to tools that synthesize high-level hardware designs into lower-level gate and netlist representations as well as tools that perform design optimization through various re-synthesis techniques. Our first concern in this paper is generating a complete description of functionally correct hardware. We generate VHDL code that is acceptable as input to hardware synthesis and optimization tools in order to take advantage of the wide range of techniques that take register-transfer level (RTL) VHDL as their design input. Because these techniques typically require time exponential in the size of the input design [13], our method's support for resolving the design into smaller communicating design units can open the way to speeding up the process of synthesizing the final hardware implementation. Manually breaking up the design into smaller units is already a possibility with existing techniques, but the design time overhead and complexity of designing the necessary inter-module communication present a real drawback to doing this manually. Our technique deals with this issue by having the correct synthesis of the communication be an integral part of the method, rather than a secondary activity that is merely necessary to enable the partitioning.
In the remainder of this paper, we first review the SDF model of computation in Section 2. In Section 3 we describe our technique and the procedure for generating parallel hardware from an SDF graph representation. Following that we discuss the implementation of our technique in the Ptolemy environment in Section 4. In Section 5 we show a few example applications that have been demonstrated in Ptolemy, and we conclude in Section 6 with some suggestions for future extensions of this work.
DATAFLOW
Dataflow is a graphical representation for computations where actions are represented as nodes and communications are represented as arcs between nodes. The data that is communicated is represented as tokens that flow over the arcs between nodes, passing the output of one action to the input of another. Arcs are ordered communication channels that can be modeled as first-in, first-out queues with no upper limit on their size. No actor node may perform a computation until sufficient input data is available, a restriction that makes the functionality of the overall graph determinate. This means that any implementation of the dataflow graph that still obeys the basic precedences implied by the original graph will be functionally correct. There are usually many total orderings of firings that satisfy the partial ordering of the precedence graph.
Synchronous dataflow (SDF) is a specialization of dataflow in which each node produces and consumes constant numbers of tokens on each of its input and output arcs at each firing. This property makes it possible to analyze any SDF graph to determine whether it can be executed without deadlock. It also allows such a graph to be analyzed to determine whether the numbers of tokens produced and consumed on each arc will be balanced in the long term, so that the graph can be executed indefinitely without building up unbounded numbers of tokens on any arc. Graphs that are both deadlock-free and balanced are said to be consistent.
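The balance condition described above can be checked mechanically. The following is a minimal Python sketch, not the analysis used in Ptolemy; the function name `repetitions`, the arc tuple layout, and the three-actor example graph are assumptions made for illustration. It assumes a connected graph and tests only the balance condition; deadlock freedom must be checked separately.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+


def repetitions(actors, arcs):
    """Solve the SDF balance equations for a connected graph.

    arcs is a list of (src, produced, dst, consumed) tuples (hypothetical
    layout). Returns the smallest positive integer number of firings per
    actor for one balanced cycle, or None if the graph is unbalanced.
    """
    # Fix the first actor's rate at 1 and propagate relative rates
    # across arcs until every reachable actor has a rate.
    rate = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        a = pending.pop()
        for src, p, dst, c in arcs:
            if src == a and dst not in rate:
                rate[dst] = rate[a] * p / c
                pending.append(dst)
            elif dst == a and src not in rate:
                rate[src] = rate[a] * c / p
                pending.append(src)
    # Every arc must satisfy: firings(src) * produced == firings(dst) * consumed.
    for src, p, dst, c in arcs:
        if rate[src] * p != rate[dst] * c:
            return None
    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * scale) for a, r in rate.items()}


# Example: A produces 2 tokens per firing, B consumes 3 and produces 1,
# C consumes 2.
print(repetitions(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)]))
```

For the example graph, three firings of A, two of B, and one of C return every arc to its initial token count, which is the condition for executing the graph indefinitely with bounded memory.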
Any consistent SDF graph can be expanded into a directed acyclic precedence graph (DAG) that has nodes that represent individual firings of actors in the SDF graph and arcs that represent individual data precedences between firings in the DAG. Such a DAG represents a complete cycle of computations of the SDF graph that can be repeated indefinitely without deadlocking or growing in its memory requirements. It is from this DAG that we will be generating our parallel hardware implementation of the SDF computation. 
PARALLEL HARDWARE GENERATION
The basic elements of an SDF graph are nodes, arcs, and data tokens. To be operationally useful, a schedule of firings of nodes is also necessary. Similarly, the basic elements of a parallel hardware implementation are functional execution units, signal lines, and memory registers. To coordinate the execution of the hardware, a control and clocking structure is needed. The generation of each of these elements will be discussed in the remainder of this section.
In our method we begin with a consistent SDF graph representing the computation we wish to implement in parallel hardware. From that SDF graph and a valid sequential schedule indicating which nodes to fire in which order, we can construct the DAG, which will show all individual units of computation and the data precedences between them. This is done by firing the schedule in order and constructing the nodes of the DAG as each firing takes place. When a given firing occurs, the input data that it requires is noted and the source of that data is noted. The data that the firing produces is also noted so that it can be passed to downstream firings that require that data. In this way, only the actual data precedences are used to establish which firings must take place before which others. More than one downstream firing may require data generated by the same upstream firing. Constructing the DAG in this way serves to identify data tokens that are shared inputs to multiple firings.
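The construction just described can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Ptolemy implementation: the name `build_dag`, the arc tuple layout, and the example schedule are hypothetical, and initial tokens (delays) on arcs are not modelled.

```python
from collections import deque


def build_dag(arcs, schedule):
    """Expand a sequential schedule into a precedence DAG.

    arcs: list of (src, produced, dst, consumed) tuples (hypothetical layout).
    schedule: list of actor names in firing order.
    Returns (firings, edges): firings are (actor, index) pairs in schedule
    order; edges are (producer_firing, consumer_firing) pairs.
    """
    # One FIFO per arc; each token is labelled with the firing that made it.
    queues = [deque() for _ in arcs]
    count = {}
    firings, edges = [], set()
    for actor in schedule:
        k = count.get(actor, 0)
        count[actor] = k + 1
        firing = (actor, k)
        firings.append(firing)
        # Consume inputs, noting the source firing of every token used.
        for q, (src, p, dst, c) in zip(queues, arcs):
            if dst == actor:
                if len(q) < c:
                    raise ValueError(f"schedule fires {firing} without enough input")
                for _ in range(c):
                    edges.add((q.popleft(), firing))
        # Produce outputs for downstream firings.
        for q, (src, p, dst, c) in zip(queues, arcs):
            if src == actor:
                q.extend([firing] * p)
    return firings, edges


arcs = [("A", 2, "B", 3), ("B", 1, "C", 2)]
firings, edges = build_dag(arcs, ["A", "A", "B", "A", "B", "C"])
for edge in sorted(edges):
    print(edge)
```

In this example the second firing of A supplies tokens to both firings of B, so it appears as the source of two precedence edges. This is the shared-input situation noted above.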
Once each firing is noted and its inputs and outputs are known, it can be translated into an equivalent element of hardware. In order to do this, there must be available some kind of equivalent hardware representation for the computation of one firing of the node. It can be a pre-defined structure of combinational logic or a block of register transfer level (RTL) code that can be translated into a logic structure. In our case, we use a meta-syntactic representation, which gets translated into a block of RTL VHDL code. Based on the input and output data from the firing, the VHDL code block is instantiated and connected to signals that carry the input and output data for that firing.
As each firing in the schedule is constructed, it is noted in a list from which the hardware controller will be generated. On the outputs of each firing hardware element, the output signals are latched using registers. The clock that actuates those registers is generated by the controller in the correct firing order from the original schedule. In this way, output data from firings is latched at the conclusion of each firing. The time between clocks will be determined by the latency of the firing hardware elements once they have been synthesized at the gate level.
So far we have been describing the construction of a single hardware structure from the DAG, but we are ultimately interested in generating a parallel hardware implementation. A key property of the DAG is that it is acyclic, so that firings in the DAG depend only on previous firings, never on downstream firings. Because of this property, a partitioning of the DAG onto two or more hardware units will not introduce deadlock if the firings and communication operations on each hardware unit are executed in the same order as they appear in the DAG. This means that no firing should be scheduled for execution before all the other firings on which it depends, even indirectly, have been scheduled. If an arc in the DAG is split across two hardware units, then the source firing for that arc should be executed on the source hardware unit, followed by the send operation for that arc. On the receiving hardware unit, the receive operation should be executed first, which must block until the data is received. Once the data is received, the sink firing for the split DAG arc can be executed. In this way, the precedence constraints are preserved, even across communication links in the parallel implementation, and deadlock is not introduced.
Therefore, to construct a parallel implementation we need to designate which firings should be executed on which hardware unit. Once that mapping is decided, then the data precedences implied by the DAG will determine the order in which firings and communications between hardware units should take place. When the final parallel system is executed, a hardware unit that is waiting for input data from another must block its execution until that data becomes available. Since the DAG is acyclic, there will always be at least one hardware unit that is not waiting for data and is executing.
Before synthesis we do not know exactly how long individual firings will require to complete their execution. Because of this, we cannot know the absolute times when communication actions will need to take place. At first we only know in which order the communications should take place, based on the DAG precedences. This initially puts us in the position of only being able to specify asynchronous communication for our parallel hardware implementations.
For a synchronously-communicating implementation, we need both the firing order and the firing durations, as well as the amount of time required for communication operations. If we had an upper bound on the latency of signals propagating through any firing element, then we could set the system clock to have that clock period, and the communication times would follow from the sequence of firings that each hardware unit performs. Assuming dedicated communication channels, each communication could take place at the earliest time after the source of the data had finished computing and the destination of the data was ready to receive it. If we refine our hardware implementation with detailed knowledge of the latency of each firing element, then we can schedule the communications in terms of counts of a single system clock. One firing may take N clock cycles to complete, after which its output registers will be latched and downstream firings or communications to other hardware units can take place.
In our current implementation of this method in Ptolemy, we generate asynchronously-communicating hardware units only, but with the addition of annotations of firing latencies from synthesis, we will be able to generate synchronously-communicating implementations also. The full implementation in Ptolemy is described in the next section.
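The ordering rule for split arcs (source firing, then send; blocking receive, then sink firing) can be illustrated with a small Python sketch. Everything named here is an assumption for illustration: the functions `partition` and `run`, the units `P0` and `P1`, and the six-firing example DAG are hypothetical. The sketch further assumes one dedicated channel per split DAG arc and a non-blocking send; it models asynchronous communication only and ignores firing latencies.

```python
def partition(firings, edges, unit_of):
    """Build one ordered operation list per hardware unit.

    firings: (actor, index) pairs in DAG (schedule) order.
    edges: set of (producer_firing, consumer_firing) pairs.
    unit_of: maps an actor name to its hardware unit (the manual partition).
    """
    programs = {u: [] for u in sorted(set(unit_of.values()))}
    for f in firings:
        u = unit_of[f[0]]
        # Blocking receives for inputs produced on another unit come first.
        for src in firings:
            if (src, f) in edges and unit_of[src[0]] != u:
                programs[u].append(("recv", src, f))
        programs[u].append(("fire", f))
        # Sends follow the firing that produced the data.
        for dst in firings:
            if (f, dst) in edges and unit_of[dst[0]] != u:
                programs[u].append(("send", f, dst))
    return programs


def run(programs):
    """Execute all units; a recv blocks until the matching send has happened."""
    pc = {u: 0 for u in programs}
    sent, trace = set(), []
    progress = True
    while progress:
        progress = False
        for u, prog in programs.items():
            while pc[u] < len(prog):
                op = prog[pc[u]]
                if op[0] == "recv" and (op[1], op[2]) not in sent:
                    break  # this unit is blocked waiting for data
                if op[0] == "send":
                    sent.add((op[1], op[2]))
                elif op[0] == "fire":
                    trace.append(op[1])
                pc[u] += 1
                progress = True
    if any(pc[u] < len(prog) for u, prog in programs.items()):
        raise RuntimeError("deadlock")
    return trace


# Hypothetical DAG: three firings of A, two of B, one of C.
firings = [("A", 0), ("A", 1), ("B", 0), ("A", 2), ("B", 1), ("C", 0)]
edges = {
    (("A", 0), ("B", 0)), (("A", 1), ("B", 0)),
    (("A", 1), ("B", 1)), (("A", 2), ("B", 1)),
    (("B", 0), ("C", 0)), (("B", 1), ("C", 0)),
}
programs = partition(firings, edges, {"A": "P0", "B": "P1", "C": "P0"})
trace = run(programs)
print(trace)
```

Because each unit keeps its operations in DAG order, the run completes without deadlock even though `P0` must wait for both results from `P1` before it can fire C.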
IMPLEMENTATION IN PTOLEMY
Ptolemy is a software system for the simulation and prototyping of heterogeneous systems. It supports many models of computation, including dataflow and discrete-event. In the area of prototyping, a number of facilities for code generation are supported. Among these is the VHDL domain for generating VHDL code from systems with SDF semantics.
The VHDL domain can be used for generating sequential VHDL code that uses no signals, only variables, and describes the entire application in a single process within a single entity. Another option is to generate structural VHDL code that places the actions of individual firings in separate VHDL entities that communicate through signals.
Either of these styles of code can be simulated using commercial VHDL simulation tools. One of the simulators supported in Ptolemy is the VHDL System Simulator (VSS) from Synopsys [14]. Other simulators could be used with some additional programming by a user knowledgeable in C++. In addition to simulating the generated VHDL code by itself, there is support for specifying heterogeneous systems that include portions in VHDL, as well as in C and assembly code for the Motorola 56000 DSP. Code for each of these elements can be generated and simulated together using facilities for automatic interface generation between these diverse execution elements. Another facility for circuit synthesis from structural VHDL code is also supported using the Design Compiler from Synopsys.
To implement our method in Ptolemy, we use structural VHDL code generation, where we add in the necessary registers between firing elements, with control and clock signals to trigger the execution of the firings in the correct order. The system is manually partitioned by the user, who designates for each block the hardware unit on which that block's firings should be executed. Each VHDL subsystem is under the control of the CompileCGSubsystems target, a software task that manages the generation of separate entities for each hardware unit and orchestrates the communication between them during simulation.
Following validation of the partitioned system under simulation, the VHDL descriptions of the separate hardware units are passed to synthesis in order to obtain a netlist structure to implement the generated hardware units. Through this synthesis process an optimized architecture is obtained. If changing the initial partitioning is needed, the user can adjust the partitioning manually and repeat the simulation and synthesis steps again until the desired improvements in results are obtained.
EXAMPLE APPLICATIONS
The method discussed in this paper was run on a small set of designs, two of which were tested for synthesis results. The first was a system containing two FIR filters operating on parallel streams of data. The second was a two-stage perfect reconstruction filterbank, where the analysis and synthesis stages each contained two FIR filters as well as up/downsample and addition operations. Each of the designs was input to Synopsys Design Analyzer, both as a single-processor system and as a two-processor system generated from the VHDL domain in Ptolemy. For the parallel filtering system, synthesis took 125 seconds on a SparcStation 20 running Solaris, and the resulting design contained 1,733 gates. When the system was partitioned in two, the total time for synthesizing both halves, one after the other, was 138 seconds and the gate count was 1,788. Partitioning this design gave no advantage in synthesis time or in the size of the synthesis result, but the inter-processor communication was generated automatically with nearly no additional design time. For the filterbank, a single-processor design generated from Ptolemy took 520 seconds to pass through synthesis, with a gate count of 6,249. When the system was partitioned into a two-processor design, the total synthesis time was only 452 seconds with a gate count of only 5,834. In this case, breaking a complex multirate design into multiple, smaller designs yielded savings in synthesis time and in the size of the resulting netlist. These results are promising, but much more careful and thorough testing is required to evaluate the impact of this design method on circuit synthesis results for a broader range of designs.
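The relative changes implied by these figures can be worked out directly. The short Python sketch below uses only the synthesis times and gate counts reported above; the helper name `pct_change` and the dictionary layout are illustrative.

```python
# (synthesis time in seconds, gate count) as reported in the text
results = {
    "parallel FIR": {"single": (125, 1733), "partitioned": (138, 1788)},
    "filterbank": {"single": (520, 6249), "partitioned": (452, 5834)},
}


def pct_change(before, after):
    """Percentage change from before to after, rounded to one decimal place."""
    return round(100.0 * (after - before) / before, 1)


for name, r in results.items():
    (t0, g0), (t1, g1) = r["single"], r["partitioned"]
    print(f"{name}: time {pct_change(t0, t1):+.1f}%, gates {pct_change(g0, g1):+.1f}%")
```

Partitioning the parallel FIR system increased synthesis time by about 10.4% and gate count by about 3.2%, while partitioning the filterbank reduced synthesis time by about 13.1% and gate count by about 6.6%.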
CONCLUSIONS
In this paper, we have introduced a method for generating parallel hardware systems from applications specified as dataflow graphs. The complexity of managing the inter-module communication is automatically handled by the method. We have shown that this method can reduce the time required to perform circuit synthesis on the generated design while not incurring a significant design time penalty in working out the communication between separated modules.
We plan on extending this work so that pipelined hardware units can be generated as easily as the current sequential hardware units while still preserving functional correctness. We will also explore the reuse of functional blocks within hardware units for the purpose of allowing area-speed tradeoffs and supporting a wider range of design choices to the user of this method. Another possibility is extending this method to work with broader models of dataflow that have some degree of data-dependent control, such as boolean dataflow and dynamic dataflow.
