Our approach to digital system simulation compiles a high-level system model into a high-performance simulator that consists of software and hardware components. The target architecture for the simulation compiler is a tightly coupled processor and field-programmable gate array. We describe the simulation compiler and show how it can be used to improve simulation performance by up to a factor of two over an all-software simulator.
more abstract to manage the complexity. Simulation techniques must keep up with these upward shifts in the abstraction level to ensure that systems designers, who use simulation extensively, can efficiently evaluate the performance and correctness of their ideas as early as possible in the design process.
Recently, a trend has developed toward specifying systems using hardware description languages (HDLs). This trend is partly due to the widespread acceptance of logic synthesis tools and, to a lesser extent, high-level synthesis tools. When coupled with the much greater demand for simulation performance resulting from the increase in system complexity, this trend motivates the need for new simulation techniques that are optimized to simulate high-level system models as efficiently and as economically as possible.
We describe and evaluate a simulation approach that converts an HDL model into a high-performance simulator consisting of tightly coupled software and hardware components that execute on a processor and field-programmable gate array (FPGA) architecture. Our approach uses compiled-code software simulation, accurate performance estimation, logic synthesis, and software-hardware partitioning and scheduling to generate these components.
Rachid Helaihel Jeremy Levitt

Ricardo Ramirez
Stanford University
A software-hardware simulator
For a software-hardware simulation of HDL models, the system architecture must be a good target for compiled HDL models. Also, the simulation compiler must be optimized to make the HDL models simulate efficiently on the target architecture.
Simulation architecture. Figure 1 shows a target architecture composed of a processor and one or more FPGA chips connected by the same bus as the processor cache. This implementation detail is important because it means that the communication latency and bandwidth between the processor and cache and between the processor and FPGA chips are comparable. The FPGA chips accelerate HDL simulation by simulating certain parts of the model faster than they could be simulated using a general-purpose CPU. Control sections of HDL models typically fall into this class. In addition, the parallel execution among the CPU-based and FPGA-based parts of the HDL model accelerates simulation even further.
Though the primary purpose of FPGAs is to accelerate simulation, designers can also use the simulation architecture as the control and data manipulation processor in an embedded system. In such a system the FPGAs might also serve as the interface logic to sensors and actuators. For certain applications, this system might provide sufficient performance that the simulator itself could replace a custom hardware implementation.

Simulation compiler. Figure 2 provides an overview of a simulation compiler that takes an HDL system specification as input and converts it into an intermediate form represented by a hybrid abstract-syntax tree and a dataflow graph. The compiler analyzes the synchronization and data dependencies in the intermediate form to extract the concurrency in the specification. This information is used to generate software-only and mostly hardware versions of the simulator.
By accurately estimating the performance of these two versions of the simulator, we can determine the execution time of a particular part of the HDL model in hardware or in software. This data provides the basis for partitioning the model between the CPU and the FPGA. The partitioning and scheduling algorithms attempt to maximize performance with a given number of FPGA chips.
One of the key features of the simulation compiler is that it can produce efficient simulators for a processor only, or for a processor and FPGA; furthermore, the compiler optimizes the simulator for exactly the number of FPGA chips specified. By varying the number of FPGA chips, designers can change the cost and performance of the simulator.
Software compilation
To generate software, a Verilog-to-C compiler (VCC) compiles a Verilog [1] HDL model into a C program. Verilog, designed for fast event-driven simulation, contains constructs that have no clear analogs in hardware or procedural programming languages. By restricting the Verilog programs that VCC will accept to those that describe synchronous digital systems, we guarantee that VCC can compile the Verilog program.
The output of VCC is a statically scheduled C program that has the same behavior as the Verilog model running on an event-driven Verilog simulator but achieves much higher simulation performance. Previous work in compiled-code simulation was based on input descriptions that were gate-level netlist representations of circuits [2,3] or register-transfer level models in a simple description language [4]. VCC advances the work in compiled-code simulation because it generates a compiled-code simulator from Verilog. Compiling Verilog programs requires a significant amount of analysis that is not required for input descriptions with simple execution semantics.
The analysis performed during the VCC compilation process requires three main steps:
1) Dataflow graph construction parses the Verilog program and produces a dataflow graph that is combined with an abstract syntax tree.
2) Static scheduling analyzes the dataflow graph and generates an execution schedule for the graph that preserves the semantics of the Verilog program.
3) C code generation produces C code.
Dataflow graph construction. The input to VCC is a synchronous Verilog description. Verilog programs are composed of modules and have a syntax that is similar to C but semantics that are very different. These modules may be instantiated inside other modules to create a hierarchy that represents the structure of the hardware system. In our Verilog programs, modules contain two types of concurrent process statements, or concurrent blocks: always blocks and continuous assignments. An always block is a group of statements with sequential semantics that executes whenever an event occurs that is in the block's activation list. A continuous assignment contains a left-hand side that continuously reflects the current state of the variables on the right-hand side.
Related work
Previous work in the area of software-hardware cosynthesis has focused on the design of embedded systems. Here, we discuss only those cosynthesis approaches that start with a single program specification of the system and perform the partitioning between software and hardware automatically.

Cosyma system. This approach [1] uses a superset of C called C^x to describe the functionality of the system to be synthesized. C^x extends C with timing constraints and tasks. The target architecture of the Cosyma system is a microcontroller and coprocessor that communicate via a shared memory. Cosyma maps a C^x description onto the target architecture by assigning basic blocks of the description to run in software on the processor or in hardware on the coprocessor. The partitioning is based on a cost function that includes the estimated benefits of moving a particular block from software to hardware and the estimated communication cost. A simulated annealing algorithm performs the actual partitioning. The advantage of the Cosyma approach is that the use of a general-purpose programming language makes it easy to describe complex systems. However, because the approach does not currently overlap the execution of the microcontroller and the coprocessor, it does not exploit one of the main performance enhancements possible with a coprocessor. Furthermore, the use of estimates of software performance does not accurately account for the effects of an optimizing compiler on the performance of the software. This results in software-hardware systems with poor speedup and sometimes a slowdown relative to the software-only system.

Gupta and De Micheli. This approach [2] is similar to the one we present. It uses an HDL called HardwareC to specify the system description. HardwareC includes program constructs for specifying delay and execution rate constraints as well as constructs for explicitly expressing concurrency. The goal of the system is to reduce the cost of implementing a system using an ASIC by combining an off-the-shelf microprocessor and an ASIC. The automatic partitioning begins with an initial partition in which all program constructs with unspecified delays are placed in software and the rest in hardware. An iterative improvement algorithm moves operations from hardware to software to reduce the cost of the system while meeting the delay and rate constraints. As in the Cosyma system, the performance of the software is only estimated; no actual code is executed. This sort of software performance estimation makes it difficult to get an accurate measure of software performance and does not include the effects of optimizing compiler technology.
VCC creates the dataflow graph by parsing the Verilog program, flattening the hierarchy, and building a directed graph whose vertices represent variables, always blocks, or continuous assignments. The edges of the graph represent uses or definitions of the variables. A directed edge runs from a variable vertex to a concurrent block vertex for each variable that appears on the right-hand side of a continuous assignment or in the activation list of an always block. Another directed edge runs from each concurrent block vertex to each variable vertex that is assigned in that concurrent block.
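The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Block` record, its `reads` and `writes` fields, and the `("var", name)` / `("block", name)` vertex encoding are hypothetical and are not taken from VCC.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a flattened always block or continuous assignment."""
    name: str
    reads: tuple   # variables on the right-hand side or in the activation list
    writes: tuple  # variables assigned inside the block


def build_dataflow_graph(blocks):
    """Return {vertex: set of successor vertices} over variable and block vertices."""
    graph = defaultdict(set)
    for blk in blocks:
        graph[("block", blk.name)]  # register the block even if it has no edges
        for var in blk.reads:
            graph[("var", var)].add(("block", blk.name))  # use edge
        for var in blk.writes:
            graph[("block", blk.name)].add(("var", var))  # definition edge
            graph[("var", var)]  # register the variable vertex
    return dict(graph)


# A two-block model: a clock generator and a state machine triggered by it.
blocks = [
    Block("clock", reads=(), writes=("clk",)),
    Block("state_machine", reads=("clk",), writes=("state",)),
]
graph = build_dataflow_graph(blocks)
```

Keeping variables and blocks as separate vertex kinds makes the use and definition edges explicit, which is what the later scheduling step relies on.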
Static scheduling. To make static scheduling of the concurrent blocks possible, the compiler converts the potentially cyclic dataflow graph into a directed acyclic graph (DAG) by breaking the feedback loops in the dataflow graph at the clock signals. Clock signals are identified as the variables that appear in a special always block named clock. This clock block also defines the clock signal transitions and is the only block that can contain delay statements. Every always block in the program depends upon at least one clock signal transition. The always block vertices and continuous assignment vertices associated with a transition must form a DAG in the dataflow graph. If they do not, no static schedule of blocks is feasible, and VCC rejects the Verilog program.
To generate a schedule for the process statements, the compiler first selects the first clock transition in the clock block. It propagates the effect of this transition on other signals by constant propagation of the transition to all the dependent blocks in the dataflow graph [5]. The always blocks and continuous assignments that are fired directly or indirectly by the transition are then scheduled by topologically sorting the dataflow graph [6]. This process repeats for all the clock transitions in the clock block.
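A minimal sketch of the scheduling step for one clock transition follows, assuming the dataflow graph has already been collapsed to block-to-block edges. The function names and the `succ` dictionary are illustrative assumptions, and the constant propagation step is omitted; the sketch only collects the blocks fired by a transition and sorts them topologically with Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def fired_blocks(succ, transition):
    """Collect every block fired directly or indirectly by a clock transition."""
    seen, stack = set(), list(succ.get(transition, ()))
    while stack:
        blk = stack.pop()
        if blk not in seen:
            seen.add(blk)
            stack.extend(succ.get(blk, ()))
    return seen


def static_schedule(succ, transition):
    """Return the fired blocks in an order that respects the dataflow edges.

    Raises graphlib.CycleError if the fired blocks do not form a DAG, which
    corresponds to the case where no static schedule is feasible.
    """
    fired = fired_blocks(succ, transition)
    preds = {blk: set() for blk in fired}
    for src in fired:
        for dst in succ.get(src, ()):
            preds[dst].add(src)
    return list(TopologicalSorter(preds).static_order())


# Hypothetical block-level graph for a single clock transition.
succ = {
    "posedge clk": ["fsm"],
    "fsm": ["next_state", "outputs"],
    "next_state": ["outputs"],
}
schedule = static_schedule(succ, "posedge clk")
```

Repeating `static_schedule` for each transition in the clock block would give one ordered list per transition.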
C code generation. VCC attempts to generate C code that requires as little runtime as possible. To achieve this, it generates C code in a form that the C compiler can optimize. VCC itself performs the optimizations that require specific knowledge of Verilog semantics, such as packing multibit signals into a single machine word and eliminating bit-field selection. Standard C compiler optimizations like constant propagation, dead-code elimination, and common subexpression elimination are left for the C compiler. However, VCC does perform a substantial amount of copy propagation to eliminate redundant assignments resulting from the structural hierarchy in the model. VCC generates the C code by emitting the code for a clock transition followed by the code for each of the concurrent blocks that appear in the static schedule for that clock transition. This process repeats until code has been generated for all the clock transitions.
A simple example. Figure 3 shows a simple Verilog description of a clocked state machine. It has a single module called top, which contains an initial block that initializes the variable state and two always blocks. The special clock always block defines the clock signal phi1, and the positive edge of phi1 activates the state-machine always block. Figure 4 shows the dataflow graph that corresponds to the state machine description. The clock always block defines the phi1 variable, which is used by the state-machine always block, while the state-machine always block defines the state variable. Since this dataflow graph is already a DAG, scheduling the evaluation of the vertices is straightforward. Figure 5 shows the C code that results from the scheduling and code generation phases.
Software performance estimation. The performance of the all-software simulator is impressive; our example Verilog models have achieved speedups of between 150 and 300 times over Verilog-XL 1.6 [7]. Experience with this simulator indicates that other software optimizations will further improve Verilog simulation performance; however, we focus here on the additional performance improvements that are possible using software-hardware cosynthesis.
The granularity for partitioning into software or hardware is a block of Verilog code that does not contain event control. This ensures that a block assigned to hardware will execute to completion without requiring intermediate communication with blocks running in software. Event control-free blocks include continuous assignment blocks and simple always blocks. Complex always blocks with nested event control are analyzed and split into simple event control-free blocks. The simulation compiler's software estimation step determines the average and the shortest execution times of the C code associated with each block. This requires a careful analysis of the object code produced by the C compiler for the C statements associated with each Verilog block. To make this feasible, the C compiler cannot interleave statements from different blocks. This restricts the scope of its optimizations. In practice, most C compiler optimizations are unaffected, and the resulting code executes almost as fast as the fully optimized code.
Profiling of the software simulator to estimate the average execution time of the statements associated with each Verilog block is more accurate than static estimation. Profiling is more accurate because it captures the dynamic frequencies with which branches are taken in the code, especially if the behavior of the model with profile input accurately reflects the real behavior of the model. If inefficient profiling techniques are used, information collection could take a very long time.
The profiling technique used to obtain the average software execution time relies on the object code annotation tool called Pixie [8]. This tool captures the dynamic execution frequencies for each basic block in the object code. By analyzing the execution time of each basic block on the target processor architecture and multiplying this time by the execution frequency of the block, Pixie provides an exact count of the number of cycles executed by each basic block. This very efficient profiling technique slows program execution by only a factor of four. It also accurately accounts for pipeline stalls, though it does not account for the performance of the memory hierarchy. Our simulation compiler calculates the software execution time of each Verilog block by adding the execution cycles for all the basic blocks in the object code that are associated with that block.
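The arithmetic behind this estimate is a weighted sum. The sketch below shows it under assumptions: the tuple layout, the block names, the sample numbers, and the separate `activations` count used to turn totals into averages are all hypothetical, and nothing here reproduces Pixie's actual output format.

```python
def block_cycles(basic_blocks):
    """Sum cycles per Verilog block.

    basic_blocks: iterable of (verilog_block, cycles_per_execution, execution_count),
    one entry per basic block in the object code.
    """
    totals = {}
    for owner, cycles, count in basic_blocks:
        totals[owner] = totals.get(owner, 0) + cycles * count
    return totals


def average_cycles(totals, activations):
    """Average cycles per activation of each Verilog block."""
    return {blk: totals[blk] / activations[blk] for blk in totals}


# Made-up profile: two basic blocks belong to "alu", one to "decode".
profile = [("alu", 5, 1000), ("alu", 3, 400), ("decode", 4, 1000)]
totals = block_cycles(profile)
averages = average_cycles(totals, {"alu": 1000, "decode": 1000})
```

With these sample numbers, "alu" accumulates 5 x 1000 + 3 x 400 = 6200 cycles, or 6.2 cycles per activation.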
To estimate memory-hierarchy performance, Pixie generates an address trace, which is then used to drive a cache simulator that models the processor memory-hierarchy performance. However, unlike CPU performance, it is almost impossible to assign the time spent in the memory hierarchy to individual blocks. Fortunately, in our experiments, performance losses in the memory hierarchy were small. We suspect that with much larger models the memory hierarchy could have a significant effect on software performance.
To determine the shortest execution time of a block, the simulation compiler analyzes the object code produced for each block. It determines the shortest path length through a block by assuming that all instructions execute in one cycle and by finding the sequence of taken or untaken branches that minimizes the number of instructions executed in the block.
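One way to carry out this shortest-path analysis is Dijkstra's algorithm over the control flow graph of a block, with each basic block weighted by its instruction count. The sketch below is an assumption about how such an analysis could be written, not the compiler's actual code; the `cfg`, `sizes`, `entry`, and `exits` names are illustrative.

```python
import heapq


def shortest_cycles(cfg, sizes, entry, exits):
    """Fewest instructions executed on any path from entry to an exit.

    cfg: basic block -> list of successor basic blocks
    sizes: basic block -> instruction count (one cycle per instruction)
    """
    best = {entry: sizes[entry]}
    heap = [(sizes[entry], entry)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        if node in exits:
            return cost  # first exit popped is the cheapest one
        for nxt in cfg.get(node, ()):
            new_cost = cost + sizes[nxt]
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    raise ValueError("no exit is reachable from the entry block")


# Hypothetical if/else: the else branch is much shorter than the then branch.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"]}
sizes = {"entry": 3, "then": 10, "else": 2, "exit": 1}
shortest = shortest_cycles(cfg, sizes, "entry", {"exit"})
```

Here the shortest path takes the else branch, giving 3 + 2 + 1 = 6 cycles.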
Hardware synthesis
FPGAs are an ideal implementation medium for hardware simulation because they allow the design to be changed easily. The Xilinx FPGAs used in our experiments consist of an array of configurable logic blocks (CLBs) that are interconnected by a hierarchy of routing channels and surrounded by a perimeter of programmable input/output blocks [9].
The processor-FPGA interface. With current FPGA densities, it is possible to emulate complex logic designs on multiple FPGA chips; however, many of these implementations cannot fully use the FPGA gates because there are insufficient I/O pins [10]. To overcome this problem, we designed a memory-style interface in which the FPGA chips communicate with the CPU over common address and data buses. Figure 6 shows a block diagram of the interface between the functional units in the FPGA and the CPU. The FPGA is organized as a register file that feeds operands to independent functional units, each of which implements the operation of a particular block. Because the functional units are independent, once the operands have been loaded, all functional units can operate in parallel. Even though the data bus is 32 bits wide, the number of bits in each register corresponds to the width of the input operand it contains. In addition, functional units with common input operands can share registers. The variable register bit widths and register sharing among functional units reduce the FPGA resources needed to implement this organization.
To activate a particular functional unit, the CPU places the data on the data bus, then places the address of the register on the address bus. The decoder in the FPGA will turn the internal data bus into an input bus and enable the addressed register to latch the data on the bus. After the functional unit executes, the CPU can read the output by applying the address of the appropriate output. The decoder enables the addressed tristate bus and drives the output from the internal bus onto the data bus. Since the FPGA is based on SRAM technology, we assume that the time it takes to read or write one of the registers in the FPGA is on the same order as an SRAM chip access time. With a 50-MHz CPU, we allot one clock cycle for each CPU-to-FPGA access.
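From the CPU's point of view, this interface behaves like a small memory-mapped register file. The toy model below illustrates that view only: the class name, the addresses, and the two functional units are invented for the example, and the model evaluates a unit at read time rather than modeling its hardware delay.

```python
class FpgaRegisterFile:
    """Toy software model of the memory-style CPU-to-FPGA interface."""

    def __init__(self, input_widths, units):
        self.input_widths = input_widths  # register address -> operand bit width
        self.units = units                # output address -> (function, operand addresses)
        self.regs = {addr: 0 for addr in input_widths}
        self.cycles = 0                   # one cycle per CPU-to-FPGA access

    def write(self, addr, value):
        """Latch a value into an input register, truncated to the register width."""
        mask = (1 << self.input_widths[addr]) - 1
        self.regs[addr] = value & mask
        self.cycles += 1

    def read(self, addr):
        """Read the output of the functional unit mapped at this address."""
        func, operands = self.units[addr]
        self.cycles += 1
        return func(*(self.regs[a] for a in operands))


# Two hypothetical functional units sharing the same pair of 4-bit input registers.
fpga = FpgaRegisterFile(
    input_widths={0x0: 4, 0x1: 4},
    units={
        0x10: (lambda a, b: a + b, (0x0, 0x1)),
        0x11: (lambda a, b: a & b, (0x0, 0x1)),
    },
)
fpga.write(0x0, 0x1F)  # only the low 4 bits are kept
fpga.write(0x1, 0x3)
total = fpga.read(0x10)
both = fpga.read(0x11)
```

The shared input registers show why register sharing saves accesses: two writes feed both units, so four accesses produce two results.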
Synthesis tools. We used logic synthesis tools from Synopsys to convert Verilog blocks into FPGA hardware [11]. The input description to the synthesizer for each block includes the interface logic that will be connected to it on the FPGA. This ensures that the area and critical path delay estimates reported by the synthesizer represent the true costs of placing the block in an FPGA chip.
The synthesis tools cannot synthesize every block. The synthesizer translates register transfer level descriptions into gate level descriptions, but cannot handle certain behavioral Verilog statements. Our Verilog models are intended for simulation and consist mainly of behavioral code; inevitably, the synthesizer cannot find a gate level description for certain blocks. Although a high-level synthesizer would be less restrictive in the types of Verilog statements it could translate, it too would not be able to completely translate all Verilog programs to hardware. The reason is that Verilog can be used to describe systems that do not have any reasonable all-hardware representation.
Performance estimation. The synthesizer reports the area cost in terms of CLBs and the critical path delay in nanoseconds for each synthesized block. To minimize the synthesizer's runtime, we do not use timing or area constraints. Without these constraints, the FPGA synthesizer does not optimize the blocks for speed or area. To account for the delay of routing wires (which represents a large fraction of the delay in an FPGA), the FPGA synthesizer uses a statistical approximation method based on a simple wire load model before the CLBs have been placed and routed.
Cosynthesis
Once the execution times for all blocks in both software and hardware and the area cost of the hardware for all blocks have been estimated, each block must be placed in either hardware or software so that the overall execution time of the simulation is minimized. This is a difficult problem because it requires the combined solution of partitioning and scheduling subproblems.
The partitioning of blocks between software and hardware affects scheduling in two ways:
1) The time required for a block to execute depends on whether the block is placed in software or hardware.
2) The execution overlap between software and hardware blocks depends on the placement of all the blocks.
This makes it difficult to evaluate the performance of a particular partition without considering scheduling at the same time. Yet, it is impossible to produce an execution schedule without first selecting a partition.
Our solution to this aspect of the cosynthesis problem is to efficiently find a near optimal solution to the scheduling subproblem and then use this scheduling algorithm in the inner loop of an iterative partitioning algorithm.
Software-hardware scheduling. This step creates an execution schedule that attempts to minimize execution time.
The algorithm starts with a predetermined partition of blocks between hardware and software and uses the dataflow graph to produce a schedule for the CPU. At any point during simulation, the CPU can be in one of four states: executing a software block, writing arguments to a hardware block, reading results from a hardware block, or waiting for a hardware block to finish. Our algorithm attempts to find an execution order for hardware and software blocks that minimizes the time the CPU spends waiting for hardware blocks to finish. Since the amount of CPU time spent in the other three states is fixed, minimizing the waiting time also minimizes the total execution time.
The dataflow constraints between blocks restrict the order in which blocks can be executed. Since the CPU either executes each block entirely or writes its arguments and reads back results, the CPU imposes an order on the execution of blocks.
Since the CPU executes software blocks serially, they are easily ordered. Hardware blocks, however, asynchronously execute in parallel with each other. The CPU enforces a correct order of execution for hardware blocks by serializing communication with the FPGA. Maximizing the parallel execution among hardware blocks and between software and hardware blocks minimizes the time the CPU must wait.
Finding the optimal solution to the software-hardware scheduling problem is intractable; however, using simple heuristics, our algorithm quickly finds near-optimal approximate solutions for the problems encountered in practice. Since the optimal solution can be no better than a solution in which the CPU is never in the waiting state, we can bound the error between our approximate solution and the optimal solution.
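The bound follows directly from the four CPU states: for a fixed partition, no schedule can be shorter than the schedule length minus the waiting time, so the gap to the optimum is at most the waiting time. A small sketch of that calculation follows, with a hypothetical function name and made-up cycle counts.

```python
def schedule_error_bound(total_cycles, wait_cycles):
    """Upper bound on the relative error of a schedule versus the optimum.

    The optimal schedule for the same partition takes at least
    total_cycles - wait_cycles, so (total - optimal) / optimal is at most
    wait_cycles / (total_cycles - wait_cycles).
    """
    lower_bound = total_cycles - wait_cycles
    return wait_cycles / lower_bound


# Made-up example: a 1000-cycle schedule in which the CPU waits for 20 cycles.
bound = schedule_error_bound(1000, 20)
```

In this example the schedule is guaranteed to be within about 2 percent of optimal for the chosen partition.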
Our scheduling algorithm makes the following assumptions:
1) Once a software block begins to execute, it will finish without interruption. The execution of a software block cannot be interrupted to read the results from the FPGA or to write arguments to the FPGA.
2) Where necessary, no-operation instructions (NOPs) ensure that the CPU never reads results from the FPGA prematurely. These NOPs execute regardless of the dynamic execution behavior of software blocks.
3) CPU-to-FPGA reads and writes always take two cycles. The two cycles result from the assumption that in addition to the one-cycle transfer time between the CPU and FPGA, the CPU also must load the argument from, or store the result in, memory. Sometimes another instruction must execute to calculate the source or destination address. Other times a load or store between the CPU and memory is not necessary because the value already exists in the register file or the CPU immediately uses it.
4) A 50-MHz CPU clock frequency determines the number of cycles hardware blocks take to execute.

Scheduling algorithm. List scheduling [12], the basis of the simple algorithm we use to schedule the software and hardware blocks, is computationally inexpensive and an effective technique for a large class of problems [13]. If at any point it is possible to communicate the arguments for more than one hardware block, our algorithm gives priority to the hardware block with the longest execution time. The rationale for this heuristic is to allow the slowest hardware blocks to execute while the CPU is communicating arguments to the FPGA and reading back results for the faster hardware blocks. The algorithm also gives priority to software blocks that have more descendants in hardware. This promotes maximum parallelism between hardware blocks.
Our algorithm maintains three sets of blocks: unscheduled (S_un), executing (S_ex), and scheduled (S_sch). Figure 7 lists the pseudocode.
Software blocks require a variable number of cycles to execute, depending on the dynamic arguments. As discussed earlier, the simulation compiler determines the shortest and average execution times in software for each block. To guarantee that the results from hardware blocks are never read before they are available, the scheduler assumes that software blocks always execute in the shortest possible time. While this pessimistic assumption may result in the CPU spending more time waiting for hardware, in practice we found it has little effect. We calculate the average schedule length by using the average execution times for software blocks that appear in the final schedule.
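The following sketch shows how a list scheduler of this kind could be organized. It is not the pseudocode of Figure 7: the data layout, the choice to issue hardware writes before reading back finished results, and all block names and cycle counts are assumptions made for illustration. Software block times stand for the shortest execution times, and each argument or result transfer costs two cycles as assumed in the text.

```python
COMM_CYCLES = 2  # cycles per CPU-to-FPGA read or write


def list_schedule(blocks):
    """blocks: name -> {"hw": bool, "time": cycles, "preds": [names],
    and for hardware blocks "args" and "results" (transfer counts)}.
    Returns (ordered CPU actions, schedule length, cycles spent waiting)."""
    succ = {name: [] for name in blocks}
    for name, blk in blocks.items():
        for pred in blk["preds"]:
            succ[pred].append(name)

    def hw_descendants(name):
        seen, stack = set(), list(succ[name])
        while stack:
            m = stack.pop()
            if m not in seen:
                seen.add(m)
                stack.extend(succ[m])
        return sum(blocks[m]["hw"] for m in seen)

    unscheduled, executing, done = set(blocks), {}, set()
    order, t, wait = [], 0, 0
    while unscheduled or executing:
        ready = sorted(n for n in unscheduled if set(blocks[n]["preds"]) <= done)
        ready_hw = [n for n in ready if blocks[n]["hw"]]
        ready_sw = [n for n in ready if not blocks[n]["hw"]]
        finished = sorted(n for n, end in executing.items() if end <= t)
        if ready_hw:  # start the slowest ready hardware block first
            n = max(ready_hw, key=lambda m: blocks[m]["time"])
            t += COMM_CYCLES * blocks[n]["args"]
            executing[n] = t + blocks[n]["time"]
            unscheduled.remove(n)
            order.append(("write", n))
        elif finished:  # read back results that are already available
            n = finished[0]
            t += COMM_CYCLES * blocks[n]["results"]
            del executing[n]
            done.add(n)
            order.append(("read", n))
        elif ready_sw:  # prefer software blocks that unlock more hardware
            n = max(ready_sw, key=hw_descendants)
            t += blocks[n]["time"]
            unscheduled.remove(n)
            done.add(n)
            order.append(("run", n))
        elif executing:  # nothing else to do: wait for the earliest finisher
            n = min(executing, key=executing.get)
            wait += executing[n] - t
            t = executing[n]
        else:
            raise ValueError("cyclic dependencies: no block is ready")
    return order, t, wait


# Hypothetical five-block model with two hardware blocks.
blocks = {
    "decode": {"hw": False, "time": 10, "preds": []},
    "alu": {"hw": True, "time": 8, "preds": ["decode"], "args": 2, "results": 1},
    "shift": {"hw": True, "time": 3, "preds": ["decode"], "args": 1, "results": 1},
    "regs": {"hw": False, "time": 6, "preds": ["decode"]},
    "wb": {"hw": False, "time": 4, "preds": ["alu", "shift", "regs"]},
}
order, length, wait = list_schedule(blocks)
```

In this example the software block "regs" runs while both hardware blocks execute, so the CPU never waits and the schedule length is 30 cycles.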
Scheduling results. Despite the relative simplicity of our scheduling algorithm, it produced close to optimal schedules for the examples we considered. Table 1 is a breakdown of the processor time for the state machine, the unpipelined CPU, and the pipelined CPU models scheduled assuming unlimited hardware resources. Table 1 shows that the CPU spends almost no time waiting for hardware blocks to finish executing, and thus the schedules produced are very close to optimal. The success of our scheduling algorithm is due in part to the nature of the Verilog models we experimented with. They have plenty of explicit parallelism and very few data dependencies between blocks.
Software-hardware partitioning. This algorithm assigns each block to either software or hardware so that overall execution time is minimized. The algorithm has two phases: initial partition construction and partition improvement.
To guide the construction of a partition, the algorithm divides the blocks into three sets using the average software execution time (t_sw), hardware execution time (t_hw), and software-hardware communication time (t_comm) for each block.
The following three inequalities determine the members of each set:
Blocks that satisfy the first inequality belong to the software set (S_sw). These blocks can never benefit from being placed in hardware. Blocks that cannot be synthesized into hardware also belong to this set. Blocks that satisfy the second inequality fit into the hardware set (S_hw). The execution time will always improve if these blocks are implemented in hardware. Blocks that satisfy the third inequality belong to the software-hardware set (S_sw-hw). The effect on the execution time due to the placement of these blocks can be either beneficial or detrimental depending on the placement of other blocks.

The first phase of the partitioning algorithm constructs an initial partition using the three sets of blocks; the amount of FPGA resources available constrains this partition. The algorithm permanently assigns the software set blocks to software, and initially places the hardware and software-hardware blocks in software.
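The three set descriptions suggest a simple classification rule. The sketch below is one reading that is consistent with those descriptions, not a transcription of the paper's inequalities: it assumes that a block can never benefit from hardware when communication alone costs at least as much as running it in software, and that it always benefits when software execution takes longer than communication plus hardware execution. The function and parameter names are illustrative.

```python
def classify(t_sw, t_hw, t_comm, synthesizable=True):
    """Assign a block to one of the three sets (assumed form of the inequalities).

    t_sw: average software execution time in cycles
    t_hw: hardware execution time in cycles
    t_comm: CPU cycles spent writing arguments and reading results
    """
    if not synthesizable or t_sw <= t_comm:
        return "software"           # communication alone outweighs software execution
    if t_sw > t_hw + t_comm:
        return "hardware"           # faster even if the CPU waits for the result
    return "software-hardware"      # pays off only if the hardware time is overlapped


examples = {
    "tiny": classify(t_sw=4, t_hw=1, t_comm=6),
    "big": classify(t_sw=20, t_hw=3, t_comm=6),
    "middle": classify(t_sw=8, t_hw=3, t_comm=6),
    "behavioral": classify(t_sw=20, t_hw=3, t_comm=6, synthesizable=False),
}
```

The middle case captures why the third set is placement dependent: the block saves time only when other work hides its hardware execution.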
The partitioning algorithm then tries moving each of the hardware and software-hardware blocks into hardware and measures the improvement in execution time after each move. The block that results in the greatest speedup when it is the only block in hardware is assigned to hardware. The process then repeats; each of the remaining unassigned hardware and software-hardware blocks is again moved into hardware. The block that results in the greatest speedup when it is moved into hardware along with the block already fixed there is itself fixed in hardware. This process iterates until either no move results in an improved execution time or no more blocks can fit in the FPGA. Figure 8 gives the pseudocode for this phase of the algorithm.
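A greedy loop of this shape can be sketched as follows. This is not the pseudocode of Figure 8: the `exec_time` callback stands in for the scheduler, and the toy cost model below simply charges each block its communication time when in hardware and its software time otherwise. All names and numbers are assumptions.

```python
def initial_partition(candidates, area, capacity, exec_time):
    """Greedily fix blocks in hardware while the schedule keeps getting shorter.

    candidates: set of blocks allowed in hardware
    area: block -> CLBs required
    capacity: CLBs available
    exec_time: callable(frozenset of hardware blocks) -> estimated cycles
    """
    in_hw, used = set(), 0
    best_time = exec_time(frozenset())
    while True:
        best_block = None
        for blk in sorted(candidates - in_hw):
            if used + area[blk] > capacity:
                continue  # block does not fit in the remaining CLBs
            t = exec_time(frozenset(in_hw | {blk}))
            if t < best_time:
                best_time, best_block = t, blk
        if best_block is None:
            return in_hw, best_time  # no move improves the schedule or fits
        in_hw.add(best_block)
        used += area[best_block]


# Toy cost model (hypothetical numbers), standing in for the scheduler.
sw = {"a": 10, "b": 8, "c": 5}
comm = {"a": 4, "b": 6, "c": 6}
area = {"a": 30, "b": 20, "c": 10}


def exec_time(hw_blocks):
    return sum(comm[b] if b in hw_blocks else sw[b] for b in sw)


small = initial_partition(set(sw), area, 40, exec_time)
large = initial_partition(set(sw), area, 50, exec_time)
```

With 40 CLBs only block "a" is fixed in hardware; with 50 CLBs block "b" also fits and shortens the schedule further, while "c" is never moved because its communication cost exceeds its software time.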
The second phase of the partitioning algorithm iteratively improves upon the initial partition. Each block from the fixed-in-hardware set is moved back into the placed-in-software set. The moved block stays in software as the algorithm from the first phase tries moving other blocks from the placed-in-software set into hardware to fill the space just vacated. The algorithm uses the partition that results in the greatest speedup as the new initial partition, and repeats the procedure.
This second phase finishes when a partition has no single element that can be removed from the fixed-in-hardware set without degrading performance. Since a shorter execution schedule is being found on each iteration, the algorithm is guaranteed to eventually converge on a locally optimal solution. While theoretically this convergence could be slow, it was quite rapid for our example models.
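The second phase can be sketched as a loop around the same greedy fill. As before, this is an illustration under assumptions rather than the authors' code: `exec_time` is a stand-in for the scheduler, the cost model is a toy, and the names are hypothetical.

```python
def greedy_fill(in_hw, candidates, area, capacity, exec_time, banned=frozenset()):
    """Add blocks to hardware one at a time while the schedule keeps improving."""
    in_hw = set(in_hw)
    best = exec_time(frozenset(in_hw))
    while True:
        used = sum(area[b] for b in in_hw)
        trials = [(exec_time(frozenset(in_hw | {b})), b)
                  for b in sorted(candidates - in_hw - banned)
                  if used + area[b] <= capacity]
        if not trials or min(trials)[0] >= best:
            return in_hw, best
        best, blk = min(trials)
        in_hw.add(blk)


def improve(in_hw, candidates, area, capacity, exec_time):
    """Move each hardware block back to software, refill, and keep the best result."""
    current, best = set(in_hw), exec_time(frozenset(in_hw))
    while True:
        options = []
        for blk in sorted(current):
            # The removed block stays in software while the space is refilled.
            trial, t = greedy_fill(current - {blk}, candidates, area,
                                   capacity, exec_time, banned={blk})
            options.append((t, sorted(trial)))
        if not options or min(options)[0] >= best:
            return current, best  # no single removal helps: locally optimal
        best, chosen = min(options)
        current = set(chosen)


# Toy model in which the greedy start ({"a"}) is not the best use of 40 CLBs.
sw = {"a": 10, "b": 7, "c": 7}
comm = {"a": 3, "b": 2, "c": 2}
area = {"a": 40, "b": 20, "c": 20}


def exec_time(hw_blocks):
    return sum(comm[b] if b in hw_blocks else sw[b] for b in sw)


result = improve({"a"}, set(sw), area, 40, exec_time)
```

Because each accepted step strictly shortens the estimated schedule and the number of partitions is finite, the loop terminates, which mirrors the convergence argument in the text.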
To determine the sensitivity of the final solution to the criterion used to select blocks for movement into the fixed-in-hardware set, we experimented with several different heuristics. In addition to using the best improvement measure given in Figure 8, we also experimented with smallest block and best improvement/CLB measures. As we show later, these heuristics do not significantly affect the partition on which the iterative improvement algorithm converges.
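The three selection measures can be expressed as interchangeable scoring functions. The definitions below are natural readings of the measure names, offered as assumptions rather than the paper's exact formulations, and the candidate moves are made-up numbers.

```python
def best_improvement(gain, clbs):
    """Prefer the move that saves the most cycles."""
    return gain


def smallest_block(gain, clbs):
    """Prefer the move that consumes the fewest CLBs."""
    return -clbs


def best_improvement_per_clb(gain, clbs):
    """Prefer the move that saves the most cycles per CLB consumed."""
    return gain / clbs


def pick(moves, measure):
    """moves: block -> (cycles saved, CLBs needed); only improving moves qualify."""
    improving = {blk: m for blk, m in moves.items() if m[0] > 0}
    return max(sorted(improving), key=lambda blk: measure(*improving[blk]))


# Hypothetical candidate moves for one iteration of the partitioner.
moves = {"alu": (120, 60), "shift": (50, 10), "flag": (8, 4)}
choices = {
    "best improvement": pick(moves, best_improvement),
    "smallest block": pick(moves, smallest_block),
    "best improvement/CLB": pick(moves, best_improvement_per_clb),
}
```

Swapping the scoring function changes only which block is tried first, which is consistent with the observation that the final partition is largely insensitive to the measure.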
Performance results
Experiments with our software-hardware simulation compiler produced modest performance results. The experimental Verilog example programs range in complexity from a simple state machine, similar to the one shown in Figure 3, to a pipelined processor that executes a subset of the MIPS instruction set architecture [14]. Table 2 lists the key characteristics of the three Verilog example programs.
We calculated speedups by comparing the execution times for the all-software simulations compiled with full compiler optimizations to the execution times for the software-hardware simulations, as estimated using the profiling data gathered in the software performance estimation section.
Assuming unlimited FPGA resources, our cosynthesis approach achieves speedups over the all-software simulator ranging from 1.07 for the trivial state machine to 2.04 and 2.76 for the CPUs (see Table 3). We compared the results of our algorithm for the unpipelined model with the optimal result found through exhaustive search; the two results were the same. These speedups represent the maximum speedup possible with our proposed approach. The speedups are somewhat low because the all-software implementation benefited from compiler optimizations that we did not incorporate into the software-hardware implementations. The software-hardware implementations are up to four times faster than an all-software implementation restricted to the same limited C compiler optimizations.
IEEE Micro

For the limited FPGA resources case, we present the results of experiments that vary the amount of FPGA resources and the heuristic used to generate the partition. These results show how the speedup of the software-hardware approach relates to the amount of FPGA resources available and how the heuristic used affects the speedup achieved by the iterative improvement partitioning algorithm. Figure 9 and Figure 10 show the speedup versus number of CLBs for the unpipelined and pipelined CPU models using the three partition generation heuristics described earlier. Figure 9 also compares the performance of the different heuristics used by the iterative improvement partitioning algorithm. Best improvement/CLB and best improvement heuristics produce nearly identical results, while smallest block performs slightly worse. This indicates that the iterative improvement algorithm is relatively insensitive to the heuristic used, but that a performance-directed heuristic is preferable to one that conserves FPGA resources.

Figure 11 and Figure 12 provide a breakdown of CPU execution time for the unpipelined and pipelined CPU models. As the number of CLBs used increases, the ratio of time spent communicating between the CPU and the FPGAs increases, limiting the speedup gained by placing more blocks in hardware. The ratio of CPU time devoted to communication increases as more blocks are added to hardware, both because more communication is required and because fewer blocks execute in software.
The ratio of unpipelined model time spent communicating increases from 23.

WE BASED OUR APPROACH to a fully automated software-hardware cosynthesis method for simulating Verilog HDL models on accurate measurements of execution times in software and hardware. As a result, we can accurately evaluate the performance benefits of placing a block in hardware or software. By combining partitioning with an efficient scheduling algorithm, we placed blocks in hardware or software so that performance is maximized. With unlimited FPGA hardware, our partitioning algorithm produced an optimal partition for all our experimental models. Using these techniques achieves modest, but significant, speedups over all-software simulation. Furthermore, in our experiments a single FPGA chip produced most of the performance benefits of a software-hardware simulator.
This approach requires further refinement. The scheduling results show that communication severely limited the speedup possible with more FPGA resources. Communication costs significantly degraded the performance of our software-hardware simulator. Thus we see reducing communication overhead as an important goal.
To reduce communication overhead, we could enhance the partitioning algorithm by using the dataflow analysis from VCC to partition blocks between software and hardware such that communication is considered and minimized where beneficial. Ideally, when the CPU writes to the FPGA, the data should be reused by as many hardware blocks as possible. Furthermore, it should be possible to place blocks in the FPGA so that the results of one hardware block can be passed as arguments to another hardware block without requiring CPU intervention.
