Platform-based design faces an essential tension: it wants to reuse previous designs to leverage IP; but new designs must necessarily include work that is not complete or well-characterized. Synthesis and analysis methods are therefore important elements in platform-based design methodologies. This paper describes two aspects of this problem. First, we consider the first-generation dilemma, in which we must use additional analysis in order to create the first generation of a platform. Second, we describe a synthesis method that creates accelerators for platforms whose clock periods have been optimized for the system characteristics.
Introduction
Platform-based design has become a leading design methodology for systems-on-chips (SoCs). Platform-based design uses existing designs as starting points for new system implementations. By building from existing designs, platform-based design leverages IP to reduce design time and risk. However, new implementations must have some new elements to modify the platform and create a unique system. This paper concerns itself with two aspects of the adaptation of new components to platform-based design.
The first problem described in this paper is the first-generation dilemma. Platform-based design works because we have an existing design that we can analyze. However, when designing the first generation of a platform for a particular application space, we do not have an existing implementation.
The second problem we consider is the synthesis of performance-tuned hardwired accelerators. Platform-based designs generally include embedded processors but may also need specialized hardware accelerators for performance and/or power considerations. Previous work has not considered how to tune the clock rate of such accelerators during synthesis. We have developed a co-synthesis algorithm that uses high-level synthesis to characterize the clock rate of the accelerator so that the accelerator's design can be tuned to the system requirements.
The next section surveys previous work. Section 3 provides some general observations on platform-based design. Section 4 concentrates on design methodologies. Section 5 describes the first-generation dilemma. Section 6 describes our work on the synthesis of accelerators with custom clock rates.
Related Work in System-Level Design Tools
In a few years, over one billion transistors will be available on a single chip, 1 making systems-on-chips commonplace. Such ample resources allow designs that are both cheaper and more functional, but these designs are too difficult and expensive for traditional RTL design methodologies. System-level design methods were introduced to absorb the growing complexity and accelerate the design of larger systems. They take advantage of IP cores and of the observation that, in complex designs, the right system-level design choices can save much more than the right lower-level design choices. Many system-level design tools have been announced in recent years, including CoWare N2C, 2 Cadence VCC (Virtual Component Co-design), Summit Visual Elite, 3 Elanix SystemView, 4 and Synopsys CoCentric System Studio. 5
Many types of system-level design tools have been developed. Modeling languages and simulation systems allow system designs to be described at a relatively high level of detail. Interface synthesis tools create hardware and software interfaces between components such as software running on microprocessors and I/O devices. Synthesis tools have been developed for specialized design domains such as signal processing and reactive systems.
Hardware/software co-synthesis 6 is the process of designing the hardware and software modules to meet performance, power and cost goals for an embedded system. The output of a co-synthesis tool is a cost-minimal distributed computing system that meets all system specifications and constraints. There is a great deal of previous work in hardware/software co-synthesis. [6] [7] [8] Hardware/software partitioning algorithms were the first type of co-synthesis algorithm. They implement the system specification on some sort of architectural template, usually a single CPU with one or more ASICs connected to the bus. Distributed system co-synthesis does not use an architectural template to drive co-synthesis. Instead, it creates a multiprocessor architecture for the hardware engine. The target architecture is usually heterogeneous in both its processing elements and its communication channels. It can employ multiple CPUs, ASICs and FPGAs.
The SOS system, 9 which was the first hardware/software co-synthesis method, is an exact method that uses mixed integer linear programming (MILP). Its authors reported run times exceeding 10 000 CPU seconds for small examples. Optimal approaches are therefore practical only for small task graphs. As a result, most researchers use heuristics to find a solution quickly and efficiently. There are two distinct approaches in the heuristic domain: iterative and constructive. The iterative approach begins with an initial solution and improves it. 7,10 A constructive algorithm builds the solution step by step, and a complete solution is not available until the algorithm is finished. 8 MOGAC 11 deploys an adaptive multiobjective genetic algorithm; both cost and power consumption are optimized while the hard real-time constraint is met.
Platform-Based Design
Platform-based design is one recent answer to the demands imposed by Moore's Law. Embedded processors help us separate functional and performance design: we can re-use a CPU logic design that has been carefully tuned for performance and use programs to add the required functionality. Embedded processors can reduce design time and make design processes more predictable. But applications such as network infrastructure, multimedia, and cellular telephony require high performance on real-time deadlines that cannot be met by a simple CPU system. As a result, system-on-chip designers must be able to design custom, heterogeneous multiprocessors. Traditional multiprocessors used for scientific and database computing have regular architectures, with the same architecture used for many different applications. Today's systems-on-chips need custom multiprocessor architectures because they must fit the entire machine on a single chip. Once a part of the system goes off chip, performance goes down and power goes up due to the lower bandwidth and longer delays of chip-to-chip connections. Keeping the entire multiprocessor on a single chip often requires creating more heterogeneous architectures that trim some resources in order to put more logic where it provides the most benefit for the application at hand.
Many markets are large enough that it makes sense to design a custom heterogeneous multiprocessor rather than take an architecture off-the-shelf. However, system design is never cheap and companies are always looking for ways to reuse designs. Luckily, many markets are large enough to support several products that require similar underlying architecture. This is particularly true in standards-based markets, in which standards committees set the basic requirements on products and set the framework for competition.
But in any standards-driven market, companies need and want to differentiate their products. Standards generally define what needs to be done and allow the implementers freedom in how to perform the tasks required of the standard. Companies can add value to their systems by using novel algorithms that provide improvements in certain tasks required in the standard; improved quantization algorithms are one example in multimedia. They can design for low power, as is common in cell phones. They can also add user interface or other features that are not controlled by the standard; cell phone manufacturers often differentiate themselves with phone directories and other user interface features.
Platform-based design [12] [13] [14] [15] is the approach that many companies are now using to make the most effective use of system-on-chip technology. Platforms capture heterogeneous multiprocessors in ways that allow system-on-chip designers to customize those designs for the needs of a particular product.
The term platform-based design has been used in a variety of ways by different authors. We use the term platform to mean a specific system architecture that has been architected to be reused over several different products. Reuse may be explicit (for example, by adding a reconfigurable logic block that can be programmed to provide a unique function) or implicit (for example, by providing enough performance and programmability to provide flexibility in how important operations are implemented). By architecture we mean both hardware and software: a successful platform often includes libraries and application program interfaces (APIs) that provide services based on the underlying hardware's capability.
A platform may, for example, be a bus-based system that includes multiple CPUs, special-purpose accelerators (custom logic designed to perform a particular operation), I/O devices, and on-chip memory. The types and numbers of I/O devices and the amount of memory required are clearly application-dependent choices. The numbers and types of CPUs, the types of additional accelerators, and the bandwidth of the bus are also very important design decisions that must be made based on the application's characteristics. One example of a platform is the TI OMAP platform. 16 This architecture includes an ARM CPU and a TI DSP in a multiprocessor configuration. OMAP is designed for use in cell phones and other telecom applications.
A platform for systems-on-chips is usually designed by a semiconductor house and used by a systems house. The platform can be used by the semiconductor house as a form of advertisement for their capabilities in a given area, such as wireless infrastructure or multimedia. Systems houses who have a particular product in mind can then use the semiconductor house's platform as a starting point for their chip design.
The platform could be customized for the systems house customer in a variety of ways. The simplest way is simply to execute new software on the platform. However, this method is more complex than simply loading new software on a PC. Most embedded systems have real-time requirements and the execution time of the software depends on the characteristics of the underlying hardware platform. Similarly, many embedded systems have power constraints and the power consumed by executing programs depends on the platform characteristics. It may take a great deal of tuning to get the customer's software running properly on the platform.
A platform may also contain a programmable logic block, such as a field programmable gate array (FPGA) core. The FPGA core could be programmed by the customer to add customized logic for I/O or specialized compute-intensive operations. Both software and FPGA customization have the advantage that they do not require new masks (assuming that the software or FPGA personality can be loaded from external ROM in the envisioned system). Masks are increasingly expensive ($600 000 for a 0.18 micron mask set), and the smaller production runs incur additional costs.
Augmenting Platform-Based Design with Synthesis Tools 129
However, a platform may also be customized by modifying the original platform design. Changes may include I/O devices, the amount of memory, the partitioning of memory between cache and main memory, the addition of specialized computational accelerators, etc. Hardware changes allow the systems house to further differentiate their product from others based on the same platform.
A platform-based system-on-chip design goes through two stages. The first stage creates the platform. The platform design process considers the needs of the application domain, including what is required by relevant standards and other descriptions of requirements. The platform design may also take advantage of experience from previous generations of products and platforms. The platform design stage should also develop software that allows programmers to make use of the platform's capabilities. Of course, the software libraries for the CPUs used in the platform form a good software base. Beyond that, the platform needs to provide software to take advantage of peripheral devices and computational accelerators. The platform software also needs to provide facilities to manage multiprocessing, interprocess communication, etc.
The second stage creates a product using the platform. This stage must take into account the needs of the customer. The work of turning the platform into the product may be done by a combination of personnel from the systems and semiconductor houses. The semiconductor house designers are most experienced with the platform itself, while the systems house designers understand their customization requirements. The suitability of the platform can be evaluated by how completely the customer's requirements are satisfied by the platform and how quickly the customer's product design can be completed.
If this process works well, much of the hard work should be done at the platform design stage. Many of the important design decisions for this application domain should be taken at the platform design stage. That stage may entail running a large number of experiments to evaluate the effects of various architectural choices on performance and power. Tailoring the platform to the needs of a particular product should go quickly.
Platform reuse can be either vertical or horizontal. Vertical reuse means using the same platform architecture across several generations of IC technology. In contrast, horizontal reuse means using the same platform over several different products in the same technology generation. Horizontal reuse may initially seem less efficient since lower-end products may not be able to take advantage of all the features of the platform needed for the high end of the product mix. However, relatively small increases in chip manufacturing costs can be offset by reduced design cost and simplified inventory. Furthermore, a new generation of chip technology often provides enough new opportunities that it is worth rethinking the platform. As a result, most reuse is horizontal.
Platform-Based Design Flows
In this section, we will use Cadence VCC (version 2.1) to illustrate a typical platform-based design flow. We will concentrate on performance-constrained designs in this section, but many of our remarks also pertain to power/energy-constrained designs.
Design flow
Figure 1 shows the design flow. The VCC tools support the flow from specifications to register-transfer design; other tools complete the design to layout. In addition to supporting back-end tools, a variety of system-level design tools must be able to talk to each other. For example, Cadence VCC can export designs to the Mentor Seamless co-verification system. VCC divides hardware/software allocation into several steps: behavioral modeling; architectural modeling; and mapping. Behavioral modeling is a key step for several reasons:
• It details the functional specification.
• Parts of the behavior will end up as object code running on embedded CPUs.
• Other parts of the behavior will result in custom hardware.
Behavior models could be written in many languages, such as C, C++, SDL, and STD. In our experience, C and C++ are convenient because they are so widely used. When writing a behavior model, designers need to choose between a software-oriented and hardware-oriented style of writing code. A software-oriented style is better suited to behavior that will end up implemented in software, while a hardware-oriented style is better suited to hardware implementation. There is currently no compilation technology that lets us easily translate between hardware-oriented and software-oriented descriptions in any language despite many years of work in the area.
As described in Table 1, VCC has three categories of behavior models: black box, white box, and clear box. Each has distinct uses. For example, a black-box model can be used for functional simulation but not performance analysis or synthesis; it could be written in C++, SPW, SDL, or OMI. The last three language forms are used to import models from other tools. Designers can build a separate performance model to accompany a black-box model. This is a totally manual process and there is no way to verify the relationship between the black-box and performance models.
It is almost impossible to transform a behavior model from one category to another without an overhaul since the categories use different languages. For this reason starting the design process from a white box behavior model gives more flexibility; there are also many well-known techniques for manually rewriting C in other languages.
The architecture model is built from IP cores and custom modules, connected together to create the desired architecture. VCC can analyze the performance of IP cores and custom modules using abstract performance models; it does not need the full implementation of an IP core, which may not be available due to licensing. By mapping a behavior model onto an architecture model, VCC allocates the elements of the behavior model to software and hardware. The architecture's performance is analyzed based upon the mapping, and any necessary adjustments follow. Interface refinement chooses proper communication methods among hardware and software to meet the performance requirements, usually after major mapping decisions have been made. Finally, the design can be exported to a co-verification tool and other tools to finish synthesis and object code generation.
Performance analysis
Performance is a key metric used by designers to evaluate the mapping of behavior onto architectures. Clearly, performance analysis must be sufficiently accurate to avoid suggesting inappropriate changes to the designer. VCC needs a performance model for every IP core and custom module as well as some behavior models. To use the performance model, VCC first extracts scheduling information from the behavior, then, based upon mapping, analyzes the system schedule to determine performance. VCC does not help designers create the initial performance estimates for modules. If the module already exists, performance information can be extracted from the design. However, if the module has not yet been built, designers are left to their own devices to estimate performance. Creating a performance model with inaccurate timing information will lead to inaccurate system-level performance analysis results and possibly to incorrect design decisions.
The First-Generation Dilemma
The platform-based design methodology described in the last section is organized to converge on a good design quickly by iterative analysis and refinement of the design. Platforms are rarely, if ever, designed independent of an application or product that will be the first instance of the platform; platforms are also typically improved over their lifetimes. At each step in the evolution of the platform, the design's performance is analyzed and the design is modified in order to eliminate problems that keep the design from achieving its performance goals. Since performance is based on high-level models, the design is then reanalyzed to determine the success of the transformation. This process continues until the design meets its performance goals.
However, in the first generation of the platform, many of the components to be used are not yet available. IP from outside vendors may not have been selected or purchased; custom modules may not yet have been designed. In fact, the selection of IP and the requirements for custom modules will often be determined by architectural choices that have not yet been made. This poses a problem for platform-based design: designers depend on architectural performance analysis to help select components, but the components are not yet available. (We made a similar observation about hardware/software co-design: 10 allocation and scheduling are intimately related, and some way to break the dependency between them must be found to start the design process.) We refer to this problem as the first-generation dilemma.
In order to solve this problem, we need to update our platform-based design methodology, as shown in Fig. 2. Rather than trying to achieve performance constraints through mapping adjustment and architecture model refinement, designers should find out what the performance constraints on the whole system mean for new modules in a particular mapping and architecture model:
• An initial candidate mapping and architecture model is created.
• Available IP and custom modules are inserted into the design.
• The performance models of unavailable IP modules are derived based upon the performance of the available components and the interactions between the components in the architecture.
Richter et al. 18 describe a methodology for deriving the performance constraints of a component from a partial design. This methodology combines deadline and rate constraints from the system and existing components with structural constraints from the platform design to create a full set of constraints on the un-designed components.
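As a rough illustration of the idea, and not of the method of Richter et al., the sketch below derives an execution-time budget for one undesigned module on a simple serial chain: the end-to-end deadline minus the WCETs of the modules that already exist. The function name, the chain-only structure, and the numbers are assumptions made for this example.

```python
def derive_budget(deadline, known_wcets):
    """Return the time left for the missing module on a serial chain.

    deadline    -- end-to-end deadline of the chain (any time unit)
    known_wcets -- WCETs of the modules that already exist (same unit)
    """
    budget = deadline - sum(known_wcets)
    if budget <= 0:
        # The existing modules already consume the whole deadline.
        raise ValueError("no time left for the missing module; "
                         "the architecture must be revised")
    return budget


# Hypothetical chain: three existing modules and one undesigned module.
print(derive_budget(deadline=100, known_wcets=[20, 35, 15]))  # prints 30
```

A budget obtained this way becomes the requirement that candidate IP or a custom synthesis feasibility study must meet; a non-positive budget signals that the architecture itself has to change.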
At this point, the designer has a complete model. The requirements of unavailable components need to be evaluated against existing IP or against feasibility studies for custom synthesis. Infeasible requirements can be satisfied in several ways. The architecture can be redesigned to change the requirements of the component. The component can also be implemented in a different design methodology that does not possess all of the limitations of the desired methodology; for example, hard IP may be necessary even though soft IP would be more desirable.
Synthesis of Custom Accelerators
As mentioned in Sec. 3, we use the term accelerator to denote a custom hardware block. Accelerators often provide performance and power improvements relative to embedded software implementations. However, the accelerator must be carefully designed to ensure that it really speeds up the system and that it is cost-effective. In this section we introduce synthesis methods that let us choose the right clock period for an accelerator based on the system characteristics.
In this section, we will first overview previous work on this problem, then describe our models and algorithms, and finally present results of our synthesis algorithm.
Accelerator performance analysis
Previous work in co-synthesis has used simple methods to predict the results of the downstream synthesis used to implement the custom modules created by co-synthesis. Either very simple models were used, which provide inaccurate and misleading results, or synthesis tools were run directly, which takes too much time.
Henkel and Ernst 19 developed path-based high-level estimation techniques for use in hardware/software co-synthesis. However, previous work has not concentrated on optimizing the clock period of the accelerator in the system context.
Our co-synthesis algorithm is closely coupled to a high-level synthesis tool in order to quickly provide accurate estimates of cycle time. The Monet high-level synthesis tool 20 from Mentor Graphics provides scaffolding for analyzing the performance of candidate custom module implementations during co-synthesis. Monet's efficient synthesis algorithms allow us to quickly estimate the results of high-level synthesis, which provides the worst-case execution time and area required to evaluate candidate custom module designs.
By using Monet to generate multiple custom module implementations and analyze their performance, we developed a heuristic iterative improvement algorithm 21, 22 for distributed embedded system co-synthesis. The algorithm takes into account the impact of different custom module implementations of tasks on system performance and cost. It targets conditional task graphs, which capture the control dependencies in applications. Figure 3 shows the design space for one custom module. Designers may be interested in various sections of the design space: fast implementations, which generally cost more area; or small-area designs, which are often slower. A design may also call for an intermediate design that runs at moderate speed but does not use excessive area.
Co-synthesis models
Our co-synthesis problem is specified by a set of real-time applications, an architecture template, and a technology library.
The real-time applications are periodic. They may run at their own rates, with no restriction on the relative or absolute rates. The set of applications is modeled by a task graph, 6 which is a directed acyclic graph, as shown in Fig. 4. Nodes represent tasks or processes. The directed edges represent data dependencies between tasks: the edge A → B implies that task B cannot start execution until A is finished. Data dependency edges ensure the correct order of execution. Each edge is weighted with a scalar that specifies the amount of data that must be transferred between the two connected nodes. Each connected subgraph in the task graph is assigned a weight that describes its periodic execution rate. The deadline of the task graph can be smaller or larger than the period; for simplicity, we assume here that the deadline is equal to the period. The target architecture is a heterogeneous shared memory multiprocessor, as shown in Fig. 5. It includes one or more processing elements (PEs), which may be either CPUs or accelerators. Each CPU has its private instruction cache and data cache. We use the task-level cache performance model of Li. 7 Each task can have many implementation options differing in processing element type, cost and execution time.
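One possible in-memory form of this task graph model is sketched below. The class and field names are hypothetical, the per-PE-type WCET table stands in for the implementation options of each task, and the sketch does not check that the graph is acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # WCET for each processing-element type on which the task can run.
    wcet: dict = field(default_factory=dict)


@dataclass
class TaskGraph:
    period: float                               # deadline assumed equal to period
    tasks: dict = field(default_factory=dict)   # name -> Task
    edges: dict = field(default_factory=dict)   # (src, dst) -> data volume

    def add_task(self, name, wcet):
        self.tasks[name] = Task(name, dict(wcet))

    def add_edge(self, src, dst, data):
        if src not in self.tasks or dst not in self.tasks:
            raise KeyError("both endpoints must be tasks in the graph")
        self.edges[(src, dst)] = data

    def predecessors(self, name):
        return [s for (s, d) in self.edges if d == name]

    def successors(self, name):
        return [d for (s, d) in self.edges if s == name]


# Hypothetical two-task graph: B cannot start until A has finished.
g = TaskGraph(period=50)
g.add_task("A", {"cpu": 12, "accel": 3})
g.add_task("B", {"cpu": 20})
g.add_edge("A", "B", data=64)
```

Keeping the WCET table on the task rather than on the PE mirrors the way the technology library is described: one entry per task and processor type.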
The technology library describes the components of both the architecture model and the task graph. It describes the various types of CPUs available for the architecture. It also describes the worst-case execution time (WCET) of each task in the task graph. Each task description includes the WCET for every type of processor on which the process can be implemented. If a task can be implemented as an accelerator, then there is a related behavioral VHDL file for this task.
The co-synthesis algorithm allocates processes to PEs and chooses the number and types of components in the target architecture from the technology library, such that the applications can be scheduled to meet their performance constraints (deadlines) and the total cost of the resultant system is minimized.
Co-synthesis algorithm
Our co-synthesis algorithm uses an iterative improvement strategy. The major steps in the algorithm are pre-processing, construction of an initial solution, iterative cost reduction, accelerator cost reduction, and final allocation and scheduling. During pre-processing, we use the Monet architectural exploration system to estimate bounds on performance for each synthesizable task from a behavioral VHDL description. The speed (performance) and the area (cost) are estimated for the following two implementations:
• the fastest accelerator implementation;
• and the smallest accelerator implementation.
This information is put into the technology library. We assume that the cost of the accelerator is proportional to the area of the accelerator, which is reasonable in system-level design. 23
The initial solution is constructed by assigning each task in the task graph the fastest PE that is available for the task. If the PE is a CPU, then the instruction and data caches required for that task's program or data size are added to that CPU. If the PE is an accelerator, then the task is implemented as the fastest accelerator, whose worst-case execution time and area were extracted during pre-processing. The performance of the initial solution is evaluated, assuming the communication delay between PEs is zero. If it cannot meet the deadline constraints, there exists no feasible design given the current technology library, and the algorithm stops without a solution.
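The construction of the initial solution can be sketched as follows. The sketch assumes that every task receives its own PE instance (so there is no resource contention), ignores cache sizing, and evaluates performance as the longest path through the task graph with zero communication delay; the function and argument names are hypothetical.

```python
def initial_solution(wcet, edges, deadline):
    """Map every task to its fastest PE type and test feasibility.

    wcet     -- {task: {pe_type: WCET}}
    edges    -- list of (src, dst) data dependencies (acyclic)
    deadline -- deadline of the task graph
    Returns (mapping, finish_time), or None if the deadline is missed.
    """
    # Fastest PE type for each task.
    mapping = {t: min(opts, key=opts.get) for t, opts in wcet.items()}
    preds = {t: [] for t in wcet}
    for src, dst in edges:
        preds[dst].append(src)

    finish = {}

    def finish_time(task):
        # Longest path to the end of `task`, with zero communication delay.
        if task not in finish:
            start = max((finish_time(p) for p in preds[task]), default=0)
            finish[task] = start + wcet[task][mapping[task]]
        return finish[task]

    total = max(finish_time(t) for t in wcet)
    return (mapping, total) if total <= deadline else None


# Hypothetical library: task C has no accelerator implementation.
wcet = {"A": {"cpu": 12, "accel": 3},
        "B": {"cpu": 20, "accel": 5},
        "C": {"cpu": 8}}
edges = [("A", "B"), ("A", "C")]
print(initial_solution(wcet, edges, deadline=15))
print(initial_solution(wcet, edges, deadline=10))  # prints None
```

Returning `None` corresponds to the case in which no feasible design exists for the given technology library.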
The initial solution generally uses too much hardware. Iterative improvement creates an architecture that meets the performance goals at a more reasonable hardware cost. Since the accelerator implementation is usually faster than the software implementation on a CPU for a specific task, most or all of the PEs in the initial architecture will be accelerators. As a result, iterative improvement generally reduces the number of accelerators in favor of CPUs. The accelerators that are left after iterative improvement are generally performance-critical processing elements.
A single iteration of cost reduction consists of two procedures: accelerator to CPU substitution and CPU cost reduction. The accelerator to CPU procedure tries to move tasks from accelerator to CPU (from hardware to software), while the CPU cost reduction procedure reduces the CPU cost and cache cost.
We use two heuristics to select tasks on accelerators and move them from accelerators to CPUs. The first heuristic is related to the difference between the hardware speed and software speed. If the magnitude of the difference D (D = WCET of fastest accelerator - WCET of fastest CPU) is small, then implementing the task as an accelerator will not provide much speedup compared to a CPU. Therefore this task is a good candidate to be implemented on a CPU to reduce cost. The second heuristic uses the task graph's critical path. If a node is not on the critical path, it is a good candidate to be moved from accelerator to CPU. Since the critical path moves as the tasks are reallocated, we use a simple method to determine the critical path and to decide whether a node is on it. 24
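The Earliest Start Time (EST) for a node is the first time at which the node can be executed. The Latest Start Time (LST) is the last time at which the node can start without delaying the completion of the task graph. For node i, which is allocated to PE(node i), the execution time used in these calculations is the WCET of the node on that PE. After calculating the EST and the LST for each node in the task graph, if EST(node i) = LST(node i), then node i is on the critical path.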
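The critical-path test can be illustrated with a textbook forward and backward pass over the task graph. This is offered as an assumption rather than as the exact expressions used by the algorithm: it ignores communication delay and PE contention, and it measures the LST against the length of the longest path rather than against the deadline.

```python
def critical_path_nodes(wcet, edges):
    """Return the set of nodes whose EST equals their LST.

    wcet  -- {node: WCET on its allocated PE}
    edges -- list of (src, dst) dependencies forming a DAG
    """
    preds = {n: [] for n in wcet}
    succs = {n: [] for n in wcet}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # Topological order (Kahn's algorithm).
    indeg = {n: len(preds[n]) for n in wcet}
    ready = [n for n in wcet if indeg[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # Forward pass: a node starts when its last predecessor finishes.
    est = {}
    for n in order:
        est[n] = max((est[p] + wcet[p] for p in preds[n]), default=0)
    makespan = max(est[n] + wcet[n] for n in wcet)

    # Backward pass: a node must finish before its earliest successor LST.
    lst = {}
    for n in reversed(order):
        latest_finish = min((lst[s] for s in succs[n]), default=makespan)
        lst[n] = latest_finish - wcet[n]

    return {n for n in wcet if est[n] == lst[n]}


# Hypothetical diamond graph: A -> {B, C} -> D.
nodes = {"A": 3, "B": 5, "C": 2, "D": 4}
deps = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(sorted(critical_path_nodes(nodes, deps)))  # prints ['A', 'B', 'D']
```

In this example node C has three time units of float, so it is the natural candidate to move from an accelerator to a CPU. With non-integer WCETs, the equality test would need a small tolerance.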
A single iteration of the CPU cost reduction procedure tries to eliminate lightly loaded CPUs after moving the tasks on those CPUs to other CPUs. The CPUs in the current design are ordered by their workload and the algorithm starts from the most lightly loaded CPU. For each CPU, we identify the tasks on it that can be executed on other CPUs; these tasks are then moved to the other CPUs that provide the best performance for the tasks; the cache sizes of the other CPUs increase to accommodate the new tasks. The CPU is removed if it has no tasks left. When the tasks on a CPU cannot be moved to other CPUs, the algorithm tries to replace the current CPU with a cheaper alternative. Finally an attempt is made to cut its instruction and data cache size.
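A much simplified sketch of the first part of this procedure follows. It replaces the scheduler with a single per-CPU utilization bound, chooses the least-loaded CPU with room as the destination instead of the CPU with the best performance for the task, and omits CPU type substitution and cache resizing; all names and numbers are hypothetical.

```python
def remove_lightly_loaded_cpus(load, capacity):
    """Greedy sketch: empty the least-loaded CPUs into the remaining ones.

    load     -- {cpu: {task: execution time per period}}
    capacity -- time available on every CPU per period
    Returns a new allocation with emptied CPUs removed.
    """
    alloc = {cpu: dict(tasks) for cpu, tasks in load.items()}
    # Visit CPUs from the most lightly loaded to the most heavily loaded.
    for cpu in sorted(alloc, key=lambda c: sum(alloc[c].values())):
        others = [c for c in alloc if c != cpu]
        trial = {c: dict(alloc[c]) for c in others}
        movable = True
        for task, time in alloc[cpu].items():
            # Candidate destinations that still have room for the task.
            fits = [c for c in others
                    if sum(trial[c].values()) + time <= capacity]
            if not fits:
                movable = False
                break
            target = min(fits, key=lambda c: sum(trial[c].values()))
            trial[target][task] = time
        if movable and others:
            alloc = trial          # every task moved: drop this CPU
    return alloc


load = {"cpu0": {"t1": 10, "t2": 20},
        "cpu1": {"t3": 5},
        "cpu2": {"t4": 40}}
result = remove_lightly_loaded_cpus(load, capacity=50)
print(sorted(result))  # prints ['cpu0', 'cpu2']
```

Here cpu1 is emptied into cpu0 and removed, while cpu0 and cpu2 stay because their tasks no longer fit elsewhere.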
The next step is accelerator cost reduction. Iterative improvement substitutes CPU types but not accelerator types, so all accelerators left in the design are the fastest possible implementations. Our accelerator cost reduction procedure tries to reduce the accelerator cost by trying to replace the fastest accelerator implementation with the smallest accelerator implementation or the intermediate implementations.
We find opportunities to use slower accelerator implementations by looking for slack in the system schedule. Two types of slack can be used to slow down the accelerators and reduce their area:
• Global slack, which is the minimum slack between the deadline and the completion time of the tasks.
• Local slack, which is the minimum slack between the accelerator's completion time and the starting time of its successor tasks.
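Given a schedule, the two kinds of slack can be computed as in the sketch below. The dictionary-based schedule representation is an assumption of the example, as is the choice to report a local slack of zero for a task without successors.

```python
def global_slack(deadline, finish_times):
    """Minimum gap between the deadline and any task's completion time."""
    return min(deadline - f for f in finish_times.values())


def local_slack(accel, finish_times, start_times, successors):
    """Minimum gap between an accelerator task's completion time and the
    start times of its successors (0 if it has no successors)."""
    gaps = [start_times[s] - finish_times[accel] for s in successors[accel]]
    return min(gaps, default=0)


# Hypothetical schedule: D waits for C, so B finishes early.
start = {"A": 0, "B": 3, "C": 3, "D": 11}
finish = {"A": 3, "B": 8, "C": 11, "D": 14}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

print(global_slack(20, finish))               # prints 6
print(local_slack("B", finish, start, succ))  # prints 3
```

In this schedule an accelerator implementing B could be slowed by three time units without delaying D, and by a further six without missing the deadline.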
The cost reduction procedure is illustrated in Fig. 6. To start the accelerator cost reduction phase, the accelerators in the current design are ordered by D, the difference in area between the fastest accelerator implementation and the smallest accelerator implementation (line 1). We start from the accelerator with the largest D and try to replace its fastest implementation with its smallest implementation (line 3). If this is not feasible, we keep the fastest accelerator implementation (line 5).
1.  order the accelerators by D;
2.  for each accelerator, starting from the largest D
3.      replace the fastest accelerator with the smallest accelerator;
4.      if (meet deadline) use the smallest accelerator;
5.      else keep the fastest accelerator
6.  calculate the global slack;
7.  for each fastest accelerator i
8.      slow down accelerator i by (global slack + local slack i);
9.      for each other fastest accelerator
10.         slow down accelerator by its local slack
11.     call Monet to get the total accelerator cost COST(i)
12. select the minimal COST(i) and select corresponding accelerator implementations.
Fig. 6. The accelerator cost reduction procedure.
After the first loop (lines 2-5), each accelerator in the design uses either its fastest or its smallest implementation. The next step tries to reduce the cost of the accelerators that still use the fastest implementation by substituting intermediate implementations. First, the global slack is calculated (line 6); then, for each fastest accelerator, we calculate its local slack. We slow that accelerator down by (global slack + local slack) (line 8) and slow down the other fastest accelerators by their own local slack (line 10). If the new schedule meets the deadline, we call our accelerator performance analysis tool to get the cost of the current accelerator settings and save the cost and the settings as a candidate substitution (line 11). After the second loop (lines 7-11), we choose the candidate with the minimum cost, change the system permanently (line 12), and calculate the final schedule and allocation. We need to evaluate several alternative accelerators because the design space is not monotonic or regular, so in some cases a slightly faster implementation may also require less area.
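The two loops of this procedure can be sketched in Python under some simplifying assumptions. Each accelerator is described only by the execution time and area of its fastest and smallest implementations, and the schedule check, the slack calculations, and the area query are passed in as hypothetical callbacks (`meets_deadline`, `global_slack`, `local_slack`, and `area_at`, the last standing in for the call to Monet). None of these names comes from our implementation.

```python
def reduce_accelerator_cost(accs, meets_deadline, global_slack, local_slack, area_at):
    """Sketch of the accelerator cost reduction procedure.

    accs maps a name to {"fast": (time, area), "small": (time, area)}.
    Returns the chosen execution time for each accelerator.
    """
    times = {n: a["fast"][0] for n, a in accs.items()}

    # Lines 1-5: try the smallest implementation, largest area saving first.
    order = sorted(accs, key=lambda n: accs[n]["fast"][1] - accs[n]["small"][1],
                   reverse=True)
    small = set()
    for n in order:
        trial = {**times, n: accs[n]["small"][0]}
        if meets_deadline(trial):
            times = trial
            small.add(n)
    fastest = [n for n in accs if n not in small]

    # Lines 6-11: give the global slack to one fastest accelerator at a time.
    g = global_slack(times)
    candidates = []
    for i in fastest:
        trial = dict(times)
        for n in fastest:
            extra = g if n == i else 0
            trial[n] = times[n] + extra + local_slack(n, times)
        if meets_deadline(trial):
            cost = sum(area_at(n, t) for n, t in trial.items())
            candidates.append((cost, trial))

    # Line 12: keep the cheapest candidate, if any slowed design was feasible.
    if candidates:
        return min(candidates, key=lambda c: c[0])[1]
    return times
```

Every feasible candidate is evaluated before one is chosen because, as noted above, area does not vary monotonically with speed, so the cheapest setting cannot be predicted from the slack alone.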
In the last phase, we schedule and allocate the design. The scheduler in our algorithm [21] is similar to those designed by Sih and Lee [25] and by Li [7]. This allocation and scheduling algorithm is less expensive than other existing algorithms and gives an effective schedule; it can also balance the load on the hardware structure. This makes the algorithm suitable for use in the design space exploration of our co-synthesis algorithm, in which an allocation and scheduling algorithm is called repeatedly.
Experimental results
We have evaluated our co-synthesis algorithm with a C++ implementation of about 5000 lines of code. Our program calls Monet to evaluate accelerator designs. We also designed a graphical user interface written in Tcl/Tk. All of our experiments were run on a Pentium Pro 200. We used examples from related co-synthesis research to evaluate our algorithm: PP1 and PP2 are Prakash and Parker's example1 and example2 [9], and ex1 is one of Li's examples [7].
Since the PP1 and PP2 examples have no accelerator models, we first ran our algorithm with the initial solution constructed entirely from CPUs. Prakash and Parker's ILP approach is very time-consuming, but Li/Wolf [7] (an iterative improvement algorithm), COSYN [8] (a constructive algorithm), and our algorithm can each find a solution with the same cost as Prakash and Parker's optimal solution in less than 1 CPU second.
Our results are summarized in Fig. 7. We ran our algorithm with different accelerator implementations for PP2 and ex1.
Conclusions
Platform-based design is likely to dominate the next generation of SoC designs because it is a sound methodology for reusing hardware and software designs. However, with platforms constantly evolving, new design work always needs to be done. This paper has explored two aspects of the difficulties encountered when we do not have all the IP blocks that we would like in order to complete a platform-based design.
The first-generation dilemma is one example of a situation in which we need more information from synthesis and analysis tools in order to complete the design. The accelerator synthesis method we described is one way to solve this problem, since it closely couples multiple stages of synthesis in order to optimize the overall system design. We believe that synthesis tools like this will help to make platform-based design practical.
