As system-on-chip (SoC) designs become more complex, it is becoming harder to design communication architectures to handle the ever increasing volumes of inter-component communication. Manual traversal of the vast communication design space to synthesize a communication architecture that meets performance requirements becomes infeasible. In this paper, we address this problem by proposing an automated approach for floorplan-aware bus architecture synthesis (FABSYN) to synthesize cost-effective, bus-based communication architectures that satisfy the performance constraints in a design. Our synthesis approach incorporates a high-level floorplanning and wire delay estimation engine to evaluate the feasibility of the synthesized bus architecture and detect bus cycle time violations early in the design flow, at the system level. We present case studies of network communication SoC subsystems for which we synthesized bus architectures, detected and eliminated timing violations, and generated core placements in a matter of hours instead of several days for a manual effort.
specific performance requirements, is a very time-consuming process. This is due to the large exploration space created by customizable bus topologies, arbitration protocols, direct memory access (DMA) burst sizes, data bus widths, bus clock speeds, and buffer sizes, all of which significantly impact system performance [5], [12], [26].
To counter the challenge of ever increasing on-chip bandwidth requirements and a vast communication exploration space, early planning of the interconnect architecture at the system level must become an integral part of an SoC design process. However, the complex interplay between communication architecture parameters is becoming hard to analyze effectively, especially at the system level. Very often, designers end up evaluating the communication design space by creating simulation models annotated with detail based on experience, and manually iterating through different design configurations. Such an effort remains time consuming and produces systems which are generally overdesigned for the application at hand.
To address this problem, we propose a floorplan-aware bus architecture synthesis (FABSYN) approach in this paper, which automates the generation of a cost-effective communication architecture for an SoC. We make use of SystemC [23] to quickly capture components at the behavioral level and automate the bus architecture synthesis for the design. The novelty of our approach is in the ability to automatically satisfy performance constraints and detect bus clock cycle time violations, while synthesizing a feasible, low-cost configuration of a standard bus-based communication architecture (such as [2]) which is commonly used in SoC designs. Our approach synthesizes the bus topology, as well as values for bus architecture parameters such as arbitration priority orderings, data bus widths, bus clock speeds, and DMA burst sizes. We make use of a high-level floorplanning engine to generate estimates of core placements on the chip. Typically, once the system architecture is frozen, it takes several months before a floorplan of the design becomes available. Violations of bus clock cycle time constraints (described in more detail in Section III-E) detected late in the flow at the physical implementation stage can require changes in the architecture which can severely impact time-to-market. The bus architecture synthesis process determines the number and type of components assigned to each bus. This assignment decides the cumulative load capacitance on a bus, which in turn has a direct impact on signal delay and on whether the bus clock cycle time constraint can be satisfied (Section III-E). There is therefore a need to make the synthesis process more physically aware. Our high-level floorplanning and wire delay estimation engines detect bus cycle time violations early in the design flow at the system level, during the synthesis process, where architectural modifications and tradeoff analysis can be performed quickly and efficiently to eliminate such violations.
To demonstrate the usefulness of our approach, we present case studies of network communication SoC subsystems, used for data packet processing and forwarding. Compared to a manual effort which took several days and produced overdesigned systems, our automated flow synthesized low-cost bus architectures, detected and eliminated timing violations and generated core placements which satisfied performance constraints for the SoC subsystems in a matter of hours.
II. RELATED WORK
There is already a significant body of research in the area of bus architecture synthesis. Early work was aimed at minimizing bus width [6], interface synthesis and simple synchronization protocol selection [7], and topology generation for simple buses without arbitration [8]. Ryu et al. [9] performed studies to find optimal bus topologies for an SoC design. Pinto et al. [10] proposed an algorithm for constraint-driven topology synthesis under the assumption that relative positions of components were fixed. Lyonnard et al. [11] proposed a synthesis flow which supported shared bus and point-to-point connection templates. These templates have to be parameterized manually, which makes the process time consuming. Lahiri et al. [12] designed communication architectures after exploring different solutions using fast performance simulation. However, they assumed the bus topology to be given. Shin et al. [13] used a genetic algorithm for automating the generation of bus architecture parameters to meet performance requirements. However, they do not focus on bus topology synthesis. Our approach differs from these existing approaches in that we automate the synthesis of not only the bus topology, but also the generation of values for bus architecture parameters, while also satisfying performance constraints.
A key component of our synthesis flow is the integrated floorplanner. There have been other approaches in the past which have made use of a floorplanning tool [14]-[18] in a synthesis flow, but for different reasons. Bergamaschi et al. [18] and Thepayasuwan et al. [14] used the floorplanner to generate an early core placement estimate. Drinic et al. [15] used the floorplanner to determine feasibility of the synthesized design by comparing estimates of wire length with an upper bound on wire length. However, an upper bound on wire length has the disadvantage of not accounting for varying capacitive loads of the components. Hu et al. [16] also used the floorplanner to estimate wire length, which they used to calculate energy consumption in point-to-point networks. Dick et al. [17] invoked the floorplanner repeatedly in their custom bus topology synthesis approach to obtain global wiring delays and ensure that real-time deadlines were met. Unlike existing approaches, the floorplanner, in our approach, is used to identify and eliminate bus cycle time violations early in the design flow. We believe that this step will become increasingly important in the deep-submicrometer era as clock speeds increase and lengthy propagation delays cause frequent violations of timing constraints that will need to be detected and corrected early in the design flow if shrinking time-to-market constraints are to be met.
III. AUTOMATED BUS SYNTHESIS
This section describes our approach for automated bus architecture synthesis. Section III-A discusses how SoC performance requirements are represented in our approach. Section III-B presents our assumptions and states the problem description. Section III-C discusses the simulation engine while Section III-D describes communication parameter constraints, which guide the bus synthesis process. Section III-E gives an overview of our floorplan and wire delay calculation engines used for detecting timing violations in the design. Finally, Section III-F presents our automated bus architecture synthesis approach in detail.
A. SoC Performance Requirements
Typically, SoC designs need to satisfy performance requirements that are dependent on the nature of the application. The throughput of communication between components is a good measure of the performance of a system [8]. We assume that we are given one or more throughput constraints that need to be satisfied for the system. These constraints can involve communication between two or more IPs. Fig. 1 shows a Communication Throughput Graph (CTG), which is a directed graph where each vertex represents a component in the system, and an edge connects two components that need to communicate with each other. Each vertex contains information about the component it represents, such as its area, dimensions (fixed width/height or bounds on aspect ratio), capacitive loads on output pins, and which bus type it can be connected to: a main high-bandwidth bus like AHB [2], a peripheral low-bandwidth bus like APB [2], or both. An edge is associated with a throughput constraint if it lies within a throughput constraint path (TCP). Fig. 1 shows a TCP involving the CPU1, MEM1, S1, and M2 components, where the rate of data packets streaming out of M2 must not fall below 360 Mb/s. A TCP, in general, has a single master for which data throughput must be maintained, and other masters, slaves, and memories which are in the critical path that impacts the maintenance of the throughput.
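To make the CTG description concrete, the following is a minimal Python sketch of one possible in-memory representation. The class names, field names, the extra component S2, and the edge list are all hypothetical (Fig. 1 is not reproduced here); only the TCP membership (CPU1, MEM1, S1, M2) and the 360 Mb/s bound come from the text. The sketch also assumes that an edge "lies within" a TCP when both of its endpoints belong to that TCP, and it omits area, dimensions, and pin loads for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    name: str
    role: str          # "master", "slave", or "memory"
    bus_types: tuple   # buses it may attach to: "main", "peripheral", or both


@dataclass(frozen=True)
class TCP:
    components: frozenset        # names of the components on the path
    constrained_master: str      # master whose throughput must be maintained
    min_throughput_mbps: float


@dataclass
class CTG:
    components: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)   # directed (src, dst) pairs
    tcps: list = field(default_factory=list)

    def add_component(self, component):
        self.components[component.name] = component

    def add_edge(self, src, dst):
        for name in (src, dst):
            if name not in self.components:
                raise KeyError(f"unknown component: {name}")
        self.edges.add((src, dst))

    def constrained_edges(self, tcp):
        # Assumption: an edge lies within a TCP if both endpoints are in it.
        return {(s, d) for (s, d) in self.edges
                if s in tcp.components and d in tcp.components}


# Hypothetical graph; the edge list is illustrative, not taken from Fig. 1.
ctg = CTG()
for name, role in [("CPU1", "master"), ("M2", "master"),
                   ("MEM1", "memory"), ("S1", "slave"), ("S2", "slave")]:
    ctg.add_component(Component(name, role, ("main",)))
for src, dst in [("CPU1", "MEM1"), ("CPU1", "S1"), ("M2", "MEM1"), ("M2", "S2")]:
    ctg.add_edge(src, dst)

tcp = TCP(frozenset({"CPU1", "MEM1", "S1", "M2"}), "M2", 360.0)
ctg.tcps.append(tcp)
```

Here `ctg.constrained_edges(tcp)` excludes the M2 to S2 edge, because S2 is not part of the TCP.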
B. Problem Description
We are given an application for which we assume the HW/SW partitioning has already been performed. The resulting SoC design has possibly several hardware and software components (IPs) onto which application functionality has been mapped and which need to communicate with each other. The standard bus-based communication architecture (e.g., AMBA [2], CoreConnect [3]), which determines the pins at the IP interface and for which the bus topology and communication parameter values must be synthesized, is also specified. The IPs are assumed to be standard "black box" library components which cannot be modified during the bus synthesis process, except for the memory components.
The goal of the FABSYN communication architecture synthesis approach is to determine the number of buses and the allocation of SoC IPs on these buses (bus topology synthesis), and generate values for arbitration priorities, data bus widths, bus clock speeds, and DMA burst sizes (bus architecture parameter synthesis) for the selected standard bus-based communication architecture, while ensuring that all system throughput constraints are satisfied. In addition, we want to consider layout information of the chip to detect bus cycle time violations early in the design flow, so that we can modify the bus architecture to eliminate these violations which might otherwise take up costly design iterations later in the flow.
This leads us to our problem definition.
Problem Definition: A bus $B$ can be considered to be a partition of the set of components $V$ in a CTG, where $B \subset V$. Our primary objective is to determine a component-to-bus assignment for a hierarchical bus architecture, such that the partitioning of $V$ onto buses results in a minimal number of buses and satisfies bus cycle timing constraints, while meeting all performance requirements in the design, represented by the TCPs in a CTG. As a secondary objective, we attempt to reduce the clock speeds and data widths of the buses in the synthesized solution.
C. Simulation Engine
Since communication behavior is characterized by unpredictability due to dynamic bus requests from cores, nondeterministic bus contention delays, buffer overflow delays, etc., a simulation-based approach is necessary for accurate performance estimation. In our synthesis flow, we capture behavioral models of components and bus architectures in SystemC [23], and keep them in an IP library database. Since we were concerned about the speed of simulation, we chose a fast transaction-based, bus cycle-accurate modeling abstraction, which averaged simulation speeds of 150-200 kHz [5] while running embedded software applications on processor instruction-set simulator (ISS) models.
D. Communication Parameter Constraints
The exploration space for a typical SoC bus-based communication architecture such as AMBA [2] consists of combinations of bus topology configurations with communication parameter values for arbitration schemes, data bus widths, bus clock speeds, and DMA burst sizes. If we allow these parameters to have arbitrary values, an extremely large design space is created. The time required to simulate all possible system configurations in search of one which satisfies every design constraint would become unreasonably large, even with the fast simulation engine. More importantly, once we manage to find such a system configuration, there would be no guarantee that the values generated for the communication parameters would be practically feasible. To ensure that our synthesis approach generates a realistic communication architecture configuration, we allow the designer to specify a Communication Parameter Constraint set. These constraints are in the form of a discrete set of valid values for the communication parameters to be synthesized. A major motivation for this constraint specification is that it allows the designer to bias the synthesis process based on knowledge of the design and the technology being targeted. For instance, a designer might decide that the synthesized design should only have data buses with 16, 32, or 64 bit widths, because the IPs in the design cannot support larger widths effectively. Or a designer might set the allowable bus clock frequency to multiples of 33 MHz, with a maximum speed of 166 MHz, based on the operation frequency of the cores in the system and past experience of the clock generation mechanism. Such knowledge about the design is not a prerequisite for using our synthesis framework. As long as this constraint set is populated with any discrete set of values for the parameters, our framework will attempt to synthesize a feasible communication architecture.
However, informed decisions can greatly reduce the time taken for synthesis and help the designer generate a more practical system.
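As an illustration of how a discrete constraint set bounds the exploration space, the sketch below encodes the two examples from the text (bus widths of 16, 32, or 64 bits, and clock frequencies in multiples of 33 MHz up to 166 MHz) and enumerates the resulting per-bus configurations. The dictionary layout, the key names, and the DMA burst size values are assumptions made for the example, not values taken from the paper.

```python
from itertools import product

# Hypothetical constraint set for a single bus.
constraint_set = {
    "bus_width_bits": [16, 32, 64],
    # Multiples of 33 MHz that do not exceed 166 MHz.
    "bus_clock_mhz": list(range(33, 167, 33)),
    # Illustrative burst sizes; the paper does not list them here.
    "dma_burst_size": [1, 4, 8, 16],
}


def enumerate_configs(constraints):
    """Yield every combination of the discrete parameter values."""
    names = list(constraints)
    for values in product(*(constraints[n] for n in names)):
        yield dict(zip(names, values))


configs = list(enumerate_configs(constraint_set))
```

Even this small set yields 3 x 5 x 4 = 60 combinations for one bus, before arbitration orderings and topology choices are considered, which is why further pruning is needed during synthesis.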
E. Floorplanning and Delay Estimation Engines
The floorplanning stage in a typical design flow arranges arbitrarily shaped, but usually rectangular blocks representing circuit partitions, into a nonoverlapping placement while minimizing a cost function, which is usually some linear combination of die area and total wirelength. Our floorplanning engine is adapted from the simulated annealing based floorplanner proposed in [19]. The input to the floorplanner is a list of components and their interconnections in the system. Each component has an area associated with it (obtained from RTL synthesis). Dimensions in the form of width and height (for "hard" components) or bounds on aspect ratio (for "soft" components) are also required for each component. Additionally, maximum die size and fixed locations for hard macros can also be specified as inputs. Given these inputs, our floorplanner minimizes the cost function

$$\text{Cost} = W_1 \cdot \text{Area} + W_2 \cdot L_{\text{Bus}} + W_3 \cdot L_{\text{Total}} \qquad (1)$$

where Area is the area of the chip, $L_{\text{Bus}}$ is the wire length corresponding to wires connecting components on a bus, $L_{\text{Total}}$ is the total wire length for all connections on the chip (including inter-bus connections), and $W_1$, $W_2$, $W_3$ are adjustable weights which are used to bias the solution. The floorplanner outputs a nonoverlapping placement of components from which the wire lengths can be calculated by using the half-perimeter of the minimum bounding box containing all terminals of a wire (HPWL) [20].
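The wire length estimate and cost evaluation described above can be sketched as follows. The HPWL function follows the standard definition (half the perimeter of the minimum bounding box around all terminals). The cost function is written as a weighted sum of area, bus wire length, and total wire length, which is an assumption about the exact form of (1); the function and parameter names are hypothetical.

```python
def hpwl(terminals):
    """Half-perimeter wire length of the bounding box around (x, y) terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def floorplan_cost(area, bus_wirelength, total_wirelength,
                   w_area, w_bus, w_total):
    """Weighted-sum cost (assumed form): larger weights bias the annealer
    toward minimizing the corresponding term."""
    return (w_area * area
            + w_bus * bus_wirelength
            + w_total * total_wirelength)


# Three pins of one bus wire, with coordinates in arbitrary units.
wire_length = hpwl([(0, 0), (3, 4), (1, 2)])
cost = floorplan_cost(10.0, wire_length, 20.0, w_area=1.0, w_bus=2.0, w_total=0.5)
```

Giving `w_bus` the largest weight corresponds to prioritizing short wires between components that share a bus.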
Once the wire lengths have been calculated, the delay estimation engine is invoked. The wire delay is calculated based on formulations proposed in [21]. The inputs to this stage are the wire lengths from the floorplanner and the capacitive loads of component output pins (obtained from RTL synthesis). We can simplify the multiple pin net problem (which is representative of a bus line) depicted in Fig. 2(a) to multiple two pin net problems, as shown in Fig. 2(b). Then the delay for a wire of length $l$, with optimal wire sizing (OWS) [21], is given as

$$T = \left( \frac{\alpha_1 l}{W^2(\alpha_2 l)} + \frac{2 \alpha_1 l}{W(\alpha_2 l)} + R_d c_f + \sqrt{R_d \, r \, c_a \, c_f \, l} \right) l \qquad (2)$$

where $\alpha_1 = \frac{1}{4} r c_a$, $\alpha_2 = \frac{1}{2} \sqrt{\frac{r c_a}{R_d C_L}}$, and $W(x)$ is Lambert's function, defined as the value of $w$ which satisfies $w e^{w} = x$.
$R_d$ is the resistance of the driver, $l$ is the wire length, and $C_L$ are capacitive loads which are calculated as shown in Fig. 2(c). The rest of the parameters are dependent on the process technology used: $r$ is the sheet resistance in $\Omega/\square$, $c_a$ is unit area capacitance in fF/$\mu$m$^2$, and $c_f$ is unit fringing capacitance in fF/$\mu$m (defined to be the sum of fringing and coupling capacitances). The values for these technology dependent parameters are listed in Table I, and have been calculated from [22].
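Evaluating the delay expression requires Lambert's function, that is, the value $w$ that solves $w e^{w} = x$. The following is a minimal standard-library sketch that computes it with Newton's method for non-negative arguments. The function name, tolerance, and starting guess are choices made for this example and are not part of the original flow.

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch of Lambert's function for x >= 0.

    Solves w * exp(w) = x with Newton's method, starting from log(1 + x).
    """
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        # Newton step for f(w) = w * exp(w) - x, with f'(w) = exp(w) * (w + 1).
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


w_example = lambert_w(5.0)
```

A quick sanity check is that `lambert_w(math.e)` is 1 up to rounding, since $1 \cdot e^{1} = e$.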
The delay estimation engine is ultimately used to check for bus cycle time violations in the design. This is illustrated through an example. Fig. 3 shows a floorplan for a system where IP1 and IP2 are connected to the same bus as ASIC1, Mem4, ARM, VIC, and DMA, and the bus has a speed of 333 MHz. This implies that the bus cycle time is 3 ns. For a 0.13-µm process and a driver resistance value of 0.4 kΩ, the floorplanner finds a wire length of 9.9 mm between pins connecting the two IPs to the bus, with p and p for the wire. The wire delay, obtained by inserting these values in (2), is found to be 3.5 ns, which clearly violates the bus clock cycle time constraint of 3 ns. In this way, our floorplanning and wire delay estimation engines can determine if a synthesized design has buses with clock cycle timing violations. Typically, once such violations are detected at the physical implementation stage in the design flow, designers end up pipelining the buses by inserting latches, flip-flops, or register slices on the bus, in order to meet bus cycle time constraints. However, we found that such pipelining of the bus can not only have an adverse effect on critical path performance, but also requires tedious manual reworking of RTL code and extensive reverification of the design, which can be very time consuming. As we will show later, our synthesis flow attempts to automatically eliminate such violations early in the design flow at the system level once they are detected.
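The violation check in this example reduces to comparing the estimated wire delay with the bus clock period. A small sketch follows, using the numbers from the example (333 MHz bus, 3.5 ns wire delay); the function names are hypothetical.

```python
def bus_cycle_time_ns(clock_mhz):
    """Clock period in nanoseconds for a bus clock given in MHz."""
    return 1000.0 / clock_mhz


def violates_cycle_time(wire_delay_ns, clock_mhz):
    """True if a signal cannot cross the wire within one bus clock cycle."""
    return wire_delay_ns > bus_cycle_time_ns(clock_mhz)


def max_clock_mhz(wire_delay_ns):
    """Highest bus clock for which the given wire delay still fits in a cycle."""
    return 1000.0 / wire_delay_ns


period = bus_cycle_time_ns(333)          # about 3 ns
violation = violates_cycle_time(3.5, 333)
```

For the 3.5 ns delay in the example, `max_clock_mhz(3.5)` is a little under 286 MHz, so the violation can be removed either by lowering the bus clock or by reducing the load and wire length on the bus.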
F. Synthesis Approach
In this section, we describe our bus architecture synthesis approach. First, we will present a few definitions that will be used later when explaining the synthesis flow in more detail.
Definitions: Let $CTG(V, E)$ be a Communication Throughput Graph, where $V$ is the set of vertices, each of which represents a component (a master or a slave) in the design, and $E$ is the set of edges used to connect the components in $V$ that need to communicate with each other. $S$ is the set of slave components in $V$, where $S \subset V$. $\mathit{Mem}$ is the set of memory components in $S$, such that $\mathit{Mem} \subset S$. $L$ is the set of slave leaf components (i.e., slave components with a single incident edge connecting them to a single master component) in the CTG, with $L \subset S$, where $master(l)$ refers to the master connected to the leaf component $l \in L$. Next, let $T$ be the set of all throughput constraint paths in a CTG, where each TCP $t \in T$ is itself a set of vertices representing the components that are part of the TCP, as discussed previously in Section III-A. $M_t$ is the set of master components and $S_t$ is the set of slave components in the constraint path $t$, such that $t = M_t \cup S_t$.

We now describe our automated synthesis approach in detail. Fig. 4 gives a high-level overview of the flow. The inputs to the flow include a Communication Throughput Graph, a target bus-based communication architecture (e.g., AMBA), a set of Communication Parameter Constraints, and a library of behavioral IP models. The general idea is to first perform preprocessing transformations on the CTG to improve the performance of the entire system (preprocess) and then map all the components from the CTG to a simple bus topology of the target bus-based communication architecture. Then, we iteratively select a Throughput Constraint Path (TCP) from set $T$, starting from the TCP with the most stringent constraint, and search the communication parameter space for a suitable parameter configuration (explore_params), performing topology mutations if needed (mutate_topology), until the TCP constraint is satisfied. Once all TCP constraints are satisfied, we optimize the design (optimize_design) to further lower the cost of the system. Next, we invoke the floorplanning and delay estimation engines to detect bus cycle time violations. If timing violations are detected, we update $T$ with the TCPs having components on the buses with violations, and use a feedback loop to re-enter the flow and repeat the topology mutation and parameter exploration phase to eliminate these violations. Once there are no violations, we output the synthesized system and floorplan.

Fig. 5 shows the pseudocode for the preprocess stage. In the first step, we map the components in the CTG from the behavioral IP library database to a bus protocol-independent, transaction-level simulation model in SystemC [24] having a virtual channel for every edge in the graph. This model has no contention since there are no shared channels and also because we assume infinite ports at IP interfaces.
The purpose of this step is to obtain, through simulation, a memory usage profile (Step 2). Once we have obtained this profile, we attempt to split those memory nodes for which different masters access nonoverlapping regions (Step 3). Finally, we merge local slave nodes with their master nodes to reduce contention and loading on shared buses (Step 4). Note that we perform Step 3 before Step 4 because it allows us to generate local memories which can then be merged with their corresponding masters. Fig. 6(a)-(c) illustrates this process. The CTG shown in Fig. 6(a) is taken through the preprocess procedure and the MEM2 node is split, as shown in Fig. 6(b), into two nodes (MEM2a and MEM2b), since CPU1 accesses a region of memory which is distinct from that accessed by masters M2 and M3. Finally, the leaf slave nodes for CPU1 (slave nodes MEM2a and S4) are merged with CPU1 into a hypernode, as shown in Fig. 6(c).
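The two graph transformations in the preprocess stage (memory splitting and leaf-slave merging) can be sketched as below. This is a simplified illustration, not the pseudocode of Fig. 5: it assumes each master accesses one contiguous, half-open address range per memory, and the address values and function names are hypothetical. The component names follow the Fig. 6 example.

```python
def split_memory(access_ranges):
    """Group masters whose accessed address ranges overlap.

    access_ranges maps master -> (lo, hi), a half-open range taken from the
    memory usage profile. Each returned group would share one split memory.
    """
    groups = []
    current_hi = None
    for master, (lo, hi) in sorted(access_ranges.items(), key=lambda kv: kv[1]):
        if groups and lo < current_hi:
            groups[-1].add(master)          # overlaps the running region
            current_hi = max(current_hi, hi)
        else:
            groups.append({master})         # starts a new, disjoint region
            current_hi = hi
    return groups


def merge_leaf_slaves(edges, slaves):
    """Map each master to the leaf slaves (single incident edge) it absorbs."""
    incident = {}
    for a, b in edges:
        incident.setdefault(a, []).append(b)
        incident.setdefault(b, []).append(a)
    hypernodes = {}
    for slave in sorted(slaves):
        neighbours = incident.get(slave, [])
        if len(neighbours) == 1:
            hypernodes.setdefault(neighbours[0], []).append(slave)
    return hypernodes


# MEM2 profile with illustrative addresses: CPU1 uses a region of its own.
mem2_groups = split_memory({
    "CPU1": (0x0000, 0x4000),
    "M2": (0x4000, 0x6000),
    "M3": (0x5000, 0x8000),
})

# Edges after the split: MEM2a is private to CPU1, MEM2b is shared.
edges = [("CPU1", "MEM2a"), ("CPU1", "S4"), ("M2", "MEM2b"), ("M3", "MEM2b")]
hypernodes = merge_leaf_slaves(edges, {"MEM2a", "MEM2b", "S4"})
```

Running the split first is what exposes MEM2a as a leaf node, so that the merge step can fold it into the CPU1 hypernode together with S4.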
After the preprocess stage, all the components in the enhanced CTG and the selected bus architecture are mapped from the IP library database to the fast transaction-based bus cycle-accurate simulation model (Section III-C) with a simple bus topology: a single shared main bus and a single shared peripheral bus. As mentioned earlier, every node in a CTG has information relating to the type of bus it can be connected to, which guides the mapping process. A bus $B$ can be considered to be a partition of the nodes $V$ in a CTG, such that $B \subset V$. Fig. 6(d) shows the mapped components on the main and peripheral bus partitions, for the preprocessed CTG in Fig. 6(c).
Once the simple topology has been created, we select the largest unsatisfied TCP constraint from the set of TCPs and search for a suitable combination of communication parameter values to satisfy the constraint in the explore_params stage (Fig. 4). Fig. 7 gives the pseudocode for this procedure. The explore_params procedure searches for a suitable combination of parameter values which satisfies the TCP constraint under consideration, for the current bus topology. The parameter values are bounded by the Communication Parameter Constraint set specified by the designer. However, the exploration space arising from the combinations of the bounded values can still be very large. In the interest of achieving practical running times, we must further prune this space.
We start by decoupling the bus widths and speeds from the arbitration schemes and DMA burst sizes. We set the bus widths and speeds to the maximum allowed values set by the designer in the Communication Parameter Constraint set (Step 1). We do this because if TCP constraints are not met for the maximum values of bus widths and speeds, they will certainly not be met for lower values of these parameters. We cannot, however, set the DMA burst size to its maximum value and the arbitration priority to a fixed value, and make the same guarantee. Therefore, Step 1 allows us to quickly prune only the bus width and speed parameter space. Next, we select a combination of a valid arbitration priority ordering and DMA burst size, and then proceed to simulate the design (Steps 2 and 3). The best result configuration in Step 3 is the combination of parameters for which the least number of TCP constraints are violated and the throughput for the TCP being considered is the highest. The set of valid arbitration priorities is governed by the following rules: a) priorities of masters in TCPs with larger throughput constraints are always greater than priorities of masters in TCPs with lower throughput constraints; b) once a TCP constraint is satisfied, the relative arbitration priority ordering for masters in the TCP is updated (Step 5) and not changed anymore; and c) only combinations of priority orderings within the TCP under consideration need to be explored if the previous two rules are followed. These three rules reduce the large arbitration space and make it more manageable. The set of valid DMA burst sizes is governed by the following rule: once a TCP constraint is satisfied, only those DMA burst size values which did not violate the satisfied TCP constraint are considered for subsequent TCPs. Thus, as TCP constraints are satisfied, the set of valid DMA burst size values shrinks, reducing the DMA burst size exploration space. Fig. 7 shows how, once a TCP constraint is satisfied, we simulate the design for different DMA burst size values to generate an updated set of allowed DMA burst sizes (Step 6), which will be used for subsequent TCP explorations.
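The pruning rules for arbitration priorities and DMA burst sizes can be illustrated with a short sketch. It is not the explore_params pseudocode of Fig. 7: the function names, the master names, and the `still_satisfied` predicate (a stand-in for re-simulating the design) are hypothetical. The sketch assumes that masters whose relative order was fixed by earlier, more stringent TCPs keep their higher priority, so only the remaining masters of the current TCP are permuted.

```python
from itertools import permutations


def candidate_configs(fixed_order, tcp_masters, allowed_bursts):
    """Yield (priority ordering, DMA burst size) pairs to simulate.

    fixed_order holds masters frozen by previously satisfied TCPs
    (highest priority first); they are never reordered.
    """
    free = [m for m in tcp_masters if m not in fixed_order]
    for perm in permutations(free):
        for burst in allowed_bursts:
            yield tuple(fixed_order) + perm, burst


def prune_bursts(allowed_bursts, still_satisfied):
    """Keep only burst sizes under which the satisfied TCP still holds."""
    return [b for b in allowed_bursts if still_satisfied(b)]


# DMA was fixed by an earlier TCP; ARM and USB are free to be permuted.
configs = list(candidate_configs(["DMA"], ["ARM", "DMA", "USB"], [4, 8, 16]))

# Stand-in for simulation: pretend only bursts up to 8 keep the TCP satisfied.
remaining_bursts = prune_bursts([4, 8, 16], lambda burst: burst <= 8)
```

In this example only 2 orderings times 3 burst sizes are simulated instead of all 6 orderings of three masters, and the burst size set shrinks for every later TCP.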
If the TCP constraint is not satisfied for any combination of communication parameter values, we attempt to change the communication topology in the mutate_topology stage. Fig. 8 shows the pseudocode for this procedure. To meet TCP constraints, we need to eliminate conflict on shared buses, and this can be done by creating a new bus and migrating IPs, from the TCP being considered, iteratively to the new bus until the conflict is resolved.
In mutate_topology, we first check whether this is the first time that the procedure has been called. If so, we create a new bus, choose an unselected master at random, and migrate that master to the new bus (Steps 2 and 6). On the first call, none of the masters in the TCP have previously been selected for migration, so the function call NoneSelected returns a true value. In subsequent invocations of mutate_topology, we iteratively migrate the slaves in the TCP to the new bus (Steps 3 and 7). The function call AllSelected returns a false value if there are any remaining slaves in the TCP which have yet to be selected for migration. Once all slaves in the TCP have been considered for migration and the TCP is still not satisfied, we check for unselected masters in the current TCP (Step 4). If there are still unselected masters remaining, we undo all slave migrations since the last master migration by calling UndoNodeMigration, mark the slaves as being unselected, and migrate a randomly chosen, previously unselected master to the new bus (Steps 4 and 6). In subsequent invocations of mutate_topology, we again migrate the slaves to the new bus (Steps 3 and 7). After all masters and slaves in the current TCP have been moved to the new bus, or at least considered for migration, it is possible that the TCP constraint is still not met (Step 5). In that case, we mark all the masters and slaves in the TCP as unselected, randomly select a master on the previously created bus and permanently assign it to that bus, create another bus, and, starting from a randomly selected master (or a randomly selected slave if there are no more masters to migrate), iteratively migrate IPs to that bus (Steps 5 and 6). In this way, new buses are created until enough bandwidth is available to satisfy the TCP constraint.

Fig. 7. explore_params procedure.
Fig. 9. optimize_design procedure.
Note that if a topology mutation causes the best result configuration from explore_params to violate any previously satisfied TCP constraints, we undo the mutation (Step 1). Otherwise we keep the mutation, even if it deteriorates current TCP performance slightly. This allows us to take into account the effect of local minima in the exploration phase. Fig. 6(e)-(h) illustrates the topology mutation process, starting from the simple bus mapping in Fig. 6(d). The components in the TCP are shown in gray. The result of the first invocation of mutate_topology is shown in Fig. 6(e), which depicts a newly created bus onto which the CPU1 master has been migrated. Subsequent calls to the procedure iteratively migrate the rest of the components in the TCP to the new bus. However, the TCP constraint is not satisfied for any of the intermediate topologies, due to data traffic conflicts on both the main1 and main2 buses, even when all the components in the TCP have been migrated to a separate bus, as shown in Fig. 6(f). Therefore, we proceed to create another bus (main3) and first migrate a master (CPU1), as shown in Fig. 6(g), followed by slaves in the TCP. For the configuration shown in Fig. 6(h), after MEM1 has been migrated to the new bus, the throughput constraint is found to be satisfied, and no more topology mutation is required, unless there is a timing violation detected by the floorplanning and wire delay estimation engine later in the flow (Fig. 4).
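The order in which mutate_topology tries candidate bus contents for a single new bus can be sketched as a generator. This is a simplified reading of the procedure, not the pseudocode of Fig. 8: it covers only one new bus (it does not model the creation of further buses or the undo of mutations that break earlier TCPs), it visits slaves in the order given, and the function name and component names are hypothetical.

```python
import random


def migration_steps(masters, slaves, rng=None):
    """Yield the set of TCP components on the new bus after each invocation.

    A randomly chosen master is migrated first, then slaves one at a time.
    When all slaves have been tried, the slave migrations are undone and the
    next unselected master is added, after which slaves are migrated again.
    """
    rng = rng or random.Random()
    order = list(masters)
    rng.shuffle(order)
    migrated_masters = []
    for master in order:
        migrated_masters.append(master)
        migrated_slaves = []                 # undo earlier slave migrations
        yield frozenset(migrated_masters)
        for slave in slaves:
            migrated_slaves.append(slave)
            yield frozenset(migrated_masters + migrated_slaves)


steps = list(migration_steps(["CPU1", "M2"], ["MEM1", "S1"], random.Random(0)))
```

In the real flow each yielded topology would be passed to explore_params, and the iteration would stop as soon as the TCP constraint is met.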
Once all the TCP constraints are satisfied, we arrive at the optimize_design stage. The pseudocode for this stage is shown in Fig. 9. The purpose of this stage is to reduce the maximum values we selected earlier for bus widths and bus clock speeds. Here we iteratively consider each bus in the system and attempt to lower the value for data bus width (Step 2) and bus clock speed (Step 4), without violating any TCP constraints. Reducing the bus width reduces the number of wires in a bus and lowers the cost of the system. Reducing the bus speed, on the other hand, reduces the probability of a bus cycle time violation, since it lengthens the bus clock cycle time period. The order in which the bus width or the bus speed is reduced is flexible and is left to the designer.
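The parameter lowering in optimize_design can be sketched as a greedy step-down over the designer's discrete values. The sketch is not the pseudocode of Fig. 9: the function name, the allowed value lists, and the `meets_constraints` predicate (a stand-in for simulating the design against all TCP constraints) are hypothetical. It stops at the first failing value, which assumes that a lower width or speed never performs better than a higher one.

```python
def lower_parameter(current, allowed, meets_constraints):
    """Return the smallest allowed value reachable by stepping down from
    `current` while every intermediate value still meets all constraints."""
    best = current
    for value in sorted((v for v in allowed if v < current), reverse=True):
        if not meets_constraints(value):
            break
        best = value
    return best


# Illustrative value sets and predicates standing in for simulation results.
clock_mhz = lower_parameter(200, [33, 66, 100, 133, 166, 200],
                            lambda mhz: mhz >= 133)
width_bits = lower_parameter(64, [16, 32, 64], lambda bits: bits >= 32)
```

Whether the clock speed or the bus width is lowered first is a designer choice; calling `lower_parameter` for the clock first favors longer cycle times.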
Next, we pass the optimized system through our floorplanning and wire delay estimator engine. For the system shown in Fig. 6, we pass the final modified CTG shown in Fig. 6(h) to the engine. If a timing violation is detected (as discussed in Section III-E), the set of TCPs is updated with the TCPs which have components on the buses with violations, and we use a feedback loop to go back and attempt to eliminate these violations. Since the cumulative capacitive load of components directly contributes to increasing signal propagation delay (Section III-E), we attempt to reduce the number of components on the bus having a violation. Therefore, when we go back into the flow using the feedback loop, we first select the TCP from this set which has components on the violated bus with the largest load capacitance on its pins, and iteratively migrate them to another existing bus (or a new bus if migration to existing buses causes TCP constraint violations). If there is still a violation, we select another TCP from the set and migrate components from that TCP away from the violated bus. We also give higher priority to reducing bus clock speed over reducing data bus width in the optimize_design stage, since reducing bus clock speed improves the probability of meeting the bus clock cycle period constraint. Note that the solution is guaranteed to converge when we use a feedback path. This is because, in the worst case, we end up creating a new bus (to migrate components away from the violated bus), which increases the cost of the system, but as a tradeoff we get improved system performance (even after we consider bridge overhead delays) and the ability to meet bus cycle time constraints.
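The selection of which TCP to migrate away from a violated bus can be sketched as follows. The text does not specify how loads are aggregated, so this sketch assumes the TCP is ranked by the total pin load capacitance of its components that sit on the violated bus. The function name, the load values, and the bus contents are hypothetical; the component names are borrowed from the first case study.

```python
def pick_tcp_to_migrate(bus_components, tcps, load_pf):
    """Pick the TCP whose components on the violated bus carry the most load.

    tcps maps a TCP name to its set of components; load_pf maps a component
    to its pin load capacitance. Returns None if no TCP touches the bus.
    """
    def load_on_bus(name):
        return sum(load_pf[c] for c in tcps[name] if c in bus_components)

    candidates = [name for name, comps in tcps.items()
                  if any(c in bus_components for c in comps)]
    if not candidates:
        return None
    return max(candidates, key=load_on_bus)


violated_bus = {"ARM926", "ASIC1", "RAM3", "EXT_IF", "SWITCH", "RAM2"}
tcps = {
    "encryption": {"ARM926", "ASIC1", "RAM3", "EXT_IF"},
    "usb": {"USB", "RAM1", "ARM926", "DMA", "SDRAM_IF"},
}
# Illustrative pin loads in pF.
load_pf = {"ARM926": 0.4, "ASIC1": 0.6, "RAM3": 0.5, "EXT_IF": 0.3,
           "SWITCH": 0.2, "RAM2": 0.5, "USB": 0.3, "RAM1": 0.5,
           "DMA": 0.2, "SDRAM_IF": 0.4}

selected = pick_tcp_to_migrate(violated_bus, tcps, load_pf)
```

Components of the selected TCP would then be migrated one at a time, re-running the delay check after each move, until the violation disappears.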
Finally, after all violations have been resolved and all TCP constraints satisfied, we output the final synthesized bus topology, parameter values for bus speeds, data bus widths, DMA burst size and arbitration priority ordering, along with the feasible floorplan. For the system shown in Fig. 6 , the final synthesized architecture looks like the one shown in Fig. 6(i) .
IV. CASE STUDIES
We applied our automated bus-based communication architecture synthesis approach to three industrial-strength designs from the network communication domain. In the first case study, we selected a network communication SoC subsystem used for fast data packet processing and forwarding. Fig. 10 shows the CTG for this system. There are two data manipulation related TCP constraints that must be satisfied in this system. The first TCP involves the encryption engine and includes the ARM926, ASIC1, RAM3 and EXT_IF blocks. The EXT_IF block fetches data and stores it in RAM3. The ASIC1 and ARM926 blocks fetch nonoverlapping sections of the data, process them, and store them back in RAM3, from where the EXT_IF block fetches and streams them out at a minimum rate of 200 Mb/s. The second TCP involves the USB subsystem. Data packets received at the USB are routed to RAM1. The ARM926 reads this data, processes it, and stores it back to RAM1, from where the DMA engine transfers it to SDRAM_IF, which streams it out at a minimum rate of 480 Mb/s. There is also a third subsystem which involves the SWITCH, RAM2 and ARM926 components. However, this is a very low priority data path which has no data rate constraint from the designer, and, therefore, we do not classify it as another TCP to be satisfied. Table II shows the Communication Parameter Constraint set for this case study. The target communication architecture for the automated synthesis is the AMBA2 high performance AHB bus and a low bandwidth APB bus [2]. For the floorplanner, we give maximum priority to minimizing wire length for components on a bus, and equal lower priorities for area and total wire length minimization. Fig. 11 shows the final output of our synthesis flow: a synthesized architecture which meets all throughput and timing constraints. The values for the generated communication parameters are given in Table III and the final floorplan for this system is shown in Fig. 12.
The automated synthesis engine initially created two AHB buses, with the SWITCH and RAM2 components connected to AHB1, which was assigned a clock speed of 200 MHz to meet the encryption path throughput constraint. However, the floorplanning engine detected a cycle time violation for the bus due to excessive capacitive loading. The topology_mutate stage then split the shared AHB bus and assigned the ARM926, ASIC1, and EXT_IF masters and their associated slaves to one bus, and the SWITCH and RAM2 components to another AHB bus, to reduce capacitive loading. Finally, the optimize_design function reduced the bus speeds for the AHB buses from 200 to 133 MHz and the APB bus to 66 MHz, to lower the cost of the system. Both throughput constraints were still met at these lower bus speeds. The synthesis engine assumed a 133-MHz bus speed for AHB3 to simplify the design of BRIDGE3 to AHB1, but a designer can choose to further lower the AHB3 bus speed if a more complex bridge is acceptable.
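The clock speed reduction performed by optimize_design can be pictured as a search for the slowest allowed bus clock at which the constraints still hold. The Python sketch below is only an illustration: the function name, the candidate list, and the toy predicate are hypothetical, and in the actual flow constraint satisfaction is established by evaluating the design rather than by a closed-form test.

```python
def lowest_feasible_clock(candidates_mhz, meets_constraints):
    """Return the slowest candidate clock (MHz) that satisfies the predicate,
    or None if no candidate does."""
    for clock in sorted(candidates_mhz):
        if meets_constraints(clock):
            return clock
    return None


# Candidate speeds are the ones mentioned in the case study; the threshold
# in the toy predicate is invented purely for illustration.
picked = lowest_feasible_clock([200, 133, 66], lambda mhz: mhz >= 100)
print(picked)
```

Trying candidates in ascending order reflects the goal stated in the text: a lower bus speed lowers system cost, so the first speed that still meets all throughput constraints is preferred.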
For our second case study, we considered a derivative of the network communication subsystem from Fig. 10, which extends and partially modifies the functionality of the previous system. Fig. 13 shows this derivative architecture, which has an additional TCP constraint involving the ARM926, SWITCH, RAM2, and two newly added components: a memory array (RAM4) and an ASIC block (ASIC2). In this TCP, data packets received from the SWITCH are stored in RAM2. These packets are retrieved by ASIC2, which reads and modifies some protocol header information before storing it back to RAM4 from where the SWITCH must stream it out at a minimum data rate of 3.2 Gb/s. The ARM926 is used minimally, for directing data flow in this TCP.
The Communication Parameter Constraint set for this case study is shown in Table IV and is slightly modified from Table II, with the addition of a larger data bus width value of 64, to handle the increased bandwidth requirements. Also, instead of using the AMBA2 AHB bus architecture as in the previous case, we change the target communication architecture to AMBA3 AXI [25]. Our synthesis flow outputs the architecture shown in Fig. 14. The values for the generated communication parameters are shown in Table V and the final floorplan is shown in Fig. 15. Since AXI supports separate channels for reads and writes, the bus speeds required to maintain throughput are lower (100 MHz). The AXI3 bus, which supports the SWITCH TCP, has a 64-bit data width and a high 200-MHz bus clock speed in order to maintain the high data flow rate.
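As a rough sanity check on these numbers, the raw peak bandwidth of a bus is its data width multiplied by its clock frequency. The short Python sketch below applies this to the 64-bit, 200-MHz bus and the 3.2-Gb/s requirement quoted above. The function name is hypothetical, and the figure is an upper bound only: it ignores arbitration, protocol overhead, bridge delays, and the fact that each packet in this TCP is transferred more than once (from the SWITCH to RAM2, through ASIC2, into RAM4, and back out through the SWITCH).

```python
def peak_bandwidth_gbps(width_bits, clock_mhz):
    """Raw peak bandwidth in Gb/s: one data word per bus clock cycle."""
    return width_bits * clock_mhz * 1e6 / 1e9


peak = peak_bandwidth_gbps(64, 200)   # 64-bit data bus at 200 MHz
headroom = peak / 3.2                 # relative to the 3.2 Gb/s stream rate
print(f"peak = {peak:.1f} Gb/s, headroom = {headroom:.1f}x")
```

The raw peak comes to 12.8 Gb/s, four times the required output rate, which suggests why the wider data bus and higher clock are needed once the repeated transfers and overheads are taken into account.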
For the third case study, we chose a multiprocessor system (MPSoC) networking subsystem. Fig. 16 shows the CTG for the system. For clarity, the TCPs are presented separately in Table VI. The Communication Parameter Constraint set is shown in Table VII. The target communication architecture for the synthesis process is the AMBA2 AHB bus architecture.
ARM1 is a protocol processor (PP) while ARM2 is a network processor (NP). The ARM1 PP is responsible for setting up and closing network connections, converting data from one protocol type to another and exchanging data with the NP using shared memory. The ARM2 NP directly interacts with the network ports and is used for assembling incoming packets into frames for the network connections, network port packet/cell flow control, keeping track of errors, and gathering statistics. The ASIC1 block performs hardware cryptography acceleration, while ASIC2 and ASIC3 are used for other data packet and frame processing. The DMA is used to handle fast memory to memory and network interface data transfers, freeing up the processors for more useful work.
The synthesis process first generated the system shown in Fig. 17(a). However, once we passed the architecture through the floorplanning and wire delay estimation stage, it was discovered that the system was not feasible because of the excessive cumulative load capacitance on the AHB1 bus, which caused a timing violation. Fig. 17(b) shows the floorplan layout for this configuration. The synthesis process records this violation, and resynthesizes the communication architecture to come up with the architecture shown in Fig. 17(c) with a reduced capacitive loading on AHB1 while still satisfying all TCP constraints. This architecture does not violate any bus cycle time constraints and the final floorplan is shown in Fig. 17(d). Note that the synthesis process splits the SDRAM2 and MEM4 components, moving portions of both these components to a local bus of the ARM2 processor. This reduces unnecessary traffic and capacitive loading on the shared AHB bus. The synthesized communication parameter values are shown in Table VIII. Since most of the streamed data was native 32 bits, a higher 64-bit bus width did not affect the performance significantly and the synthesized buses all have 32-bit data bus widths.
We now compare the quality of our synthesis process. Since none of the existing synthesis approaches are aimed at detecting bus cycle time violations early in the design flow, there is no direct point of comparison. We chose to compare the quality of our synthesized designs with an approach which maps all the components in the application to a single main/peripheral shared bus (initial), an automated bus architecture synthesis flow which does not use a high level floorplanner (ABS), and a manually intensive, high level synthesis effort by a designer which also makes use of a floorplanning and wire delay estimation engine to detect timing violations (manual) just like our floorplan-aware automated bus architecture synthesis approach (FABSYN). The manual synthesis approach involves a designer manually selecting a combination of bus topology and communication parameter values, simulating the high level design models in SystemC and then iteratively modifying the bus architecture and parameter values based on the simulation results and designer intuition, until all constraints are found to be satisfied. Table IX compares the results from our synthesis approach for the three case studies with the results from the other approaches. The initial approach is unable to satisfy any of the TCP constraints for all three of the case studies, because of excessive data traffic conflicts on its restricted number of buses. In contrast, the ABS approach does manage to satisfy TCP constraints for all the case studies, but in each case it synthesizes a bus architecture with bus clock cycle time violations that remain undetected, and, thus, the synthesized architecture is not feasible in each case. The manual approach satisfies all TCP constraints and is also able to detect and eliminate bus clock cycle time violations in the design, just like our FABSYN approach. However, there are a few key differences between the manual approach and our FABSYN approach. 
First, the manual approach generates bus architectures having a greater implementation cost (i.e., having a larger number of buses) when compared with architectures generated using our approach. This is because our automated flow is able to traverse a much larger communication parameter exploration space than the manual approach, and it creates a new bus only when every suitable combination of communication parameters fails to meet the TCP constraint for the existing bus topology, whereas a designer working manually tends to make the conservative decision to add a bus. Second, the performance of the architecture generated by the manual approach is actually found to be better than that of our FABSYN approach (except for the third case study, where frequent bridge delay overheads reduce performance). This is because of the larger number of buses used by the manual approach, which reduces data traffic conflict and improves concurrency, at the expense of a higher implementation cost. However, it is important to note that we are not really concerned about the absolute performance of the system. What is important to us is that we satisfy all TCP constraints and minimize the implementation cost of the synthesized architecture, and that we do so in a reasonable amount of time. The manual approach suffers from the major drawback that it takes several days for the designer to come up with a bus architecture which is typically overdesigned and exceeds the requirements (resulting in a more expensive system), whereas our FABSYN approach generates a better quality architecture in a matter of a few hours.
TABLE IX: SYNTHESIS RESULT COMPARISON
V. CONCLUSION
In this paper, we presented an approach for automating the synthesis of bus-based communication architectures for systems characterized by several possible throughput constraints. Our approach synthesizes a low-cost bus topology and generates values for bus architecture parameters such as arbitration priority ordering, bus widths, bus speeds, and DMA burst sizes, required to meet the performance constraints in the design. In addition, we use a high level floorplanning and delay estimation engine to generate a layout of the components on the chip, and detect bus cycle time violations early in the design flow at the system level. Results from the automated synthesis of AMBA-based bus architectures for the network communication subsystem case studies show the usefulness of our approach. Our approach reduces the exploration and design time by at least an order of magnitude when compared to a manual effort, while also guaranteeing feasibility of physical design. Furthermore, our approach is easily portable across different standard bus-based communication architectures, such as CoreConnect [3] and OCP [4], and can be extended to automatically synthesize other bus architecture specific parameters such as out-of-order (OO) buffer sizes as well. Future work will focus on extending the FABSYN approach to crossbar based communication architectures.
