Abstract-The performance of a multiprocessor system heavily depends upon the efficiency of its bus architecture. This paper presents a methodology to generate a custom bus system for a multiprocessor system-on-a-chip (SoC). Our bus-synthesis tool, which we call BusSynth, uses this methodology to generate five different bus systems as examples: 1) bidirectional first-in first-out bus architecture; 2) global bus architecture (GBA) version I; 3) GBA version III; 4) hybrid bus architecture (Hybrid); and 5) split bus architecture. We verify and evaluate the performance of each bus system in the context of three applications: an orthogonal frequency division multiplexing wireless transmitter, an MPEG2 decoder, and a database example. Our methodology gives the designer a great benefit in fast design-space exploration of bus architectures across a variety of performance-impacting factors such as bus types, processor types, and software programming style. In this paper, we show that BusSynth can generate buses that, when compared to a typical general GBA, achieve superior performance (e.g., 41% reduction in execution time in the case of a database example). In addition, the bus architecture generated by BusSynth is designed in a matter of seconds instead of the weeks needed for the hand design of a custom bus system.
I. INTRODUCTION
STATE-OF-THE-ART chip-design technology enables system-on-a-chip (SoC) to open up new opportunities for very large scale integration (VLSI) hardware design. The ability of the semiconductor industry to continually live up to Moore's prediction [1] makes it practical to put multiple processing elements (PEs) on a single chip. In particular, single-chip integration allows the designer to take advantage of increased bus speed and width. Thus, an effective bus architecture with efficient arbitration for reducing contention among multiple PEs plays an important role in maximizing the performance of an SoC.
In the design of an SoC, one obvious issue for an SoC designer to consider is how to exchange data among multiple PEs in the SoC. For instance, should there be one bus or multiple buses and where should memory elements be placed? Another issue for an SoC designer to consider is how to easily and quickly design a bus system considering the increasing complexity of on-chip bus systems and in the context of ever shortening time to market demands. These issues have motivated the introduction of a design automation tool that is capable of generating customized SoC bus systems in hardware description language (HDL) code to speed up a user's design space exploration in search of a high-performance bus system. This paper presents a methodology to generate custom bus systems using predesigned reusable hardware modules for a multiprocessor SoC. The hardware modules are described using Verilog HDL. Using this methodology, five different bus systems [2], [3] are generated as examples in synthesizable Verilog HDL: 1) bidirectional first-in first-out (Bi-FIFO) bus architecture (BFBA); 2) global bus architecture version I (GBAVI); 3) GBA version III (GBAVIII); 4) hybrid bus architecture (Hybrid) that combines BFBA and GBAVIII; and 5) split bus architecture (SplitBA). Each bus system performance is evaluated using three applications: an orthogonal frequency division multiplexing (OFDM) wireless transmitter, an MPEG2 decoder, and a database example. We also show that our bus-synthesis tool (BusSynth) can efficiently generate a large variety of bus systems in a matter of seconds (as opposed to weeks of design effort to put together each bus system by hand). Furthermore, we compare the performance of each bus system with a simple general GBA (GGBA) or an industry standard on-chip bus (CoreConnect from IBM [4]), showing up to 41% reduction in application execution time with a customized bus architecture.
This paper is organized as follows. Section II shows related work, and Section III explains some of the terminology applied to describe our approach. Section IV depicts a bus system structure and several custom bus system examples. Section V presents a detailed description of the methodology. In Section VI, we explain applications used to evaluate the generated bus systems and then show experimental results. Finally, we conclude this paper in Section VII.
II. RELATED WORK
VLSI multiprocessor SoC designers face the following constraints: shorter time to market, ease of design, correctness of design, huge gate counts, and high performance requirements. These issues have been addressed through design automation using computer-aided design (CAD) tools.
Most SoC bus designs are based on predesigned reusable components stitched together with various forms of data, address, and control links. Several efforts from industry provide platforms to connect the predesigned components used in an SoC: CoreConnect [4] from IBM, AMBA [5] from ARM, CoreFrame [6], [7] from Palmchip Company, and SiliconBackplane
0278-0070/04$20.00 © 2004 IEEE
or CAN. Thus, BusSynth targets SoC designs where direct, nonpacket-based connections are desired. For this reason, BusSynth focuses on generating hardware blocks of dedicated bus logic for application-specific communication including handshake registers and bus arbiters for a customized bus architecture. This contrasts with the work of Pai Chou et al., which did not generate customized SoC bus architectures, but rather assumed that such bus architectures are already available (e.g., a CAN bus).
Several efforts [15]-[21] from TIMA Lab present a component-based design flow for a heterogeneous multicore SoC. Their design flow introduces a systematic method of wrapper generation for multicore SoC design based on architectural parameters extracted from a high-level system specification. Lyonnard et al. [15] introduce a design flow for the generation of an application-specific multiprocessor architecture. They use a generic multiprocessor architecture template to support two types of buses (e.g., point-to-point connection and shared bus) and a communication coprocessor for the interface between a processor and a bus. To interface each heterogeneous component to the other parts of the system, they depict a generic wrapper architecture that adapts to different communication protocols and abstraction levels based on automatic wrapper generation [16]-[18]. Cesário et al. [19], [20] and Nicolescu et al. [21] describe a component-based design environment to enable an automatic wrapper generation tool to support hardware interfaces, device drivers and application programmer interfaces (APIs).
Shin et al. [22] show how an optimal configuration of a parameterized on-chip system bus could be found using a software tool they developed. They, however, do not discuss generation of various bus communication topologies based on user requests.
We, on the contrary, focus on bus architecture in the component-based SoC design and provide a more flexible bus-architecture template to generate bus systems. The template supports multiple and heterogeneous bus architectures (e.g., GBAVI, GBAVIII, BFBA, Hybrid, and SplitBA) in a system and various optimized wrappers (e.g., CPU-, memory-, and generic-bus interfaces).
III. TERMINOLOGY
Before proceeding to discuss our bus synthesis tool (BusSynth), we first explain some of the terms we will be using to describe the different components of a bus architecture. Example 1 explains some of the terminology we have defined.
Definitions:
1) Processing Element (PE): a hardware unit that performs algorithmic processing, usually a CPU, but it may also be dedicated or reconfigurable logic.
2) Bus Bridge (BB): a hardware unit that is an on-off controllable connection point between two buses. If the BB is enabled, the two buses are fully connected; otherwise, the two buses are disconnected. Note that our BB does not currently support different bus speeds in buses connected by the BB.
3) Global Bus Architecture (GBA): a type of bus architecture having a bus through which all PEs can access
4) Bi-FIFO Bus Architecture (BFBA): a type of bus architecture where bidirectional FIFOs are used to transmit and receive data between adjacent PEs.
5) Segment of Bus (SB): a contiguous bus (no BBs) consisting of address, data, and control (e.g., read enable, write enable, request, and acknowledge) wires specific to a particular bus type (in our case, BFBA).
6) Bus Access Node (BAN): an integrated hardware block that is composed of at most one PE, custom hardware blocks and/or memory hardware together with associated bus access hardware and SB(s).
7) Module: a hardware unit such as PE, BB, SB, an arbiter, SRAM, or interface logic blocks, where the specific interface logic blocks will be explained in more detail in Section IV-A. Note that it is possible to extend the definition of module to include newly designed hardware units that carry out specific functions. For this paper, however, the definition given for module suffices.
8) Bus Subsystem: a subsystem that consists of one or more BANs connected together using the same bus or the combination of different bus architectures (in our case, either GBA, BFBA, or the combination of GBA and BFBA).
9) Bus System: a system that consists of one or more bus subsystems connected together.
Example 1: Terminology: Fig. 1 shows an SoC consisting of four PEs (MPC755s), each with an L1 cache. Each MPC755 is an example of a PE. The bottom right of Fig. 1 shows an SB used to connect BAN J to the rest of the SoC. Note the use of interface logic blocks (ILs) to connect MPC755 J to the bus system. The bottom right of Fig. 1 also shows MPC755 J connected to local SRAM and an SB to form bus access node J (BAN J). In BAN J, each block such as SRAM, IL2, IL4, or SB is a module. BAN J is adjacent to BAN I, and BANs I and J together form a bus subsystem using bus type BFBA for communication. On the left-hand side of Fig. 1, BANs A, B, and G form another bus subsystem in which GBAVIII is used for communication. A BB connects the two bus subsystems as shown in the top middle of Fig. 1. On the whole, Fig. 1 shows an example of a bus system composed of two bus subsystems.
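The containment hierarchy defined above (bus system, bus subsystem, BAN, module) can be sketched as plain data types. This is only an illustrative model, not BusSynth's internal representation: the class and field names are hypothetical, the module lists are placeholders, and treating BAN G as a memory-only BAN without a PE is an assumption based on the four-PE count of Fig. 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class BAN:
    """Bus access node: at most one PE plus memories and other modules."""
    name: str
    pe: Optional[str] = None                    # e.g. "MPC755"; None if no PE
    modules: List[str] = field(default_factory=list)

@dataclass
class BusSubsystem:
    """One or more BANs sharing a bus type (or a combination of bus types)."""
    bus_types: List[str]
    bans: List[BAN]

@dataclass
class BusSystem:
    """One or more bus subsystems; bridges hold pairs of subsystem indices."""
    subsystems: List[BusSubsystem]
    bridges: List[Tuple[int, int]] = field(default_factory=list)

    def pe_count(self) -> int:
        return sum(1 for s in self.subsystems for b in s.bans if b.pe)

# Loosely modeled on Fig. 1 (module lists are illustrative assumptions).
left = BusSubsystem(["GBAVIII"], [
    BAN("A", "MPC755", ["SRAM"]),
    BAN("B", "MPC755", ["SRAM"]),
    BAN("G", None, ["SRAM"]),                   # assumed memory-only BAN
])
right = BusSubsystem(["BFBA"], [
    BAN("I", "MPC755", ["SRAM"]),
    BAN("J", "MPC755", ["SRAM", "IL2", "IL4", "SB"]),
])
fig1 = BusSystem([left, right], bridges=[(0, 1)])
print(fig1.pe_count())
```

Keeping the BB as a link between subsystem indices mirrors the definition of a BB as a connection point between two buses rather than a member of either subsystem.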
IV. BUS SYSTEM STRUCTURE AND EXAMPLES
In this section, we will begin by describing our bus system structure. With this structure as a basis, we will show the generation of several bus systems, as exemplified in the following subsection.
A. Bus System Structure
Fig. 1 shows an example of a hierarchically structured multiprocessor bus system that has two bus subsystems with two and three BANs, respectively. A bus system is composed of one or more bus subsystems, and each bus subsystem includes one or more BANs, each of which is composed of PEs, hardware modules and/or memories together with associated control logic. Here, the bus subsystems are connected through bus bridges. This kind of hierarchical definition allows a bus system to have a flexible and scalable bus architecture in a multiprocessor SoC bus system design. Fig. 2 depicts a more detailed version of the bus subsystem shown on the left-hand side of Fig. 1. In addition to PEs (e.g., MPC755) and memories (e.g., SRAM) in the BANs of Fig. 2, additional modules are specified as interface logic (IL): CPU (or PE) to bus interface (CBI), memory to bus interface (MBI), and generic bus interface (GBI). With these ILs, each BAN can have different types of PEs, hardware modules and/or memories because the ILs enable the heterogeneous modules to adapt to one another. For example, BAN A can have MPC755 and SRAM while BAN B can have ARM9TDMI and DRAM. Similarly, GBI also provides flexibility in selecting various types of buses for a bus subsystem (e.g., GBAVIII and BFBA). Each BAN can access any other BAN's memory through a bus integrated with several SBs. Based on the bus system structure, by simply repeating generated BANs, a bus subsystem can be a scalable structure, and a multiprocessor bus system can be implemented in an easy manner.
When a bus subsystem has a global resource such as a large global memory to be accessed from all BANs, the resource is also defined as a BAN: for example, BAN G in Fig. 2. On the other hand, the bus system structure shown in Fig. 1 and the bus subsystem structure shown in Fig. 2 allow the user to adapt a standard commercial bus architecture (e.g., AMBA). As shown in Fig. 2, ILs adapt hardware units (e.g., arbiter, SRAM, and MPC755) to specific buses (e.g., GBA). Thus, if our module library that will be described in Section V-A provides the wrappers (i.e., ILs) for the various possible buses (e.g., a Bi-FIFO bus or an AMBA bus), our approach enables the user to choose custom bus topologies in a bus system. In support of the choice, BusSynth will generate custom Verilog HDL at the RTL as will be described in Section V.
B. Bus System Examples
In this section, we show five custom bus systems generated by BusSynth automatically: BFBA, GBAVI, GBAVIII, Hybrid, and SplitBA. All bus system examples shown in Figs. 3-9 have four PEs, a total of 32 MB of non-L1 cache memory, and a total of 256 KB of L1 cache memory (all examples have approximately the same chip area because the area of the bus logic and wires is much smaller than PE and memory area). Ordinarily, BusSynth can generate a bus system having any number of PEs and any sizes of memories according to the user options (in Section V-B, we will describe how the user inputs the options). In these examples, we use the Motorola PowerPC (MPC755) for the PE core, which, however, can be changed to other cores simply by adding a CBI module for the new PE core (e.g., ARM9TDMI) to be operated in the bus system.
First, we give a detailed explanation of the five sample custom bus architectures generated by BusSynth in this paper (BusSynth can generate a very large number of custom bus architectures). GBAVI, shown in Fig. 3 , is a kind of GBA, but the global bus is segmented by BBs (e.g., BB 2, 4, 6 and 8) separating each BAN, where the number of BANs is specified by the user. Each BAN has an SRAM block (e.g., SRAM A, SRAM B, SRAM C, or SRAM D). One BB in each BAN controls a possible bus connection between the PE side bus and the SRAM side bus in each BAN: BB 1 between CBI MPC755 and MBI SRAM in BAN A. Thus, in GBAVI, a group of two adjacent BANs can exchange data without any bus conflict with the other BANs in the SoC at the same time thanks to separation provided by the BBs. For example, in Fig. 3 , while BAN A and BAN B communicate with each other, BAN C and BAN D also can communicate at the same time without any bus conflict. Each group of two BANs in Fig. 3 is synchronized by handshaking using shared registers (HS REGS) between BANs (see [38] and [39] ). Note that GBAVI tends to work well in a pipelined style operation; for example, the output of a PE (e.g., MPC755 A) is passed to the next PE (e.g., MPC755 B).
As shown in Fig. 4 , BFBA has a Bi-FIFO between adjacent BANs. This design is similar to some commercially available multiprocessor printed circuit boards (PCBs) such as a Quad TMS320C6701 Processor VME Board from Pentek [23] . One BAN can push data into a Bi-FIFO while an adjacent BAN can read the data from the Bi-FIFO. In this way, the PEs can carry on successive functions for a pipelined operation. A specific way to communicate over the PEs in Fig. 4 is presented in [38] and [39] . Note that BFBA also works well in a pipelined style of operation.
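The push/read behavior of a Bi-FIFO between two adjacent BANs can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the generated hardware: the class name `BiFifo`, the endpoint labels `"a"` and `"b"`, and the choice to signal a full or empty FIFO through return values are all hypothetical.

```python
from collections import deque

class BiFifo:
    """Two bounded FIFOs, one per direction, between adjacent BANs 'a' and 'b'."""

    def __init__(self, depth: int):
        self.depth = depth
        self._queues = {("a", "b"): deque(), ("b", "a"): deque()}

    def push(self, src: str, dst: str, word: int) -> bool:
        """BAN `src` writes one word toward BAN `dst`; False if that FIFO is full."""
        q = self._queues[(src, dst)]
        if len(q) >= self.depth:
            return False
        q.append(word)
        return True

    def pop(self, src: str, dst: str):
        """BAN `dst` reads the oldest word sent by `src`; None if the FIFO is empty."""
        q = self._queues[(src, dst)]
        return q.popleft() if q else None

fifo = BiFifo(depth=1024)
fifo.push("a", "b", 0x1234)   # BAN a produces data for BAN b
fifo.push("b", "a", 0xBEEF)   # the reverse direction is independent
print(hex(fifo.pop("a", "b")), hex(fifo.pop("b", "a")))
```

Because each direction has its own queue, one BAN can keep producing while its neighbor consumes, which is the property that suits the pipelined style of operation described above.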
GBAVIII, shown in Fig. 5 , is a GBA with a global arbiter and a global memory. When any BAN tries to access the global memory through the global bus, the global arbiter resolves the case of multiple memory requests from the BANs. The arbiter has a first-come-first-serve (FCFS) scheduling scheme using a FIFO, but the arbiter can have a different policy such as a priority-based protocol. The Global SRAM in Fig. 5 can also be replaced with another memory type by using its corresponding MBI, which adapts the interface between the memory and the bus. The local memory in each BAN can be used for relatively faster memory access than the global memory due to arbitration time. How to communicate among BANs in Fig. 5 is shown in [38] and [39] . Also, please note that GBA version II (GBAVII) was presented in [1] but was not chosen for automated generation in this paper due to space constraints; however, if desired, the GBAVII bus could easily be added to our tool.
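A first-come-first-serve policy backed by a FIFO, as described for the global arbiter, can be modeled in a few lines. The following is a behavioral sketch only; the class `FcfsArbiter` and its `request`/`release` methods are hypothetical names, and the real arbiter is an RTL module with request and acknowledge signals rather than method calls.

```python
from collections import deque
from typing import Optional

class FcfsArbiter:
    """Grants the global bus to requesters in arrival order."""

    def __init__(self) -> None:
        self._pending = deque()          # FIFO of waiting BAN names
        self.owner: Optional[str] = None # BAN currently holding the bus

    def request(self, ban: str) -> None:
        # Ignore duplicate requests from the owner or an already queued BAN.
        if ban != self.owner and ban not in self._pending:
            self._pending.append(ban)
        self._grant()

    def release(self, ban: str) -> None:
        if ban == self.owner:
            self.owner = None
            self._grant()

    def _grant(self) -> None:
        # Hand the bus to the oldest pending request when the bus is free.
        if self.owner is None and self._pending:
            self.owner = self._pending.popleft()

arb = FcfsArbiter()
for ban in ("C", "A", "B"):
    arb.request(ban)
order = [arb.owner]
arb.release("C"); order.append(arb.owner)
arb.release("A"); order.append(arb.owner)
print(order)
```

Swapping the deque for a priority queue would give a priority-based policy without changing the request/release interface.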
Another possible bus system, Hybrid, is the combination of BFBA and GBAVIII, as shown in Fig. 6. This combination allows the bus architecture to exploit the advantages of both BFBA and GBAVIII by: 1) supplying a Bi-FIFO data transfer method between adjacent BANs and 2) having a global memory area that can be accessed from all BANs. This combination of features gives flexibility in communication and thus results in a higher performance, although a penalty is paid in increased chip area (see Table IX in Section VI-C for details). Fig. 7 can operate at the same time without bus contention so that system performance is increased. In addition, in each bus subsystem, the bus is shorter than a single GBA would be, which makes the system faster and even lowers power consumption because of the lower parasitic resistance and capacitance of the buses in the SoC [24]. Due to its divided bus, SplitBA also relieves bus traffic congestion caused by shared memory requests from each BAN.
CoreConnect bus architecture (CCBA) from IBM and general GBA (GGBA) are shown in Figs. 8 and 9, respectively. These bus architectures are designed by hand and are used as a baseline for performance comparisons with our generated bus systems. 
V. METHODOLOGY FOR BUS SYNTHESIS
Based on the bus system structure described in Section IV-A, our bus synthesis tool BusSynth generates the bus system examples shown in Section IV-B using two kinds of libraries: a module library and a wire library. In this section, we show the methodology behind our approach to generate a custom user-specified bus system. In Section V-A, we show how the module library and the wire library are constructed and how they work in the tool. Then, in Section V-B, we explain how to generate bus systems using the libraries. Thus, Section V-B, the second and final subsection, covers the main point: detailed methodology, pseudo code and algorithmic analysis for bus synthesis.
A. Libraries for Module Repository and Wiring
BusSynth uses two libraries to generate a bus system. One is a module library that contains all modules currently supported for use inside a BAN, a bus subsystem and/or a bus system. The other library is a wire library that contains many different specific wires for connecting the modules inside BANs, bus subsystems and a bus system.
The module library contains not only input/output (I/O) port information and behavior of each module in RTL Verilog code, but also many templates to generate specific modules (e.g., ARBITERs). Here, the templates have parameters that configure each specific module requested through the user options (introduced in detail in Section V-B); a module is generated by assigning to these parameters the values given in the user input options. Each library component is described in text in a file and starts and ends with a specific keyword, respectively: "%module library name" and "%endmodule library name." The parameters to be configured in a library component are specified with another specific keyword "@parameter@." These keywords are shown in Example 2 in detail. The module library contains the following components:
1) PE: a processing element, where PE is one of MPC750, MPC755, MPC7410 or ARM9TDMI;
2) CBI PE: an interface module between a PE (or CPU) and a bus;
3) memory comp: a memory template to be used to generate any size of behavioral memory, where memory is one of SRAM or DRAM;
4) MBI memory: an interface module between a memory and a bus, where memory is one of SRAM or DRAM;
5) BB bb type: a bus bridge, where bb type is one of GBAVI or SplitBA;
6) ARBITER arb type: an arbiter module, where arb type is one of "Round Robin" or "Priority;"
7) ABI: an interface module between an arbiter and a bus;
8) GBI bus type: a generic bus interface module, where bus type is one of GBAVI, GBAVIII or BFBA;
9) SB bus type: a module for Segment of Bus (SB), where bus type is BFBA.
Example 2 shows an example of the module library and how the different parameters in each library component are taken into consideration when performing adaptation between heterogeneous hardware components (e.g., between a bus and an SRAM). Here, the different parameter values are based on the user input options.
Example 2: Module Library: As an example of a module library component, MBI SRAM is shown in Fig. 10 . This component is for the interface between an SRAM and a bus as shown in Figs. 3-7 when a user wants to attach an SRAM to the buses through the user options that will be explained in Section V-B. In Fig. 10 , the library component name is shown in the first line, "%module library name ," where library name is MBI SRAM. To specify MBI SRAM's property, there are three different parameters: physical memory address width, memory data width, and difference in bit width between bus-data width and memory-data width. These parameters are set in a module generation procedure based on the user options, where the module generation procedure using the module library will be described in Section V-B4 in detail. For the interface between CPU bus A and the 8 MB SRAM in BAN A of Fig. 4 , the parameters are set to "20," "64," and "0" for memory address width, memory data width, and bit width difference, respectively. The parameter values are from bus property (e.g., BUS D WIDTH: 64) and memory property (e.g., MEM A WIDTH: 20 and MEM D WIDTH: 64) in the user options, and the bit difference is the difference between "BUS D WIDTH" and "MEM D WIDTH." In this library, control signals for reading from and writing to SRAM are decided by pin names: reb local, sram reb, web local, and sram web. Please note that we assume that all addresses which appear on a bus are physical addresses. Any virtual addresses used by programs must be translated to physical addresses prior to placing them on the bus.
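The parameter substitution of Example 2 can be sketched as simple text replacement. The template body below is hypothetical (the actual MBI SRAM component is the one in Fig. 10), the underscored identifiers and the `@NAME@` placeholder spelling are assumptions about the library syntax, and the helper names are illustrative. The last helper also checks the arithmetic behind the 8 MB figure: a 20-bit address over 64-bit words covers 2^20 x 8 bytes.

```python
import re

# Hypothetical template; only the %module/%endmodule and @...@ conventions
# come from the text, and their exact spelling here is an assumption.
TEMPLATE = """%module MBI_SRAM
wire [@MEM_A_WIDTH@-1:0] sram_addr;
wire [@MEM_D_WIDTH@-1:0] sram_dq;
localparam BIT_DIFF = @BIT_DIFF@;
%endmodule MBI_SRAM
"""

def mbi_sram_params(bus_d_width: int, mem_a_width: int, mem_d_width: int) -> dict:
    """Derive the three MBI SRAM parameters from bus and memory properties."""
    return {
        "MEM_A_WIDTH": mem_a_width,
        "MEM_D_WIDTH": mem_d_width,
        "BIT_DIFF": bus_d_width - mem_d_width,  # bus-data minus memory-data width
    }

def instantiate(template: str, params: dict) -> str:
    """Replace every @NAME@ placeholder; a missing name raises KeyError."""
    return re.sub(r"@(\w+)@", lambda m: str(params[m.group(1)]), template)

def memory_bytes(addr_width: int, data_width: int) -> int:
    """Capacity of a word-addressed memory in bytes."""
    return (2 ** addr_width) * data_width // 8

params = mbi_sram_params(bus_d_width=64, mem_a_width=20, mem_d_width=64)
print(instantiate(TEMPLATE, params))
print(memory_bytes(20, 64) // 2**20, "MB")
```

Raising on an unknown placeholder, rather than leaving it in the output, keeps an incomplete set of user options from silently producing invalid HDL.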
The wire library contains all possible combinations of legal connections between hardware blocks (e.g., between modules in each BAN, between BANs in each bus subsystem or between bus subsystems in a bus system). This library is written in ASCII format as shown in Fig. 11, and there are several fields to specify connection information:
1) wire name (w name);
2) wire width (w width);
3) module name (mx name), where x indicates the module number, 1 or 2;
4) port name in module x (mx pname);
5) most significant bit (MSB) of wire connected to a module x (mx wmsb);
6) least significant bit (LSB) of wire connected to a module x (mx wlsb).
In the wire library format shown in Fig. 11, two modules are connected by the wire, namely, module m1 name and module m2 name. To specify a single wire connecting three or more distinct ports, an additional wire entry is needed for each additional port beyond two. Please note that the m1 pname field specifies the port to which the wire connects in module m1 name, while the m2 pname specifies the port to which the wire connects in module m2 name. Thus, in a sense, two "ports" are specified in each wire library entry! These two ports are not, strictly speaking, "part of" the wire; nonetheless, since the wire connects the two ports, the two ports are part of the wire library format. Example 3 shows wire connections between two modules within the same BAN.
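The six kinds of fields listed above suggest a ten-field record per wire entry (name, width, and four fields for each of the two modules). The sketch below parses such an entry from one whitespace-separated line. The on-disk layout of Fig. 11 is not reproduced in the text, so the field order, the separator, the underscored identifiers, and the sample port names are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WireEntry:
    w_name: str
    w_width: int
    m1_name: str
    m1_pname: str
    m1_wmsb: int
    m1_wlsb: int
    m2_name: str
    m2_pname: str
    m2_wmsb: int
    m2_wlsb: int

    def is_consistent(self) -> bool:
        """Both bit slices must lie inside the wire: 0 <= lsb <= msb < width."""
        return all(0 <= lsb <= msb < self.w_width
                   for msb, lsb in ((self.m1_wmsb, self.m1_wlsb),
                                    (self.m2_wmsb, self.m2_wlsb)))

def parse_wire(line: str) -> WireEntry:
    """Parse one entry, assuming ten whitespace-separated fields."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}")
    return WireEntry(f[0], int(f[1]),
                     f[2], f[3], int(f[4]), int(f[5]),
                     f[6], f[7], int(f[8]), int(f[9]))

# Hypothetical entry for a 20-bit address wire between MBI SRAM and SRAM A.
entry = parse_wire("w_addr 20 MBI_SRAM sram_addr 19 0 SRAM_A addr 19 0")
print(entry.w_name, entry.w_width, entry.is_consistent())
```

A wire that fans out to a third port would, following the text, need a second entry with the same `w_name` rather than extra fields in this record.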
Example 3: Wire Connections in a BAN: As an example of a wire connection in a BAN, consider the wires between MBI SRAM and SRAM A in BAN A of Fig. 4. Fig. 12 shows the detailed wires connecting SRAM A to MBI SRAM: w addr for address bus, w web for write enable, w reb for read enable, w csb for chip selection and w dq for data bus. To specify the wires in Fig. 12 , the wire information in the wire library is as follows: Note that the m1 name and m2 name fields may be the same when a connection specifies either: 1) a single wire between more than two ports on different modules (or BANs) or 2) a set of similarly-named wires (except for a suffix) forming a torus among more than two ports on different modules (or BANs). Example 4 shows such wire connections between different BANs. Please note that to specify a wire between/among BANs that have same I/O ports in their pin names in a bus subsystem (e.g., the connection between BAN A and BAN B in Fig. 4) , m1 name and m2 name in Fig. 11 need to be the same. This case is described in Example 4 in detail, where Fig. 13(b) shows detailed blocks and I/O pins that are related to each BAN's I/O ports shown in Fig. 13(a) .
Example 4: Wire Connections Between BANs in a Bus System: This example shows how to form wire connections between multiple (more than two) BANs in the wire library. BANs A, B, C, and D are linked in a chain as shown in Fig. 13(a), and the connections of the I/O ports shown in the left box of Fig. 13(b) are repeated between the BANs. In this kind of wire connection, the wires connecting the BANs have the same names but with different suffixes as shown in Fig. 13(a), except for one case: reset. The reset wire does not have a suffix, and reset is the only wire that connects to all the BANs with a single contiguous wire, as opposed to just connecting one BAN to another. In the example of Fig. 13, it is not necessary that we specify all wires individually. Thus, although the wire library format technically only supports the specification of a wire connecting two ports (from up to two different BANs), our tool nevertheless supports wire specifications such as shown in this example. The result of the wire specifications shown in this example is the serial connections (wires) linking the specified BANs by generation of wires suffixed by an enumerated number as shown in Fig. 13(a). For this purpose, wire connections between BANs are specified by the same module names in the m1 name field and the m2 name field in the wire library format as shown in Fig. 11; in this example, the names are just "BAN[A,B,C,D]" as shown below. If the port name fields m1 pname and m2 pname are different, then a torus network is described; otherwise, a simple contiguous wire is described. Here, "BAN[A,B,C,D]" means that the specified wire connecting the named ports is applied for BANs A, B, C, and D. On the other hand, the wires between BANs having connections other than a simple contiguous wire or a torus network have to be specified separately in the wire library; for example, as shown below, single explicit wires are specified connecting BAN B and BAN FFT.
The connections between BAN B and BAN FFT in Fig. 13(a) show the case where we assume that BAN B has another bus to BAN fast Fourier transform (FFT) in addition to the buses connecting BANs A, B, C, and D. Here, BAN FFT is a BAN having a hardware FFT core.
Detailed wire connections between a pair of BANs A, B, C, and D in Fig. 13 (a) are as follows: w done op cs or w done rv cs for handshake register selection, w ban web for write enable, w ban reb for read enable, w fifo cs for FIFO chip selection and w data for data bus as shown in Fig. 13(a) . In the connections between BAN B and BAN FFT, the wires are as follows: w fft ad: address for FFT buffer, w fft data: data bus, w fft reb for read enable, w fft web for write enable, w fft srt for FFT start control, and w fft ack for acknowledge of FFT end. The wire connections among the BANs shown in Fig. 13(a) are specified in the wire library as follows: As stated earlier, please note that the wire library contains at a minimum all legal connections among modules, where, by a "legal" connection, we mean a connection which makes clear functional sense, e.g., between two 32-bit address ports. However, in a case where a specialized non-"legal" connection, e.g., from bit 3 of an address port to a clock input, is desired, such a case can be supported by manually entering the wire into the wire library.
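The expansion rule of Example 4 (identical port names give one contiguous wire, different port names give suffixed wires between neighboring BANs) can be sketched as follows. This is one possible reading, not BusSynth's algorithm: the function name, the `BAN_A` naming of expanded instances, the zero-based suffixes, the sample port names, and in particular the `wrap` flag that closes the chain from the last BAN back to the first are assumptions.

```python
import re
from typing import List, Tuple

Endpoint = Tuple[str, str]                 # (module name, port name)
Wire = Tuple[str, List[Endpoint]]          # (wire name, connected endpoints)

def expand_group(w_name: str, group: str, m1_pname: str, m2_pname: str,
                 wrap: bool = True) -> List[Wire]:
    """Expand a spec such as 'BAN[A,B,C,D]' into concrete wires."""
    m = re.fullmatch(r"(\w+)\[([\w,]+)\]", group)
    if m is None:
        raise ValueError(f"not a group specification: {group!r}")
    bans = [f"{m.group(1)}_{n}" for n in m.group(2).split(",")]

    if m1_pname == m2_pname:
        # Same port on every BAN: one contiguous, unsuffixed wire (e.g. reset).
        return [(w_name, [(b, m1_pname) for b in bans])]

    # Different ports: one suffixed wire per pair of neighboring BANs.
    pairs = list(zip(bans, bans[1:]))
    if wrap and len(bans) > 2:
        pairs.append((bans[-1], bans[0]))  # assumed wraparound link
    return [(f"{w_name}_{i}", [(a, m1_pname), (b, m2_pname)])
            for i, (a, b) in enumerate(pairs)]

reset = expand_group("reset", "BAN[A,B,C,D]", "reset", "reset")
data = expand_group("w_data", "BAN[A,B,C,D]", "data_out", "data_in")
print(len(reset), [name for name, _ in data])
```

Connections that fit neither pattern, such as the dedicated bus between BAN B and BAN FFT, would still be written as individual two-port entries.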
To specify a port in a module, we use port direction, port name, MSB, and LSB of the port width for each port of the module. Thus, a record of port information contains the four properties port direction, port name, MSB and LSB in a data structure. In order to specify the ports in the module, a record for each port is required. Example 5 shows an example of the port information.
Example 5: Record of Port Information: Suppose that we want to describe a port "addr fft [11:0] " of BAN FFT shown in Fig. 13(b) . A record for the port information contains "input," "addr fft," "11," and "0" in a port data structure.
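The four-field port record of Example 5 maps directly onto a small data type. In the sketch below, the class name `PortRecord`, the underscored port name `addr_fft`, and the helper that renders a Verilog-style declaration are illustrative assumptions; only the four properties themselves come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PortRecord:
    direction: str   # "input", "output", or "inout"
    name: str
    msb: int
    lsb: int

    @property
    def width(self) -> int:
        return self.msb - self.lsb + 1

    def to_verilog(self) -> str:
        """Render the record as a Verilog port declaration."""
        rng = f"[{self.msb}:{self.lsb}] " if self.width > 1 else ""
        return f"{self.direction} {rng}{self.name};"

port = PortRecord("input", "addr_fft", 11, 0)
print(port.width, port.to_verilog())
```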
B. Bus Synthesis Sequence
We now show how to generate the bus systems shown in Section IV-B. First, we describe the overall flow of bus synthesis as shown on the left-hand side of Fig. 14 . Next, we explain the user options to configure the bus system to be generated from our bus synthesis tool BusSynth. Third, we describe how to generate the wires to interconnect the modules of a specific hardware unit (e.g., a BAN or a bus subsystem) that is to be part of the specified bus system. Fourth, we describe our algorithm for bus subsystem generation. Fifth, we describe our algorithm for bus system generation. Finally, we end with an analysis of the computational complexity of the algorithms we have introduced.
1) Overall Flow of BusSynth:
The flowchart on the left-hand side of Fig. 14 shows the flow of the bus-synthesis sequence. First, BusSynth takes user input options for a bus system to be generated, and then, based on the options, BusSynth generates the required BANs and then assembles them into the required bus subsystems. After that, if the bus system the user wants has more than one bus subsystem, the generated bus subsystems are integrated into the resulting bus system. Otherwise, the generated single bus subsystem becomes a bus system. Finally, BusSynth writes synthesizable Verilog HDL code for the generated bus system.
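The sequence described above (read options, generate BANs, assemble bus subsystems, integrate them if there is more than one, write out the result) can be outlined as a short pipeline. Every function here is a hypothetical placeholder that only records structure; in particular, `write_summary` emits comment lines as a stand-in for the step where BusSynth writes synthesizable Verilog HDL.

```python
from typing import Dict

def generate_ban(ban_opts: Dict) -> Dict:
    """Placeholder for BAN generation (module instantiation plus wiring)."""
    return {"name": f"BAN_{ban_opts['name']}", "cpu": ban_opts.get("cpu_type")}

def generate_subsystem(sub_opts: Dict) -> Dict:
    """Generate the required BANs, then assemble them into one bus subsystem."""
    return {"bus_types": list(sub_opts["bus_types"]),
            "bans": [generate_ban(b) for b in sub_opts["bans"]]}

def synthesize(options: Dict) -> Dict:
    subsystems = [generate_subsystem(s) for s in options["subsystems"]]
    # More than one subsystem: they are integrated into a bus system.
    # Exactly one: that subsystem simply becomes the bus system.
    return {"subsystems": subsystems, "integrated": len(subsystems) > 1}

def write_summary(system: Dict) -> str:
    """Stand-in for the final HDL-writing step."""
    lines = []
    for i, sub in enumerate(system["subsystems"]):
        names = ", ".join(b["name"] for b in sub["bans"])
        lines.append(f"// subsystem {i} ({'+'.join(sub['bus_types'])}): {names}")
    return "\n".join(lines)

options = {"subsystems": [{
    "bus_types": ["BFBA"],
    "bans": [{"name": n, "cpu_type": "MPC755"} for n in "ABCD"],
}]}
system = synthesize(options)
print(write_summary(system))
```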
2) Detailed User Options: As the first step of the flowchart in the left hand side of Fig. 14 , to configure the custom bus system, the user enters input options according to the right hand side box of Fig. 14 . These options are input constraints used to generate a custom bus system. Several categories in these options are as follows:
1) Bus System Property: number of bus subsystems in a bus system.
2) Bus Subsystem Property: number of BANs, number of buses and bus type of each bus, where the bus type is one of GBAVI, GBAVIII, BFBA, or SplitBA.
3) Bus Property: address bus width, data bus width, and Bi-FIFO depth for each bus type specified in each bus subsystem, where the Bi-FIFO depth is available only for BFBA and Hybrid.
4) BAN Property: CPU type or Non-CPU type and number of memories for each BAN, where the CPU type is one of MPC750, MPC755 or ARM9TDMI, and the non-CPU type is one of DCT or MPEG2 decoder. Note that this can be easily extended to include new CPUs or additional predesigned reusable components (non-CPUs).
5) Memory Property: memory type, address bus width, and data bus width for each memory specified in each BAN, where the memory type is one of SRAM, DRAM, DPRAM, or FIFO. Note that this can easily be extended to include additional memory types.
The input sequence of user options is as follows. First, the user enters the number of bus subsystems for a bus system and specifies the number of BANs and a bus type for each bus subsystem. For the bus types selected in the bus subsystem property option, the user inputs bus property options for each bus type. The CPU Type or non-CPU Type and the number of memories are inputs in the BAN property option if the user wants to have these resources in a BAN. Finally, the user inputs memory property for each selected memory in the BAN property if any memory is required in a BAN. How to use each option in a bus system is shown in Example 6.
Example 6: User Input to Configure a Bus System to Be Generated:
A user input sequence that specifies the BFBA bus system shown in Fig. 4 is as follows. The user first specifies the number of bus subsystems by entering "1" in the bus system property (user option 1 in Fig. 14) and inputs "4" for the number of BANs (user option 2.1), which are BANs A, B, C, and D in Fig. 4. After entering "1" for the number of buses (user option 2.2), the user inputs "BFBA" for the bus type (user option 2.3) to specify the bus subsystem. For the bus type BFBA, the user assigns the fields of the bus property as follows: "32" for address bus width (user option 3.1), "64" for data bus width (user option 3.2), and "1024" for Bi-FIFO depth (user option 3.3). Next, the user inputs the fields of the BAN property for each BAN specified in the bus subsystem property: "MPC755" for the CPU type (user option 4.1), none for the non-CPU type (user option 4.2), and "1" for the number of memories (user option 4.3). Finally, the memory property is input for the single memory chosen for each BAN: "SRAM" for the memory type (user option 5.1), "20" for the address bus width (user option 5.2), and "64" for the data bus width (user option 5.3).
According to the user options shown in the right-hand box of Fig. 14, the user can customize the bus architecture of a bus system by using our bus-synthesis tool BusSynth. As one of the customized bus architectures, the user might want to generate a mixed bus architecture by using more than one of the bus architectures we defined (e.g., GBAVI, BFBA, and GBAVIII). Example 7 describes how to generate a customized bus architecture for a bus system through the user options in BusSynth.
Example 7: Customized Bus Architecture (Hybrid): Suppose a user wants to generate a combined bus architecture using several of the custom buses explained earlier: specifically, a bus architecture that combines the Bi-FIFO bus from BFBA and the global bus from GBAVIII. As shown in Fig. 6, we call the bus system having this combined bus architecture Hybrid. To generate such a bus system, the user needs to enter the user options shown in the right-hand box of Fig. 14 as follows. First, the user enters "1" for the number of bus subsystems (user option 1). Then, the bus subsystem property (user option 2) is specified as follows: "4" for the number of BANs (user option 2.1), "2" for the number of buses (user option 2.2), "BFBA" for the first bus type (user option 2.3), and "GBAVIII" for the second bus type (user option 2.3). For the specified buses, BFBA and GBAVIII, the user enters their respective properties. In the BFBA bus property, address bus width (user option 3.1) is set to "32," data bus width (user option 3.2) is set to "64," and Bi-FIFO depth (user option 3.3) is set to "1024." In the GBAVIII bus property, address bus width (user option 3.1) is "32," and data bus width (user option 3.2) is "64." Next, the user enters each BAN property; recall that the user specified four BANs for the bus subsystem in user option 2.1. In the property of each BAN (i.e., BANs A, B, C, and D), CPU type (user option 4.1) is set to "MPC755," non-CPU type (user option 4.2) is set to "NONE," and the number of memories (user option 4.3) is set to "1." Based on the user input entered so far, each BAN has a single memory block, and thus the bus subsystem contains a total of four memories (see user option 2.1). Finally, the memory property for each memory block (i.e., SRAMs A, B, C, and D shown in Fig. 6) is entered as follows: "SRAM" for memory type (user option 5.1), "20" for address bus width (user option 5.2), which gives a size of 8 MB, and "64" for data bus width (user option 5.3).
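The option hierarchy used in Examples 6 and 7 can be pictured as a small nested record. The following is a minimal sketch in Python, not BusSynth's actual input format: all class names, field names, and the `validate` helper are hypothetical, and the rule that a Bi-FIFO depth accompanies exactly the BFBA bus is our reading of user option 3.3.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values as listed in the option categories (names are from the text).
BUS_TYPES = {"GBAVI", "GBAVIII", "BFBA", "SplitBA"}
CPU_TYPES = {"MPC750", "MPC755", "ARM9TDMI"}
MEMORY_TYPES = {"SRAM", "DRAM", "DPRAM", "FIFO"}


@dataclass
class MemoryProperty:  # user options 5.1-5.3
    mem_type: str
    addr_width: int  # bits
    data_width: int  # bits

    def size_bytes(self) -> int:
        # 2**addr_width words, each data_width / 8 bytes wide
        return (1 << self.addr_width) * (self.data_width // 8)


@dataclass
class BanProperty:  # user options 4.1-4.3
    name: str
    cpu_type: Optional[str]
    non_cpu_type: Optional[str]
    memories: List[MemoryProperty]


@dataclass
class BusProperty:  # user options 2.3 and 3.1-3.3
    bus_type: str
    addr_width: int
    data_width: int
    bififo_depth: Optional[int] = None


@dataclass
class BusSubsystem:  # user option 2
    bans: List[BanProperty]
    buses: List[BusProperty]


def validate(subsystems: List[BusSubsystem]) -> List[str]:
    """Return a list of human-readable problems (empty if none)."""
    errors = []
    for i, sub in enumerate(subsystems):
        for bus in sub.buses:
            if bus.bus_type not in BUS_TYPES:
                errors.append(f"subsystem {i}: unknown bus type {bus.bus_type}")
            # Assumed rule: a depth is given for a BFBA bus and for no other bus.
            if (bus.bififo_depth is not None) != (bus.bus_type == "BFBA"):
                errors.append(f"subsystem {i}: Bi-FIFO depth mismatch on {bus.bus_type}")
        for ban in sub.bans:
            if ban.cpu_type is not None and ban.cpu_type not in CPU_TYPES:
                errors.append(f"{ban.name}: unknown CPU type {ban.cpu_type}")
            for mem in ban.memories:
                if mem.mem_type not in MEMORY_TYPES:
                    errors.append(f"{ban.name}: unknown memory type {mem.mem_type}")
    return errors


# Example 7 (Hybrid): one subsystem, four BANs, a BFBA bus plus a GBAVIII bus.
hybrid = [
    BusSubsystem(
        bans=[
            BanProperty(f"BAN {x}", "MPC755", None, [MemoryProperty("SRAM", 20, 64)])
            for x in "ABCD"
        ],
        buses=[BusProperty("BFBA", 32, 64, 1024), BusProperty("GBAVIII", 32, 64)],
    )
]

print(validate(hybrid))
print(hybrid[0].bans[0].memories[0].size_bytes() // 2**20, "MB per SRAM")
```

A 20-bit address with a 64-bit data bus gives 2^20 words of 8 bytes each, which is the 8 MB per SRAM quoted in Example 7.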
3) Unit Generation: We introduce here an algorithm, UnitGen, which is used to generate in HDL a hardware unit that is specified to be part of the bus system desired by the user. In particular, given a list of modules as input, UnitGen generates the wires needed to connect all the modules together appropriately. Depending on the input list (array) of module names, UnitGen can generate a BAN, a bus subsystem or a bus system. UnitGen (short for "Unit Generator") is used by (called from) algorithms (BusSubSys or BusSys) that will be introduced in Sections V-B4 and V-B5.
Example 8: Array of Module Names Input to UnitGen: Consider the case where we generate a hardware unit, BAN A of BFBA shown in Fig. 4. UnitGen, shown in Fig. 15, takes as input an array of module names that contains MPC755, MBI SRAM, HS REGS, CBI MPC755, SRAM A, and Bi-FIFO, since these six modules are the components of BAN A. Fig. 15 shows the pseudocode of UnitGen. The input arguments are an array of module names, the name of the top hardware unit to be generated, and a pointer to the wire library. The input array of module names contains the names of all modules in the top hardware unit to be generated. Since UnitGen integrates the modules specified in the module name array, these modules are provided as separate HDL files (extracted from the module library on the left of Fig. 14). Thus, although UnitGen does not use the module library explicitly, it does use the module library implicitly through the wire library, from which it generates wires for the specified design.
In lines 2-8 of Fig. 15, to connect the modules specified in the array of module names, UnitGen first extracts the specific wires that connect these modules from a wire library file; this wire information is placed in a data structure LW1. Here, each record of LW1 is composed of the same fields as those shown in the wire library format of Fig. 11. In lines 12-17, port information of the modules is read from separate HDL files that were generated for the modules in advance; the extracted port information is placed in a data structure LP1. Here, each record of LP1 is composed of the following fields: port name, port direction, and port width. Lines 18-27 of Fig. 15 compare, for each module: 1) the port name of each wire contained in LW1 with 2) each port name (corresponding to a specific port of a specific module) contained in LP1. Thus, UnitGen can decide the required wire connections among the modules specified in the array of module names by utilizing the port information of the modules. With the comparison performed in lines 18-27 of Fig. 15, UnitGen saves the wire-port mapping information for the specified modules to a linked list LWPM in line 22. Ports with no internal connections (and thus definitely external ports of the hardware unit to be generated) are saved to a linked list LP2 in line 26. Finally, in lines 29-35, UnitGen writes synthesizable Verilog HDL code by generating, in a declarative fashion, the instantiation code of the modules including all wires between modules. Example 9 shows how UnitGen generates a hardware unit in an HDL file.
Example 9: Unit Generation: Consider the generation of a hardware unit, BAN A of BFBA shown in Fig. 4. As shown in Example 8, UnitGen first takes an array of module names that contains MPC755, MBI SRAM, HS REGS, CBI MPC755, SRAM A, and Bi-FIFO. In lines 2-8 of Fig. 15, UnitGen extracts the specific wire data needed to connect the modules (e.g., one wire datum is w name "w addr," m1 name "SRAM A," and m1 pname "sram addr" in the format of Fig. 11) from the wire library and saves the wire record (e.g., w name "w addr," m1 name "SRAM A," and m1 pname "sram addr") to LW1. In lines 12-17 of Fig. 15, UnitGen obtains port information (e.g., port name "sram addr") from the current module (e.g., "SRAM A") and saves the port name to LP1. Next, in lines 18-27, to decide which specific wire connects the modules, UnitGen compares, for the current module (e.g., "SRAM A"), an associated port name field in LP1 (e.g., "sram addr") with a port name field of LW1 (e.g., "sram addr"). If the two fields are equal, they need to be connected (by design, the module and wire libraries are constructed to assign the same name to ports that can be connected), and UnitGen takes the wire information in LW1 (e.g., "w addr"), the port information (e.g., "sram addr"), and the current module name (e.g., "SRAM A"), and saves them to LWPM in line 22. LWPM will be used later to generate wires in Verilog HDL. This procedure (lines 18-27) is repeated for all ports in LP1. Finally, in lines 29-35, UnitGen generates the instantiation code for each module, including all wires, in the form of Verilog HDL code describing BAN A in a top Verilog file.
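The matching step of UnitGen can be sketched as follows. This is a simplified illustration in Python, not the pseudocode of Fig. 15: the record layouts (a wire record as a `(wire, module, port)` triple, a port record as `(name, direction, width)`), the underscore spellings of module and port names, the instance and external-port naming, and the rule that a wire is internal only when at least two of its endpoints lie inside the unit are all assumptions made for the sketch.

```python
from collections import Counter


def _range(width):
    """Verilog bit range for a vector, empty for a 1-bit signal."""
    return f"[{width - 1}:0] " if width > 1 else ""


def unit_gen(module_names, top_name, wire_library, port_library):
    """Return (verilog_text, lwpm, lp2) for one hardware unit."""
    selected = set(module_names)
    # Lines 2-8: keep wire records that touch a listed module (LW1).
    lw1 = [(w, m, p) for (w, m, p) in wire_library if m in selected]
    ends = Counter(w for (w, _, _) in lw1)
    # Assumption: a wire is internal only if two or more endpoints are in this unit.
    wire_of = {(m, p): w for (w, m, p) in lw1 if ends[w] >= 2}
    # Lines 12-17: collect port name, direction and width per module (LP1).
    lp1 = [(m, p, d, n) for m in module_names for (p, d, n) in port_library[m]]
    # Lines 18-27: match ports against wires; unmatched ports become external.
    lwpm, lp2 = [], []
    for m, p, d, n in lp1:
        if (m, p) in wire_of:
            lwpm.append((wire_of[(m, p)], m, p, n))
        else:
            lp2.append((m, p, d, n))
    # Lines 29-35: write the top module with wires and instantiations.
    out = [f"module {top_name} ("]
    out.append(",\n".join(f"  {d} {_range(n)}{m}_{p}" for (m, p, d, n) in lp2))
    out.append(");")
    declared = set()
    for w, _, _, n in lwpm:
        if w not in declared:
            declared.add(w)
            out.append(f"  wire {_range(n)}{w};")
    for m in module_names:
        conns = [f".{p}({w})" for (w, mm, p, _) in lwpm if mm == m]
        conns += [f".{p}({mm}_{p})" for (mm, p, _, _) in lp2 if mm == m]
        out.append(f"  {m} u_{m} ({', '.join(conns)});")
    out.append("endmodule")
    return "\n".join(out), lwpm, lp2


# Hypothetical library contents for two of the six modules of BAN A.
wire_library = [
    ("w_addr", "SRAM_A", "sram_addr"), ("w_addr", "MBI_SRAM", "sram_addr"),
    ("w_data", "SRAM_A", "sram_data"), ("w_data", "MBI_SRAM", "sram_data"),
    ("w_req", "MBI_SRAM", "cpu_req"), ("w_req", "CBI_MPC755", "cpu_req"),
]
port_library = {
    "SRAM_A": [("sram_addr", "input", 20), ("sram_data", "inout", 64)],
    "MBI_SRAM": [("sram_addr", "output", 20), ("sram_data", "inout", 64),
                 ("cpu_req", "input", 1)],
}
verilog, lwpm, lp2 = unit_gen(["SRAM_A", "MBI_SRAM"], "BAN_A", wire_library, port_library)
print(verilog)
```

Because `CBI_MPC755` is not in the module list passed here, the `cpu_req` port of `MBI_SRAM` has no partner inside the unit and ends up in LP2 as an external port, mirroring the role of line 26 described above.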
4) Bus Subsystem Generation:
Here, we explain bus-subsystem generation. In order to generate bus subsystem(s), we use the algorithm BusSubSys shown in Fig. 16 . Input arguments of the algorithm are an array of module names in each BAN of each bus subsystem, an array of BAN names in each bus subsystem, the number of bus subsystems, an array of number of modules in each BAN of each bus subsystem, an array of number of BANs in each bus subsystem, an array of parameters that specify the properties of each module, and, finally, a pointer to the module library. Example 10 shows an example of the arguments in the pseudocode shown in Fig. 16 .
Example 10: Input Arguments in BusSubSys Algorithm: Consider the case where we generate the BFBA bus subsystem in Fig. 4. The array of module names for each BAN is as follows: "MPC755 A," "CBI MPC755," "SRAM A," "MBI SRAM," "HS REGS," "BI-FIFO A" for BAN A; "MPC755 B," "CBI MPC755," "SRAM B," "MBI SRAM," "HS REGS," "BI-FIFO B" for BAN B; "MPC755 C," "CBI MPC755," "SRAM C," "MBI SRAM," "HS REGS," "BI-FIFO C" for BAN C; and "MPC755 D," "CBI MPC755," "SRAM D," "MBI SRAM," "HS REGS," "BI-FIFO D" for BAN D. The array of BAN names is "BAN A," "BAN B," "BAN C," "BAN D." The number of bus subsystems is "1," the array of the number of modules is "6," "6," "6," "6," and the array of the number of BANs is "4." To specify 32 MB total of SRAM and a BI-FIFO with 1024-depth and 64-bit width, the array of parameters is "20," "64," "10," "64" for each of the four BANs. Here, "20," "64" gives the widths of the address and data buses of the SRAM in each BAN, respectively, and "10," "64" gives the depth (as a 10-bit address width, i.e., 1024 entries) and the data bus width of the BI-FIFO in each BAN, respectively.
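To make the shape of these arguments concrete, the sketch below builds them for the single-subsystem BFBA case and checks the two figures quoted in the example (32 MB of SRAM in total and a Bi-FIFO depth of 1024). It is an illustration in Python only: the function name, the dictionary keys, and the nesting of the arrays (per subsystem, then per BAN) are assumptions, not the interface of Fig. 16.

```python
def bfba_arguments(ban_letters="ABCD", sram_addr_bits=20, fifo_addr_bits=10, data_width=64):
    """Build BusSubSys-style arguments for a single BFBA bus subsystem."""
    module_names, parameters = [], []
    for x in ban_letters:
        module_names.append(
            [f"MPC755 {x}", "CBI MPC755", f"SRAM {x}", "MBI SRAM", "HS REGS", f"BI-FIFO {x}"]
        )
        # (SRAM address width, SRAM data width), (Bi-FIFO address width, data width)
        parameters.append([(sram_addr_bits, data_width), (fifo_addr_bits, data_width)])
    return {
        "module_names": [module_names],  # per subsystem, per BAN
        "ban_names": [[f"BAN {x}" for x in ban_letters]],
        "num_subsystems": 1,
        "num_modules": [[len(m) for m in module_names]],
        "num_bans": [len(ban_letters)],
        "parameters": [parameters],
    }


args = bfba_arguments()
# Total SRAM: one SRAM per BAN, 2**addr_bits words of data_width / 8 bytes.
sram_bytes = sum((1 << a) * (w // 8) for ban in args["parameters"][0] for (a, w) in ban[:1])
# Bi-FIFO depth follows from its 10-bit address width.
fifo_depth = 1 << args["parameters"][0][0][1][0]
print(args["num_modules"], args["num_bans"])
print(sram_bytes // 2**20, "MB of SRAM in total; Bi-FIFO depth", fifo_depth)
```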
BANs in bus subsystem(s) are generated by calling UnitGen in line 17 after modules in each BAN are extracted from the module library as shown in lines 5-16. Then, the bus subsystem(s) are generated by connecting (choosing the appropriate wires for) the generated BANs via a call to UnitGen in line 19. Example 11 shows the generation of a sample BAN, and Example 12 shows how a bus subsystem is generated by BusSubSys shown in Fig. 16 .
Example 11: BAN Generation: BusSubSys first takes arguments as shown in Example 10. For BAN A of BFBA shown in Fig. 4, the required modules are as follows: MPC755, MBI SRAM, HS REGS, CBI MPC755, SRAM A, and BI-FIFO. In lines 5 to 16 in Fig. 16, BusSubSys extracts four modules (MPC755, MBI SRAM, HS REGS, and CBI MPC755) from the module library, and the last two modules (SRAM A and BI-FIFO) are generated with parameters taken from the array of parameters, which is one of the input arguments. In other words, SRAM A is generated with a 20-bit address bus width and a 64-bit data bus width, and BI-FIFO is generated with a 10-bit address bus width and a 64-bit data bus width. (Note that we assume standard tools from companies such as Synopsys [27], Artisan [28], and Virage Logic [29] are available.) Then, in line 17, BusSubSys calls UnitGen together with the name of the hardware unit to be generated and an array of module names that contains MPC755, MBI SRAM, HS REGS, CBI MPC755, SRAM A, and BI-FIFO. After the procedure shown in Example 9, UnitGen finally writes Verilog HDL code describing BAN A.
Example 12: Bus Subsystem Generation: To generate BFBA bus subsystem shown in Fig. 4 (which is also a bus system), BusSubSys takes as input the arguments shown in Example 10. As shown in Fig. 4 , the bus subsystem to be generated is composed of BANs A, B, C and D that are generated in the same fashion as shown in Example 11. Then, BusSubSys calls UnitGen in line 19 of Fig. 16 to generate the bus subsystem, and UnitGen finally instantiates generated BANs A, B, C, and D and wires them together by writing Verilog HDL code describing the bus subsystem.
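The two-level flow of Examples 11 and 12 (generate every BAN, then wire the BANs into a bus subsystem) can be sketched as below. This is a hedged illustration in Python rather than the pseudocode of Fig. 16: the argument layout, the `SUBSYS_<n>` naming, the dictionary standing in for the module library, and the `unit_gen` callback (here a stub that only records its calls) are all hypothetical.

```python
def bus_sub_sys(module_names, ban_names, parameters, module_library, unit_gen):
    """Generate every BAN, then each bus subsystem, via unit_gen."""
    subsystem_names = []
    for s, bans in enumerate(ban_names):
        for b, ban in enumerate(bans):
            params = iter(parameters[s][b])
            for module in module_names[s][b]:
                # Lines 5-16: take fixed modules from the library; build
                # parameterized ones (memories, Bi-FIFOs) from the parameters.
                if module not in module_library:
                    addr_bits, data_width = next(params)
                    module_library[module] = ("generated", addr_bits, data_width)
            unit_gen(module_names[s][b], ban)  # line 17: one BAN
        name = f"SUBSYS_{s}"
        unit_gen(bans, name)  # line 19: wire the BANs together
        subsystem_names.append(name)
    return subsystem_names


calls = []  # the stub unit generator records the top-level names it is asked for
library = {m: ("predesigned",) for m in ("MPC755", "CBI_MPC755", "MBI_SRAM", "HS_REGS")}
modules = [[["MPC755", "CBI_MPC755", f"SRAM_{x}", "MBI_SRAM", "HS_REGS", f"BI_FIFO_{x}"]
            for x in "ABCD"]]
bans = [[f"BAN_{x}" for x in "ABCD"]]
params = [[[(20, 64), (10, 64)] for _ in "ABCD"]]
result = bus_sub_sys(modules, bans, params, library,
                     lambda names, top: calls.append(top))
print(calls)
```

Passing the unit generator in as a callback keeps the sketch self-contained; in the flow described in the text, both calls go to the same UnitGen routine, first with module names and then with BAN names.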
5) Bus System Generation:
We now describe the generation of a bus system. The generation is carried out after the generation of any necessary bus subsystem(s) as shown in Section V-B4, since the generated bus subsystem(s) is (are) integrated into a bus system. Fig. 17 shows the pseudocode (BusSys) for the bus system generation. First, BusSys takes three arguments: an array of bus subsystem names that specify the bus subsystems in a bus system, an array of the names of the bus bridges that connect the bus subsystems, and a module library. As shown in line 3, BusSys is performed only if a bus system has multiple bus subsystems. The reason is that a bus subsystem becomes a bus system if the user wants a single bus architecture for the entire chip instead of multiple bus architectures in the SoC. Otherwise, a bus system is formed by connecting the generated bus subsystems through bus bridges. The module(s) that connect multiple bus subsystems (e.g., a bus bridge or a first-in first-out (FIFO) memory) are extracted from the module library in lines 4 to 9 of Fig. 17; then, in line 12, BusSys calls UnitGen to integrate the bus subsystems and the modules that connect them.
TABLE II: EXAMPLE OF THE NUMBERS IN TABLE I
As we have explained throughout this section, BusSynth can generate modules as well as perform a syntactic translation from high-level input descriptions based on the user options to output synthesizable Verilog HDL code for a multiprocessor SoC.
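The bus system generation step described above (skip integration for a single bus subsystem, otherwise fetch the connecting modules and call the unit generator once) can be sketched in a few lines. This is a minimal Python illustration under assumptions: the function signature, the `BUS_SYSTEM` default name, the set standing in for the module library, and the `unit_gen` callback are hypothetical and are not taken from Fig. 17.

```python
def bus_sys(subsystem_names, bridge_names, module_library, unit_gen, top="BUS_SYSTEM"):
    """Integrate bus subsystems into one bus system; return the top unit name."""
    if len(subsystem_names) < 2:
        # Line 3: a single bus subsystem already is the bus system.
        return subsystem_names[0]
    # Lines 4-9: the connecting modules must exist in the module library.
    missing = [b for b in bridge_names if b not in module_library]
    if missing:
        raise KeyError(f"not in module library: {missing}")
    # Line 12: let the unit generator wire subsystems and bridges together.
    unit_gen(list(subsystem_names) + list(bridge_names), top)
    return top


calls = []
record = lambda names, top_name: calls.append((top_name, names))

# A SplitBA-like case: two bus subsystems joined by one bus bridge (BB).
split_top = bus_sys(["SUBSYS_0", "SUBSYS_1"], ["BB"], {"BB"}, record)
# A BFBA-like case: the only bus subsystem becomes the bus system.
single_top = bus_sys(["SUBSYS_0"], [], set(), record)
print(split_top, single_top, calls)
```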
6) Computational Complexity of Bus-Synthesis Algorithm:
Now, we consider the computational complexity of the bus-synthesis algorithm, which shows how it scales with increasing numbers of bus subsystems, BANs, modules, ports, and wires. The BusSynth algorithm is shown on the left-hand side of Fig. 14 and consists of calls to UnitGen for BAN generation, BusSubSys for bus subsystem generation, and BusSys for bus system generation. We define several variables in Table I that are related to the computational complexity, and Table II shows example values of the variables of Table I in our case.
We first consider the computational complexity of the UnitGen algorithm shown in Fig. 15 for each case of BAN, bus subsystem, and bus system generation. In the case of BAN generation, the upper bounds of each routine in the algorithm are shown in Table III. Because the routines of Cases 3 and 9 are performed sequentially, the complexity of the algorithm is the worse of the two, as shown in Case 10 of Table III. Therefore, the computational complexity of UnitGen in the case of BAN generation is the bound given in Case 10 of Table III. Analogous bounds hold for UnitGen in the cases of bus subsystem and bus system generation.
We now consider the computational complexity of BusSynth when BusSubSys (Fig. 16) and BusSys (Fig. 17) are executed sequentially as shown in the flowchart of Fig. 14. Table IV shows the upper bounds of each routine in the BusSubSys algorithm shown in Fig. 16; Case 9 of the table gives the upper bound of the whole algorithm. The upper bounds of each routine in the BusSys algorithm shown in Fig. 17 are shown in Table V; the upper bound of the whole BusSys algorithm is Case 4 in Table V. Since the two algorithms are executed sequentially, the overall complexity of BusSynth is O(max[Case 9 in Table IV, Case 4 in Table V]). This bound may appear high. However, the values of the variables involved are highly constrained in realistic problems, as shown in Table II. For that reason, as shown in Table IX of Section VI-B, BusSynth takes only a second or less to generate our examples in the experimental environment described in Section VI-B.
The main point of note is that while our algorithms have nontrivial polynomial time complexities, they are applied to situations with integers in the ten to one thousand range (as opposed to billions or more). For example, in our practical case described in the next section, the number of "legal" wires in our wire library is 686 for 35 modules, 445 for 23 modules, and 369 for 17 modules. While the number of all possible wires between modules, including all "legal" and "illegal" combinations, would clearly scale exponentially as the number of modules increases, the actual numbers of "legal" wires and modules scale roughly linearly with each other. Thus, we posit that in most practical cases, the number of required "legal" wires scales in such a way that the algorithms described in this section complete in seconds or less, as shown in all cases in the following section of experimental results.
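The roughly linear relationship can be checked directly from the three data points quoted above. The short Python calculation below uses only those figures; the choice of the two statistics shown (wires per module, and the average growth between the smallest and largest case) is ours.

```python
# (number of modules, number of legal wires), as quoted in the text
samples = [(17, 369), (23, 445), (35, 686)]

ratios = [wires / modules for modules, wires in samples]
for (modules, wires), r in zip(samples, ratios):
    print(f"{modules:2d} modules, {wires} wires, {r:.1f} wires per module")

# Average growth in wires per added module between the smallest and largest case.
(m0, w0), (m1, w1) = samples[0], samples[-1]
slope = (w1 - w0) / (m1 - m0)
print(f"{slope:.1f} additional wires per additional module")
```

In all three cases the wire library holds about 19 to 22 legal wires per module, so doubling the module count roughly doubles the number of wire records that UnitGen has to scan.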
VI. EXPERIMENTAL RESULTS

A. Application Examples
Five kinds of bus architectures for a multiprocessor SoC were generated using BusSynth and then simulated to evaluate performance with three applications: a database example [30] , which is composed of a server task and forty client tasks; an MPEG2 decoder [31] , [32] ; and an orthogonal frequency division multiplexing (OFDM) transmitter [33] , which is used in wireless communications. Details of each application are in [38] and [39] . Fig. 18 describes the computation performed by each processor according to application programming styles: pipelined parallel algorithm (PPA) and functional parallel algorithm (FPA). Here E, F, G, and H in Fig. 18 indicate function groups of an application. With the styles, we can explore how the programming styles affect performance as shown in Section VI-C.
B. Experimental Environment
As shown in Fig. 19 , BusSynth takes the user input as described in Section V-B and outputs synthesizable Verilog HDL code for the specified custom bus system. For the bus system simulation, we use Seamless CVE, a hardware/software coverification tool, and X-Ray debugger from Mentor Graphics [35] together with VCS, a Verilog HDL simulator from Synopsys [36] . In order to synthesize the Verilog HDL code to logic gates, we use the Synopsys Design Compiler. For this environment, we use a Sun workstation Ultra 60 having two 450-MHz Ultra-SPARC II processors and 2 GB of memory.
In this experiment, we set up four MPC755s in Seamless CVE; each BAN has one MPC755 with 100 MHz external clock, SYSCLK. The maximum frequency of SYSCLK, which dictates the maximum bus speed, is limited to 100 MHz in the PowerPC Hardware Specification (note that the internal clock speed can be much faster, e.g., 500 MHz) [37] . However, our results are equally applicable to much faster bus clock speeds. Note that the MPC755 instruction set simulator (ISS) provided by Seamless CVE is instruction accurate, not cycle-accurate; nonetheless, external (noncache) memory accesses are cycle accurate. In short, we have a bus functional simulation setup with cycle accuracy for all bus transactions.
C. Comparison of Results
With the generated bus systems (shown in Figs. 3-7) and hand-designed examples of CCBA and GGBA (shown in Figs. 8 and 9), we evaluate the performance and verify the operation of each bus system with an OFDM transmitter, an MPEG2 decoder, and a database example. Note that many partitions of tasks to PEs were tried; we report only the best results obtained (i.e., the best partition found by hand for the given bus architecture). Each bus system has a total of 32 MB of non-L1 cache memory, and each processor (MPC755) embedded in each bus system has 32 KB of L1 I-cache and 32 KB of L1 D-cache. Table VI shows the results of our evaluation using an OFDM transmitter that in our example has 922 lines of C code for the algorithm implementation and 696 lines of assembly code for processor runtime initialization and APIs. The operation of BFBA and GBAVI is well matched to the PPA style because BFBA and GBAVI only have data transfer mechanisms between BANs instead of a memory shared among all BANs. SplitBA is composed of two bus subsystems connected with a bus bridge, and the two bus subsystems operate independently. Therefore, in SplitBA, it is more reasonable to use the FPA style. SplitBA (Case 7 in Table VI) using the FPA style shows the best performance among the bus systems in our example: OFDM transmission reaches a rate of 5.1132 Mbps, 16.44% faster than GGBA, which we take as representative of a typical commercial bus. We can see in Table VI that the throughput of each bus system is significantly affected by the bus type and the programming style (PPA versus FPA):
1) In software-programming style, FPA beats PPA in the OFDM transmitter application (e.g., Case 3 versus 4, Case 5 versus 6 and Case 8 versus 9 in Table VI ). The reason is that, for OFDM, FPA balances the computational load better than PPA does. 
2) Bus systems using a shared memory for program and local data (e.g., GGBA) require more memory arbitration time than do bus systems having separate local memories for program and local data in each BAN (e.g., GBAVIII). This arbitration time difference explains why GBAVIII outperforms GGBA.
3) SplitBA relieves the bus traffic congestion caused by shared memory requests from each BAN. The reason is that the bus system has a split bus architecture, and thus the arbiter in each bus subsystem deals with only half of the total memory requests from the BANs. For this reason, SplitBA beats GGBA in our example (Case 7 versus 8).
4) A fast data-transfer method between BANs, such as the Bi-FIFO of BFBA and the Bi-FIFO of Hybrid, contributes to the performance improvement observed for the PPA style (e.g., Case 1, Case 6, Case 4, Case 9, and Case 2, ordered by throughput).
Our MPEG2 decoder application has 8788 lines of C code for its algorithm and 697 lines of assembly code for initialization routines and APIs. Because the large number of global variables in our MPEG2 decoder code requires significant global memory interaction, we could only use FPA effectively. The results in Table VII show that Hybrid and GBAVIII outperform CCBA due to faster arbitration in data read operations (three cycles as compared to five in CCBA). In Table VII, BFBA and GBAVI perform poorly because the data to be processed in each BAN has to be passed from BAN A to each BAN sequentially. Note that Hybrid, generated by BusSynth, outperforms CCBA by 15.54% in this example.
In the database-application example, for multithreaded operation, we employ the Atalanta RTOS [34], which requires a shared memory. We can support the use of the RTOS in GBAVI and BFBA; however, in this paper, we do not simulate these bus systems with this application because the current versions of these bus systems do not have such a shared memory. Furthermore, this application uses only a shared memory, without local memories, for data transactions between the server and the clients. Therefore, since this example uses neither a Bi-FIFO bus nor local memories, bus systems having a global memory and a single global bus (e.g., GBAVIII, Hybrid, and GGBA) have almost exactly the same performance (within 0.1%) in this example because they use the same bus components. For that reason, we use one of these bus systems, GGBA (see Fig. 9), as the baseline for performance comparison and compare it only with SplitBA (see Fig. 7) in this application. The performance of SplitBA is improved over GGBA for the following two reasons. First, SplitBA has a better bus topology than GGBA [i.e., a split global bus connected by a bus bridge (BB)], and, thus, bus traffic due to the shared memory requests is lessened. Second, SplitBA has a shared bus architecture in each bus subsystem so that all clients can easily access object data from the server.
This example has a total of 1700 lines of C code for the database algorithm and runs on top of the Atalanta RTOS. A total of forty-one tasks are executed for the clients and the server; BAN A in Fig. 7 has one server task and ten client tasks, and the other BANs in the figure each have ten client tasks, where each task transfers one hundred words (32 bits per data word) to or from a shared memory in each bus system. In the experiment of the database example shown in Table VIII, SplitBA (Case 16 in Table VIII) outperforms GGBA (Case 15 in Table VIII) with a 41% reduction in application execution time.
Table IX shows the generation time for the bus systems generated using BusSynth. Table IX also shows the gate counts of the bus system logic after synthesizing the logic using the LEDA TSMC 0.25-µm standard cell library with the Synopsys Design Compiler. Since our goal is cycle-accurate hardware/software cosimulation, we do not include layout parameters such as wire area in our area estimates. Thus, after using our tool, extra work is required to obtain layout-accurate area and timing estimates for the final chip implementation. BusSynth can generate a bus system having any number of processors, but the table shows bus systems having a maximum of 24 processors. As the generation time column shows, each bus system in Table IX takes less than one second to generate using BusSynth. In our experience, porting GGBA or CCBA to our application examples, on the other hand, took about one week. The week was spent understanding the signal functions of the processors and modeling the required modules and their interfaces. Note that the bus systems generated by BusSynth achieve performance superior to the hand designs of GGBA and CCBA; furthermore, the user-specified custom bus architecture is designed in a matter of seconds instead of weeks.
The major benefit is therefore fast design space exploration of bus architectures across performance-influencing factors such as bus types, processor types, and software programming style, which results in a system with higher performance. This goal is accomplished through BusSynth, which allows the user to easily design a custom bus system in a matter of seconds.
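The two headline figures of this section are stated in different terms: the OFDM result is a throughput gain (16.44% faster) and the database result is a reduction in execution time (41%). The small Python conversion below puts them on a common footing. It is a sketch under one assumption, namely that "16.44% faster" means a throughput ratio of 1.1644 for the same amount of work; the function names are ours.

```python
def speedup_from_time_reduction(reduction):
    """Speedup factor implied by a fractional reduction in execution time."""
    return 1.0 / (1.0 - reduction)


def time_reduction_from_speedup(speedup):
    """Fractional reduction in execution time implied by a speedup factor."""
    return 1.0 - 1.0 / speedup


database_speedup = speedup_from_time_reduction(0.41)   # SplitBA versus GGBA
ofdm_time_saving = time_reduction_from_speedup(1.1644)  # SplitBA versus GGBA

print(f"database example: about {database_speedup:.2f}x speedup")
print(f"OFDM transmitter: about {ofdm_time_saving:.1%} less time per bit")
```

Under this reading, the database example corresponds to a speedup of roughly 1.7 times, while the OFDM gain corresponds to roughly 14% less time for the same amount of transmitted data.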
VII. CONCLUSION
In this paper, we have described a methodology to generate custom bus systems for multiprocessor SoC designs. We designed a bus-synthesis tool BusSynth by exploiting this methodology. Using BusSynth, we have generated five different bus systems as examples: BFBA, GBAVI, GBAVIII, Hybrid, and SplitBA. The algorithms have been described in significant detail and have been shown to finish in reasonable time (under a second) in the practical cases shown. In Section VI, the bus systems are evaluated according to their performance and are verified in operation with three applications: an OFDM transmitter, an MPEG2 decoder and a database example. Our methodology gives us a great benefit in fast-design space exploration of bus architectures across performance influencing factors such as bus types and software-programming style. We showed that BusSynth achieves performance superior to the hand design of a simple GGBA and CCBA, but in a matter of seconds instead of weeks for the hand design. In particular, we show up to 41% reduction in application execution time with a customized bus architecture. 
