As technology scales toward deep submicron, the integration of complete system-on-chip (SoC) designs consisting of a large number of Intellectual Property (IP) blocks (cores) on the same silicon die is becoming technically feasible. Until recently, design-space exploration for SoCs has focused mainly on the computational aspects of the problem. However, as the number of IP blocks on a single chip and their performance continue to increase, a shift from computation-based to communication-based designs becomes mandatory. As a result, the communication architecture plays a major role in the area, performance, and energy consumption of the overall system [1, 2]. This article presents the structure of a wrapper as a component of a Code Division Multiple Access (CDMA) based shared bus architecture in a SoC. Two types of wrappers can be identified, master and slave. A master wrapper is located between the arbiter and the CDMA coded physical interconnect, while a slave wrapper connects the CDMA coded bus with a memory/peripheral module. In the proposal, only bus lines that carry address and data signals are CDMA coded. We implemented a master-slave wrapper pair described in VHDL and confirmed its functionality using testbenches. We also synthesized the wrappers using Xilinx Spartan and Virtex devices to determine resource requirements with respect to the number of equivalent gates, communication bandwidth, latency, and power consumption. Specifically, we introduced a Design_Quality (DQ) metric for wrapper performance evaluation. A master-slave wrapper pair appears to occupy an acceptable area, on average 2000 equivalent gates; considering a CPU cost of about 30000 gates, this is less than 8% hardware overhead per CPU. We also present experimental results which show that the benefits of CDMA coding relate both to decreasing the number of bus lines and to accomplishing simultaneous multiple master-slave connections at relatively low power consumption and high communication bandwidth.
Convenient range indices $R_W$ and $R_R$ are used to determine the data transfer rate for Write and Read operations in multiprocessor bus systems that use TDMA and CDMA data transfer techniques. The obtained results show that the increased data transfer latencies introduced by CDMA data transfer are compensated by simultaneous master-slave transfers.
Introduction
A major trend in modern SoC designs is integrating numerous homogeneous and heterogeneous system components, i.e. IP cores, onto a single chip. The increasing number of system components leads to rapidly growing on-chip communication bandwidth requirements. An IP core such as a CPU may compute very fast, but if the instructions and data do not reach it in time, it simply has to wait. Consequently, the on-chip communication architecture has become the bottleneck for improving system-level performance, and as a very important SoC constituent it requires special design attention [1, 2]. data transfer connections. On the other hand, the main disadvantage is the increased latency of Read and Write processor cycles.
The rest of this paper is arranged as follows. Section 2 discusses a communication architecture topology based on a traditional shared system bus and the interfacing of IP cores into a SoC. A multiprocessor system based on a CDMA shared system bus and wrappers is proposed in Section 3. The bus wrapper structure, which includes the CDMA encoding and decoding scheme as well as the timing of Read and Write operations over the CDMA coded bus, is given in Section 4. Design dilemmas which relate to the economical aspect of SoC design and fabrication, the ASICs vs FPGAs choice, and quality of service (QoS) as a measure of performance in data transfer are discussed in Section 5. The experimental setups needed to examine the performance of the proposed solution are given in Section 6. Experimental results are presented in Section 7. Finally, conclusions are drawn in Section 8.
Shared on-chip communication architecture and interfacing IP blocks into SoC using wrappers
The system bus is the simplest example of shared communication architecture topology and is commonly found as the most popular integration choice in many commercial SoC designs today [4, 5] .
At a given moment, standard bus architectures allow a single master-slave connection. The bus allocation in single master-slave connections is determined by an arbitration protocol implemented in the arbiter's logic. However, by using the CDMA data transfer technique, which is based on the concept that each master-slave set can use its own unique code subset, it is possible to realize multiple master-slave data transfers simultaneously over a shared bus. In order to implement this approach, we need to: i) modify the arbiter's hardware (by increasing its complexity); and ii) make minor modifications to the wrapper structure.
One of the major problems that most designers encounter during the phase of integrating IP blocks into SoC relates to the interfacing of IP blocks that use different communication protocols. To integrate heterogeneous IP cores, wrappers are widely used [9, 10, 11] . A wrapper is a layer of logic that surrounds the IP core and forms the interface between the core and its SoC environment. In other words, the wrapper logic encapsulates IP core and converts its signaling protocol to a standardized interface protocol. Within a SoC, on-chip bus architectures can be classified into: a) standard buses [9] ; and b) wrapper-based buses [1, 9, 10] .
In our proposal we implement a wrapper logic in order to: i) achieve data transfer from IP cores that use different protocols; and ii) support simultaneous multiple master-slave connections over a shared bus.
Multiprocessor system based on CDMA shared system bus
During the last decade there has been pronounced interest in using efficient (high-bandwidth) communication protocols to meet the interconnect needs of IP cores within the SoC. One such promising technique is CDMA [2, 12, 13, 14, 15, 16] . CDMA is a spread spectrum technique which encodes information prior to transmission onto a communication medium, permitting simultaneous use of the medium by separate information streams. The basic idea of this technique is that interconnect wiring can be drastically reduced by using CDMA encoding and an appropriate interconnection strategy. CDMA technology relies on the principle of codeword orthogonality, such that when multiple code-words are summed, they do not interfere completely with each other at every point in time and can be separated without loss of information [13, 14] .
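The orthogonality principle can be illustrated with a minimal sketch. It assumes textbook bipolar (+1/-1) CDMA with two hand-picked 4-chip codes and a channel that carries the chip-wise sum; the names `CODE_A`, `CODE_B`, `spread`, `despread` and `transfer` are hypothetical and this is not the wrapper circuit described in this paper.

```python
# Illustrative textbook CDMA on one shared line. The 4-chip codes and the
# bipolar (+1/-1) signalling are assumptions, not the paper's wrapper logic.

# Two mutually orthogonal 4-chip spreading codes (their dot product is 0).
CODE_A = [1, -1, 1, -1]
CODE_B = [1, 1, -1, -1]


def spread(bit, code):
    """Map bit 0/1 to -1/+1 and multiply it by every chip of the code."""
    symbol = 1 if bit else -1
    return [symbol * chip for chip in code]


def despread(channel, code):
    """Correlate the summed channel with one code to recover that sender's bit."""
    correlation = sum(s * c for s, c in zip(channel, code))
    return 1 if correlation > 0 else 0


def transfer(bit_a, bit_b):
    """Send two bits at once: the shared medium carries the chip-wise sum."""
    channel = [a + b for a, b in zip(spread(bit_a, CODE_A), spread(bit_b, CODE_B))]
    return despread(channel, CODE_A), despread(channel, CODE_B)


if __name__ == "__main__":
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            print((bit_a, bit_b), "->", transfer(bit_a, bit_b))
```

Because the two codes are orthogonal, correlating the summed channel with one code cancels the other sender's contribution, so both bits are recovered from a single shared line.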
In order to develop a solution for wide range of embedded applications which requires low cost, IP core reusability, efficient core interfacing, multiple master-slave connections, and moderate communication performance, we propose a CDMA coded wrapper-based SoC interconnect as an efficient solution which can be used on a complex chip.
When a CDMA technique is implemented on the standard multiprocessor system presented in Fig. 1a), we obtain the scheme given in Fig. 1b). In order to simplify the schematic presentation given in Fig. 1b) and make the discussion clear, we will assume that the multiprocessor system consists of two local computers, CPU 1 and CPU 2, and two shared memory or peripheral modules, MEM 1/PER 1 and MEM 2/PER 2. By comparing the structures sketched in Fig. 1a) and Fig. 1b), we observe the following differences:
a) The Standard Shared System Bus, SSSB (see Fig. 1a)), is substituted with a CDMA Shared System Bus, CSSB (see Fig. 1b)).
b) The master bus wrapper, BW_CPU, converts data and address signals from the CPU 1 and CPU 2 master modules into CDMA coded bus signals. In all standard solutions based on the AMBA bus [17], CoreConnect [18], STBus [19], etc., the data transfer protocol over the system bus is mainly defined by the timing (signaling) of the Control_bus, i.e. not by the timing of the Address_bus and Data_bus. This fact allows us to apply our proposal, with minimal modifications, to any already developed bus protocol such as those valid for the AMBA bus [10, 11], CoreConnect [1], etc. Namely, only signal lines that are used for address and data transfer are CDMA coded, while signal lines that belong to the Control_bus remain unchanged. In our case, the unmodified control bus signals that belong to the Binary_Coded_Buses, BCB 1 and BCB 2, are transferred over the Control CDMA bus.
c) The Bus_Arbiter, BA, given in Fig. 1b) is realized using the following two building blocks: Arbiter_Switching_Logic, ASL, and Arbiter_Control_Logic, ACL. According to the implemented algorithm for bus priority assignment, the ACL's output, Switch, defines which CPU bus (consisting of the DATA CPU, ADR CPU and Control CPU buses) will drive a corresponding BCB 1
CDMA wrapper structure
As is pictured in Fig. 1b) , IP cores are connected to the CDMA coded bus through two types of wrappers, a master bus wrapper, BW_CPU, and a slave bus wrapper, BW_MEM/PER. Both the slave wrapper, BW_MEM/PER, and the master wrapper, BW_CPU, with minor differences, are of almost identical hardware structure.
In general, the wrapper logic is organized around three information flows, called data-path, address-path, and control-path. Each wrapper connected to a classical on-chip bus implements protocol conversion logic which is used as an interface between the internal logic of the IP core and the on-chip bus. In order to improve IP core reusability and remain protocol compliant with any standard on-chip bus, we decided to apply CDMA coding to the data and address wrapper transfer paths only. In this manner, a standard bus protocol [1, 20] or a standard component protocol [1, 21] can be selected by CDMA wrapper designers without restrictions.
The two main reasons why we implement a CDMA bus based wrapper in FPGA technology are the following. Firstly, Platform-FPGA or Field-Programmable SoC architectures with immersed coarse-grain CPUs, embedded memories, and special function IP cores are now practical and commercially available, which offers the potential for immense computing power as well as opportunities for rapid embedded system prototyping. Such architectures promise the flexibility of traditional general-purpose processors while also providing the efficiency and high performance of ASICs. An example is the Xilinx Virtex4 family of FPGAs, which integrates on the same IC up to two PowerPC405 processors and up to 200 000 programmable logic cells [22]. Secondly, managing the complexity and tapping the full potential of these FPGA based architectures presents many challenges. One of the most daunting is how to efficiently realize the on-chip interconnect. On-chip communication is a well-known problem that has been addressed many times in the technical literature on SoC architectures [1, 8]. Extensive research has been performed on single master-slave on-chip bus based interconnects, as evidenced by the large number of arbitration protocols [23, 24].
However, these approaches face difficulties in dealing with simultaneous multiple master-slave connections over a single shared on-chip bus. In our opinion, the CDMA based bus architecture represents a promising solution for on-chip communication challenges, especially in FPGA based SoC designs when it is necessary to make a compromise between the latency and bus width.
Related works on bus-based wrappers
Several different bus-based wrapper architectures, intended for various types of applications, are already described in [9, 10, 11, 25, 26, 27] .
In [9] a general purpose wrapper based bus for SoC design is described. In addition, wrapper implementation techniques called write buffer switching and slave designated retry control with livelock avoidance scheme are discussed. In [10] the concept of pre-fetching data into register copies added to the wrapper in order to reduce or even eliminate the performance overhead associated with wrapper, while still obeying the Virtual Component Interface, VCI, standard is explained. In [11] a bus wrapper design methodology with interface protocol conversion is considered. Using this methodology it is possible to convert the different interface and different protocol using system design method. In [25] an interface wrapper architecture which provides a generally applicable architecture that can provide support to component and interface evolution, diminishing the potential exponential effects of such changes is described. An overall architecture of a SoC with N cores, each wrapped by an IEEE 1500 wrapper is presented in [26] . Wrapper generation tools using a methodology based on assembling of library components in order to produce a Register Transfer Level, RTL, architecture is described in [27] .
With respect to the standard wrapper bus-based implementations described in [9, 10, 11, 25, 26, 27], the main intent of our design proposal is efficient data and address transmission using a CDMA technique. This allows us to upgrade the wrapper architecture while retaining both its function and its protocol conversion logic almost unchanged. With this in mind, in the sequel we explain the principle of CDMA coding and the structures of the wrapper's building blocks that perform this activity, using one relatively simple wrapper architecture.
Description of wrapper structure
In general, the hardware structures and principles of operation of the BW_CPU and BW_MEM/PER are similar. Therefore, in what follows, we limit our discussion to the BW_CPU only. Further, by ignoring the switching, we assume that the BA logic does not involve any bus signal modifications, i.e. CPU 1_bus or CPU 2_bus are identical to BCB 1 or BCB 2. The global structure of the BW_CPU is given in Fig. 2. At first, we classify its interface signals into the following six parts:
1. Communication protocol signals -for a corresponding CPU these signals identify: a) a type of the current cycle such as instruction fetch, operand fetch, interrupt, execution, etc.; b) the valid information currently present on address and data lines; and c) an instant when data transfer can start.
2. Command signals - indicate the type of current bus activity. and different in respect to solutions presented in [10, 11, 25, 26] relates to the implementation of CDMA encoder and decoder blocks (DED and AE). Namely, instead of the classical non-coded data and address bus transfer (see Fig. 1a)), DED and AE provide CDMA coded data and address bus transfer (see Fig. 1b)).
CDMA coded bus transfer operations
We explain the operation of a CDMA coded wrapper-based bus through the execution of CPU Read and Write cycles (see Fig. 3). The CPU 1/2 IP core issues requests and the MEM/PER 1/2 IP core receives them. states. In addition, the BW_CPU converts the address from binary to CDMA form and sends it via the CSSB to the BW_MEM/PER wrapper. The BW_MEM/PER wrapper decodes the CDMA coded address and at instant $t_2$ drives the MEM/PER IP core with a binary coded address via the BCB_ADR bus. After the access time, $t_{ACC}$, has expired, at instant $t_3$, the MEM/PER IP core sets its data lines to valid states. The BW_MEM/PER wrapper encodes the binary coded data into CDMA coded form and forwards it via the CSSB back to the BW_CPU. The BW_CPU decodes it and at $t_4$ passes the CPU_DATA signals to the CPU IP core. Additionally, at $t_4$ the BW_MEM/PER generates a signal RDY_deactivate by which it signals to the CPU IP core to deassert the wait state period, $t_w$. At $t_5$ the CPU IP core terminates its Read cycle.
In Fig. 3b) a corresponding scenario for a Write cycle is sketched. The timing of this cycle is simpler than that of the Read cycle. The main difference is the following: during the first part of a Write cycle, at instant $t_1$, the CPU IP core generates an address (asserts the CPU_ADR bus), while at instant $t'_1$ it generates binary coded valid data (asserts the CPU_DATA bus). Both CPU address and data bus signals, after passing through the BA logic, drive the BW_CPU, which converts them into CDMA coded signals. All other timing (signaling) details are similar to the ones given in Fig. 3a). By analyzing the signaling scenario presented in Fig. 3 we can conclude the following: a) Under the assumption that time intervals $t_a$, $t_b$ and $t_{ACC}$ are identical for both the standard binary coded bus (Fig. 1a)) and the CDMA coded bus (Fig. 1b)), and that signal propagation through the BA logic is $t_{BA}$, the latencies for Read and Write processor cycles in a CDMA based bus transfer (Fig. 1b)) are higher. b) In comparison to the system given in Fig. 1a), the latency of a Read cycle for a system depicted in Fig. 1b) is time intervals higher, while for Write cycle it is time intervals larger, where v -corresponds to a spreading code width, while -to a chipping rate.
Design dilemmas
The choice of technology characteristics is crucial to meeting the design goals of the wrapper logic. Until now, designers of high-speed systems were mainly concerned with performance and area, but current and future designs must meet the triple constraints of power, performance, and area. In addition, other attributes such as implementation technology and quality of service (QoS) are very important in determining the wrapper's performance and its flexibility to be applied in diverse SoC designs. In general, an ideal design solution does not exist; an acceptable variant of the device definition very often represents a balance between numerous technical possibilities and economical aspects. Therefore, in the sequel, we analyze the criteria that led us to the design choice of the proposed wrapper logic.
The first dilemma deals with the economical aspect. Modern SoC design and fabrication are expensive. Design tools cost hundreds of thousands of euros, while mask costs for SoC designs now approach one million euros. For low volume applications, and especially for research projects in universities, a reconfigurable SoC based on FPGA devices is a more time and cost effective solution.
The second dilemma relates to ASICs vs FPGAs. It is well known that ASICs are optimized for the application, and hence will have the smallest area and use the least power. This comes at the cost of reduced flexibility. In addition, ASIC designs suffer from skyrocketing manufacturing costs and long development cycles. On the other hand, FPGAs are multipurpose chips that include generic hardware resources such as logic arrays, flip-flops, RAM modules, processors, and special purpose accelerators that can be configured into specific systems using a programmable interconnect grid (infrastructure). In essence, FPGAs are very area and power inefficient compared to an ASIC for each application. However, their computational and communication efficiency is good compared to ASICs, so the main intent of many researchers nowadays is to substitute each ASIC accelerator with a specific domain-oriented architecture. This enables a paradigm shift from application specific circuits to domain-oriented platforms. Because the FPGA is designed once and then programmed, it is possible to run all applications, i.e. FPGAs provide post-fabrication programmability at both the software and hardware levels. The design of such systems leaves a large degree of freedom to the FPGA designer (programmer).
The third dilemma covers QoS. It is envisioned that tomorrow's complex SoC systems will have hundreds of components that will communicate on interconnects operating in the multi-gigahertz frequency range. In such a scenario, there is a need for a communication fabric that is scalable enough to handle the increasing performance requirements of such demanding systems. The communication architecture of such systems must be able to support the QoS needs of heterogeneous systems that will require multiple modes of operations and with varying levels of real-time response requirements. QoS in wrapper bus based SoCs refers to the level of commitment for data (information) delivery. Such commitment can be in the form of correctness of the transfer, completion of the transaction, or bounds on performance. In most cases, however, QoS for wrapper bus based SoCs refers to bounds on performance (bandwidth, delay, and jitter) since correctness and completion of the transaction are often the basic requirements of on-chip data transfers. The bounds of bandwidth, occupied area, latency, and power consumption for CDMA based interconnect will be considered in section Experimental results.
Experimental setup
In order to evaluate: 1) the performance related to design quality of wrapper logic implemented in a CDMA based bus architecture; and 2) latencies of a traditional Binary_Coded_Bus, BCB (see Fig. 1a)), with respect to the CDMA_Shared_System_Bus, CSSB (see Fig. 1b)), we will assume the following:
-Assumptions related to wrapper logic:
1a) The SoC system and its bus organization are based on the concept already depicted in Fig. 1b);
1b) The Binary Coded Bus (BCB 1 or BCB 2) consists of a 32 bit address bus, a 32 bit data bus, and a Control bus (see Fig. 1b));
1c) We have implemented a CDMA encoding scheme on address and data bus signals only. By using this approach we make minor modifications to existing bus conversion protocols such as BVCI to AHB [11], PVCI to AMBA [28], and others. From the aspect of CDMA encoding, the transformations performed on data and address bus signals are identical. Therefore, in the text that follows, only a transfer over the data bus will be considered;
1d) Data transfer over the CDMA coded bus is achieved by using parallel lines grouped into bundles of 4, 8, 16, or 32 bus signal lines (see Fig. 4);
1e) Orthogonal Walsh functions are used for CDMA encoding;
1f) Signaling diagrams which relate to Read and Write processor cycles are identical to those sketched in Fig. 3;
1g) Testbenches were created for estimating: i) power consumption in mW/10 MHz; ii) absolute bandwidth in MB/s; and iii) energy per byte transfer.
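Assumption 1e) relies on sets of mutually orthogonal Walsh codes. A minimal sketch of how such code sets can be generated and checked follows; it assumes the standard Sylvester (Hadamard-ordered) construction with +1/-1 chips, and the function names `walsh_codes` and `is_orthogonal` are hypothetical. The code widths 4 to 32 match the spreading code widths discussed in the results; how the wrapper maps codes onto bundles is not modeled here.

```python
def walsh_codes(width):
    """Return `width` Walsh codes of length `width` (Sylvester construction).

    `width` must be a power of two; chips are +1/-1 and rows come out in
    Hadamard order.
    """
    if width < 1 or width & (width - 1):
        raise ValueError("width must be a power of two")
    codes = [[1]]
    while len(codes) < width:
        # Doubling step: H_2n = [[H_n, H_n], [H_n, -H_n]]
        codes = [row + row for row in codes] + [
            row + [-chip for chip in row] for row in codes
        ]
    return codes


def is_orthogonal(codes):
    """True if every pair of distinct codes has a zero dot product."""
    return all(
        sum(a * b for a, b in zip(codes[i], codes[j])) == 0
        for i in range(len(codes))
        for j in range(i + 1, len(codes))
    )


if __name__ == "__main__":
    for width in (4, 8, 16, 32):
        codes = walsh_codes(width)
        print(width, len(codes), is_orthogonal(codes))
```

A code set of width v offers at most v mutually orthogonal codes, which bounds how many independent streams can share one coded line in this textbook model.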
-Assumptions related to data transfer latencies:
2a) All items 1a)-1g), already mentioned, are valid;
2b) Master-slave data transfer rates for both kinds of microprocessor systems (Fig. 1a) and 1b)) will be considered;
2c) The velocity of signal propagation over bus wires is $2 \times 10^8$ m/s, and the distance between the master and slave modules is, on average, 30 cm. Accordingly, the signal propagation delay is $t_P = 1.5$ ns;
2d) Time delays introduced by both types of arbiter logic, $t_{BAS}$ and $t_{BAC}$, given in Fig. 1a) and 1b), are identical: $t_{BAS} = t_{BAC} = t_{BA}$. In our case $t_{BA} = 10$ ns;
2e) Access times to all slave modules (memory or I/O modules) are identical: $t_{ACCMEM} = t_{ACCI/O} = t_{ACC}$. In our design $t_{ACC} = 30$ ns;
2f) An address decoder is installed in each slave module of Fig. 1a) only. The address decoder introduces a time delay $t_D$. In our case $t_D = 3$ ns. Note that in slave modules installed in the CDMA based system (Fig. 1b)) address decoder logic is not needed. Namely, thanks to code orthogonality, the decoder is implemented in the CDMA decoder building block;
2g) The total time delay introduced by the CDMA coding and decoding process ($t_{CDMA}$, see Fig. 3) for different spreading code sizes is presented in Table 1 in the column Total latency;
2h) All bus requests initiated by master modules reach the arbiter's inputs at instant zero. According to the bus allocation policy implemented in the arbiter logic, bus requests in the multiprocessor system given in Fig. 1a) will be served sequentially. In contrast, in the multiprocessor system pictured in Fig. 1b) all requests will be served simultaneously, provided that there is no conflict caused by simultaneous access to a single slave module, i.e. a case when two or more master modules simultaneously issue a request for accessing the same slave module;
2i) Multiprocessor systems given in Fig. 1a ) and 1b) are composed of k processors (master modules). 
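The delay assumptions above can be expressed in units of the propagation delay $t_P$, which is the form in which they enter the latency formulas of the next section. A short worked calculation follows; rounding $t_{BA}$ up to $7t_P$ is inferred from the stated values of 10 ns and 1.5 ns.

```latex
\begin{align*}
t_P &= \frac{0.3\ \text{m}}{2\times 10^{8}\ \text{m/s}} = 1.5\ \text{ns},\\
t_{BA} &= 10\ \text{ns} \approx 6.67\,t_P \approx 7\,t_P,\\
t_D &= 3\ \text{ns} = 2\,t_P,\\
t_{ACC} &= 30\ \text{ns} = 20\,t_P.
\end{align*}
```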
Performance metrics
Two different kinds of metrics will be considered. The first relates to a design quality of wrapper logic, while the second one points to data transfer latency of both multiprocessor systems.
Wrapper performance metrics
With the aim of evaluating the performance of a wrapper design, we introduce a metric called Design_Quality, DQ, defined as:
Data transfer latency ratio
In order to evaluate data transfer latency for both multiprocessor systems (Fig. 1a) and 1b)) we will consider first a timing related to Write (Fig. 3b) ), and Read operation (Fig. 3a) ).
A. Write operation
According to the assumptions 1a)-1g) and 2a)-2i), the total access time of a master-slave data transfer is equal to:
i) Standard system (Fig. 1a)): When there is a single request for master-slave data transfer, the request will be served in
$T_{TS1W} = t_P + t_{BA} + t_D + t_{ACC}$ (2)
By substituting $t_{BA} \approx 7t_P$, $t_D = 2t_P$, and $t_{ACC} = 20t_P$, we obtain $T_{TS1W} = 30t_P$.
When all k master modules issue requests for data transfer, the access time, $T_{TSW}$, will be proportional to k, i.e.
$T_{TSW} = k \cdot T_{TS1W} = k \cdot 30t_P$
ii) CDMA based system (Fig. 1b)): In this case all k master-slave connections are performed simultaneously in
$T_{TCW} = t_P + t_{BA} + t_{CDMA} + t_{ACC} = 28t_P + t_{CDMA}$ (4)
where $t_{CDMA}$ is the time interval which corresponds to the total latency given in Table 1.
B. Read operation
In a similar way, for a Read operation (see Fig. 3a)) we obtain:
j) Standard system (Fig. 1a)):
$T_{TS1R} = 2t_P + 2t_{BA} + t_D + t_{ACC} = 38t_P$ (5)
jj) CDMA based system (Fig. 1b)):
$T_{TCR} = 2t_P + 2t_{BA} + 2t_{CDMA} + t_{ACC} = 36t_P + 2t_{CDMA}$
Note that for establishing master-slave data transfer the system from Fig. 1a) uses a TDMA approach, while the system from Fig. 1b) exploits parallelism.
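The Write and Read access times above can be evaluated numerically with a small sketch. It uses the delays in units of $t_P$ as substituted in the text; the function names are hypothetical, scaling the standard Read time by k is an assumption made by analogy with the Write case, and the example value of 120.992 ns is the maximal total latency quoted for the Spartan2E series. This is a plain time comparison and not the ratios $R_W$ and $R_R$, which also account for the number of bus lines.

```python
T_P = 1.5          # ns, propagation delay (assumption 2c)
T_BA = 7 * T_P     # arbiter delay, rounded to 7 t_P as in the text
T_D = 2 * T_P      # address decoder delay (standard system only)
T_ACC = 20 * T_P   # slave access time


def write_standard(k):
    """Sequential (TDMA) Write time for k masters: k * 30 t_P."""
    return k * (T_P + T_BA + T_D + T_ACC)


def write_cdma(t_cdma):
    """Simultaneous CDMA Write time: 28 t_P + t_CDMA, independent of k."""
    return T_P + T_BA + t_cdma + T_ACC


def read_standard(k):
    """Sequential Read time; the factor k is assumed by analogy with Write."""
    return k * (2 * T_P + 2 * T_BA + T_D + T_ACC)


def read_cdma(t_cdma):
    """Simultaneous CDMA Read time: 36 t_P + 2 t_CDMA."""
    return 2 * T_P + 2 * T_BA + 2 * t_cdma + T_ACC


if __name__ == "__main__":
    t_cdma = 120.992  # ns, example coding/decoding latency
    for k in (2, 4, 8, 16):
        print(
            k,
            write_standard(k), write_cdma(t_cdma),
            read_standard(k), read_cdma(t_cdma),
        )
```

The standard bus time grows linearly with k while the CDMA time stays constant, so for a given $t_{CDMA}$ there is a number of masters above which the simultaneous transfers finish sooner.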
Metrics Q and R
We now form a performance metric, Q, as the product of the following three parameters: a) the number of bus lines, L; b) the total access time, T; and c) the communication bandwidth, B: $Q = L \cdot T \cdot B$
For the standard system (Fig. 1a) ) and CDMA bus based system (Fig. 1b) ), respectively, we have:
where a subscript X can be W for Write operation, or R for Read operation.
Let B 1 = B 2 (i.e. an equal amount of information is transferred by both systems), and define now a data transfer ratio for Write and Read operation, R W , and R R , respectively as
and
In terms of data transfer rate performance, we will use the data transfer ratios $R_W$ and $R_R$ as convenient range indices with which to evaluate the increased latency of the CDMA simultaneous master-slave data transfer (Fig. 1b)) with respect to a sequential TDMA bus transfer (Fig. 1a)).
Experimental results
Increasing demand for high-speed on-chip interconnects requires faster links that consume less power. Signal coding is a standard approach used to lower the bus width, achieve a low signaling rate, and find a low-power scheme. However, the complexity of those coding systems (transmitter and receiver hardware with micro-power consumption) prohibits their use in high-speed on-chip applications. An on-chip interconnect scheme based on a CDMA technique of relatively low complexity, low power, and high bandwidth is proposed here, and its performance related to the design quality of the wrapper logic and to data transfer latency is evaluated.
Wrapper logic performance
The wrapper logic was described at RTL level using VHDL. For synthesis, routing, and mapping a Xilinx development CAD tool ISE WebPack 9.1i was used. Design verification was performed using testbenches intended for parallel excitation of all bundle links. The wrapper was implemented on FPGA devices from Spartan2, Spartan3, Virtex4, Virtex5, and VirtexE series. The results generated by a CAD tool relate to: a) a number of equivalent gates, or logic cells, which is proportional to the occupied silicon area; b) a signal propagation time which corresponds to the total latency of a communication channel, i.e. the time interval t 12 , sketched in Fig. 3 ; and c) the dissipated power in mW for a given operating frequency.
Absolute performance
The obtained results that correspond to absolute quantitative performance values are presented in Table 1. For a given target device, in each row of Table 1
a5) Total latency - total propagation delay, which includes signal propagation through a master wrapper, a slave wrapper and the CDMA Shared System Bus (time interval $t_{CDMA}$ in Fig. 3).
a6) Power consumption - dissipated power of a wrapper logic pair (transmitter and receiver) at 10 MHz clock excitation.
a7) Absolute bandwidth - the bandwidth achieved at the maximal operating frequency, in megabytes per second, for a given FPGA circuit.
a8) Energy per byte transfer - consumed energy per single byte transfer.
In general, according to the results presented in Table 1, we can conclude the following:
1. For all design solutions, increasing the width of the spreading code, which is equivalent to decreasing the number of bundles, decreases the number of lines of the CDMA coded bus but increases the latency.
2. For each master-slave wrapper the number of equivalent logic gates has its minimum when orthogonal spreading codes of width from 8 up to 16 bits are used.
3. The consumed energy per byte transfer decreases as the number of lines for data transfer decreases, i.e. in the concrete case of Solution S4 we obtain minimal or near-minimal energy consumption in most design solutions. Such results are a direct consequence of the hardware overhead for CDMA coding.
4. The communication bandwidth is always highest for Solution S1 and smallest for S4, mainly due to larger number of lines, 32 versus 7, and shorter chip sequences, 4 in contrast to 32.
5. In our opinion the consumed energy per byte transfer is relatively low. It is in the range of 0.0888 pJ/B (for target device from series Virtex5 Solution S4) up to 0.608 pJ/B (for target devices from series Virtex4 Solution S1).
6. Data transfer rates are relatively high and are in the range of 587.9 megabytes per second (for the target device from the Virtex5 series, Solution S1) down to 30.6 megabytes per second (for the target device from the Spartan3 series, Solution S4). This result implies that the impact of using new technology is not so pronounced, as was the case with a latency ratio, which means that the hardware structures of logic cells in the Spartan and Virtex series are similar.
Relative performance
For relative performance evaluation of a master-slave wrapper pair, the metric DQ was used. For four different connections with parallel lines (eight-, four-, two-, and single-bundle), performance parameters which correspond to the number of links (CDMA channels), the number of equivalent logic gates, and the total latency are given in Table 1. In order to simplify our analysis and result presentation, but without loss of generality, we have further assumed that $\alpha = \beta = \chi = 1$. Performance parameters A, T, and B that take part in forming DQ (see Fig. 5) were normalized with respect to their maximal values for all device series and all solutions (for example, for the Spartan2E series the maximal value for A is 1912 gates, for T it is 120.992 ns, and for B it corresponds to 32 lines, see Table 1). The normalization was used with the aim of evaluating the wrapper architectural quality, which takes into account the encoding complexity of a CDMA coded bus for different bus widths, i.e. numbers of bundles.
c) A minimal value of the DQ metric, for all design solutions, is obtained for spreading code widths in the range of 8 to 16 bits.
Data transfer latency performance
Table 2 reports the results which relate to data transfer latencies for the standard and CDMA bus based systems. For implementation of the wrapper logic, FPGAs from the Spartan2E, Virtex4 and Virtex5 series were used. Multiprocessor systems composed of 2, 4, 8, and 16 master and slave modules were considered. We assumed that all master modules simultaneously access different slave modules, i.e. contention was not considered. This means that the presented results correspond to the maximal data transfer rate. During this analysis we assumed that bus requests issued by master modules in Fig. 1a) are served in a circular way, i.e. a fixed time slot bus allocation scheme was used.
By analyzing the results given in Table 2 we can conclude the following:
1) Spartan2E series: transfer rates for Write operations are from 28% (for 2-processor systems) to 6% (for 16-processor systems) faster for the CDMA based bus architecture than for the standard bus architecture. In contrast, for Read operations the standard bus architecture is 23% to 32% faster.
2) Virtex4 and Virtex5 series: for both Write and Read operations, the data transfer rates of the CDMA based bus architecture are superior to those of the standard bus architecture. Namely, for Virtex4 Write operations are from 126% to 108% and Read operations are from 54% to 34% faster, while for Virtex5 Write operations are from 188% to 172% and Read operations are from 96% to 78% faster.
Note that the results given in Table 2 are illustrative only. Other factors such as bus allocation policy, physical bus wiring limitations, propagation delay caused by the complexity of the bus arbiter, burst mode of bus transfer, etc. have to be considered in real applications.
In general, the results given in Tables 1 and 2 show that: a) adopting a CDMA bus based system is a trade-off between a decreased number of bus lines and communication time, and it may appeal to applications where bus size (wiring) reduction is imperative; and b) the increased data transfer latencies introduced by CDMA data transfer are compensated by simultaneous master-slave transfers.
Conclusion
Traditionally, design-space exploration for SoCs has been focused on the computational aspects of the problem at hand. However, integrating an increasingly large number of IP cores on the same chip makes the design of communication architectures for future SoCs a challenging problem. As a result, design-space exploration with emphasis on communication aspects becomes crucial. Towards this end, in this article we have described a binary CDMA wrapper based bus implementation that has acceptable performance with low hardware cost. Two types of wrappers can be identified: a master, located at the output of a bus arbiter in an MPSoC, and a slave, attached to memory/peripheral modules. With the aim of combining the positive attributes of smaller address and data buses with control bus compliance with existing bus conversion protocols such as AMBA to BVCI or PVCI, we have proposed a CDMA encoding technique for both address and data buses, but not for the control bus. The proposed solution utilizes orthogonal CDMA coding and a variation of source synchronous clocking in order to achieve channel separation without interference.
Reconfiguration of the CDMA coded bus system is achieved by simply re-assigning the spreading code during the FPGA configuration phase. At the behavioral level, the wrapper structure was described using VHDL code. For synthesis, routing, and technology mapping a Xilinx development CAD tool ISE WebPack 9.1i was used. In general, as with other optimization techniques, there is always a trade-off. Here, the trade-off is a decreased number of bus lines for better wiring performance.
We suspect that there remains further room for improvement; in particular, multilevel signaling can be used to reduce the number of signal paths and/or to increase the data rate.
Table 1: Implementation results for Xilinx FPGA
Table 2: Data transfer ratio for Write and Read operations
Notice: $t_P$ = 1.5 ns; the column referred to as Number of lines includes address and data bus lines.
