Abstract-To meet the demanding scalability, adaptability, and computational requirements of next-generation signal-processing systems, a flexible, high-performance signal-processing module based on the Serial RapidIO (SRIO) interconnect and the Advanced Mezzanine Card (AMC) standard is developed. RapidIO is a packet-switched interconnect intended primarily as an intra-system interface, allowing chip-to-chip and board-to-board communication at 10 Gbps performance levels. The AMC specification is defined by the PCI Industrial Computer Manufacturers Group (PICMG) and covers the base-level requirements for a wide range of high-speed mezzanine cards optimized for, but not limited to, Advanced Telecommunications Computing Architecture (ATCA) carriers. With this versatile module, a massively parallel multiprocessor computing system becomes feasible. A prototype system of 12 modules is introduced, and its application to a MIMO-OFDM system is described.
I. INTRODUCTION
With the development of high-performance embedded systems, the demand for new interconnects and mezzanine cards has become critical, and conventional options such as Ethernet and the PCI Mezzanine Card (PMC) fail to meet it. RapidIO and AMC are emerging standards that satisfy the needs of many high-speed applications. RapidIO is designed for chip-to-chip, board-to-board, and chassis-to-chassis communication [1] [2] [3]. The SRIO specifications currently include three frequency points: 1.25 Gbps, 2.5 Gbps, and 3.125 Gbps [4]. AMC represents the industry's next-generation mezzanine standard supporting high-speed interfaces. It is optimized for current and emerging Low Voltage Differential Signaling (LVDS) interconnect standards [5], such as PCI Express, Advanced Switching, Serial RapidIO, and Gigabit Ethernet [6] [7]. This paper introduces a High-Performance Scalable Computing (HPSC) module, based on AMC, for next-generation telecom signal-processing applications. In this module, SRIO connects the TMS320C6455 [8], the Virtex-5 [9], and the Tsi578 [10]. Finally, an application of the module in a MIMO-OFDM system is presented.
II. THE RAPIDIO SPECIFICATION
RapidIO technology has been optimized for embedded systems, particularly those that require multiple processing elements to cooperate [11], [12]. RapidIO is an industry-standard, high-speed, switched-packet interconnect. The SRIO protocol exchanges packets and smaller quantities of link-specific information called control symbols. It uses a three-layer architectural hierarchy, shown in Fig. 1. The logical layer specifications, at the top of the hierarchy, define the overall protocol and packet formats. They provide the information necessary for end points to initiate and complete transactions [13]. The transport layer specification, in the middle of the hierarchy, defines the routing information a packet needs to move from end point to end point [14]. The physical layer specifications, at the bottom of the hierarchy, contain the device-level details, such as packet transport mechanisms, flow control, electrical characteristics, and low-level error management. Two families of RapidIO interconnects are defined at the physical layer: RapidIO with a parallel physical layer and RapidIO with a serial physical layer (SRIO) [15].
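The layering described above can be pictured as nested encapsulation. The following Python sketch is only an illustration under stated assumptions: the class names are hypothetical, and the fields are a simplified subset chosen for readability, not the bit-accurate packet format of the specification. It shows that a switch needs only the transport-layer destination to forward a packet, while the logical-layer content is interpreted at the end points.

```python
from dataclasses import dataclass


@dataclass
class LogicalLayer:
    """Top layer: what the end points want to do (simplified)."""
    ftype: int        # format type selecting the transaction class (example value only)
    payload: bytes    # transaction-specific fields and data, opaque to switches


@dataclass
class TransportLayer:
    """Middle layer: routing information between end points."""
    dest_id: int      # destination end point identifier
    src_id: int       # source end point identifier
    logical: LogicalLayer


@dataclass
class PhysicalLayer:
    """Bottom layer: per-link bookkeeping (simplified)."""
    ack_id: int       # link-level identifier used for acknowledgement
    prio: int         # priority used by link-level flow control
    transport: TransportLayer


def forwarding_key(packet: PhysicalLayer) -> int:
    """A switch in this sketch routes on the transport-layer destination only."""
    return packet.transport.dest_id


packet = PhysicalLayer(
    ack_id=0,
    prio=1,
    transport=TransportLayer(
        dest_id=0x0012,
        src_id=0x0003,
        logical=LogicalLayer(ftype=5, payload=b"\x01\x02\x03\x04"),
    ),
)
print(hex(forwarding_key(packet)))
```

Keeping the three layers as separate objects mirrors the separation of concerns in the hierarchy: link-level fields can change hop by hop without touching the routing or transaction content.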
III. THE AMC SPECIFICATION
AMC defines a modular add-on or "child" card that extends the functionality of a carrier board. AMC modules lie parallel to the carrier board and are integrated onto it by plugging into an AMC connector. The generic signal mapping across the AMC connector supports a variety of system fabric topologies for connecting AMCs. The AMC specification defines the common elements for each implementation, including mechanical, management, power, thermal, and interconnect aspects. Modules come in several different sizes, but all share some common attributes. The module's I/O can be routed via the AMC connector.
The AMC specification defines the physical connector used to mate the AMC with its carrier board, the mapping of signals to that connector, and the routing of those signals among the AMCs across the carrier, and also the carrier based switching elements. The Fabric Interface is comprised of up to 21 ports providing point-to-point connectivity for module-to-carrier and module-to-module implementations. The fabric interface can be used in a variety of ways by AMCs and AMC carrier boards to meet the needs of many applications. The ability to deploy interconnect technologies to the fabric interface is limited by the number of assigned fabric interface ports on the AMC connector and the rate capacity defined by that connector.
The synchronization clock interface provides three differential pairs for clock distribution, enabling applications that require the exchange of synchronous timing information among modules and, consequently, among multiple boards in a shelf. This allows a module to source clocks to the system when it provides a network interface function or, conversely, to receive timing information from another carrier board or module within the system. JTAG provides an industry-standard method of performing manufacturing test and verification and is critical to the test of complex products that make extensive use of BGA device packages.
The system management interface allows the AMC module to be initialized and managed before the fabric interface ports are enabled, and it is intended to be platform agnostic.
The module power interface includes Management Power (MP) and Payload Power (PWR).
IV. MODULAR DESIGN
The performance of an HPSC real-time signal processing system depends mostly on the computational capability of a single processor, the scale of parallelism in the system, the arrangement of the memory architecture, and the interconnect topology.
A. Interconnect Switches
While there are many ways to connect components in embedded systems, the most prominent are the high-speed serial standards Ethernet, PCI Express, and RapidIO. All of these standards leverage similar Serializer/De-serializer (SerDes) technology to deliver throughput and latency performance greater than what is possible with wide parallel bus technology. The trend towards a common SerDes technology will continue with future versions of these specifications, so bandwidth is not a significant differentiator among these protocols. Instead, the usefulness of each protocol is determined by how the bandwidth is used. Each technology is optimized for a particular application space. Ethernet has been optimized for networks that are geographically distributed and have long latencies and dynamic network configurations. PCIe has been optimized to support a hierarchical bus structure on a single board. Both have been used for on-board, inter-board, and inter-chassis communication, and in many cases both are used in the same system. RapidIO has the potential to combine the benefits of these two interconnects into a single interconnect, with associated power and cost savings [16]. To meet different application demands, Gigabit Ethernet and RapidIO are selected as the interconnect protocols.
Ethernet is a "best effort" means of delivering packets. Software protocols built on top of the Ethernet physical layer, such as TCP/IP, are necessary to provide reliable delivery of information. Typically, the bandwidth of Ethernet-based systems is over-provisioned by between 20 and 70%. This low-cost network, although it supports only low-speed communication, is best suited to high-latency inter-chassis applications or to on-board and inter-board applications where bandwidth requirements are low.
RapidIO is the best interconnect choice for embedded systems. RapidIO has capabilities that other interconnects will not duplicate, such as:
1) Low-latency, low-jitter distribution of system events.
2) Combined link-level and network-level flow control mechanisms.
3) Configurable error detection and topology-agnostic routing, which enable efficient sparing, high reliability, and high availability.
4) Hardware implementation of both read/write and inter-process communication messaging semantics.
These capabilities allow system architects to create better-performing systems that consume less power and are easier to scale.
The BCM5396 and the Tsi578 are selected as the Ethernet switch and the RapidIO switch, respectively.
The Broadcom BCM5396 device is a highly integrated solution ideally suited for standalone Gigabit Ethernet switches and Gigabit Ethernet control-plane and backplane applications. It combines all the functions of a high-speed switch system, including packet buffer, SerDes, media access controllers, address management, and a nonblocking switch fabric with 17 Gbps throughput.
The Tundra Semiconductor Corporation (Tundra) Tsi578 is a third-generation RapidIO switch supporting 80 Gbps aggregate bandwidth. Embedded applications further benefit from its ability to route packets to over 64000 endpoints through hierarchical lookup tables, from independent unicast and multicast routing mechanisms, and from error management extensions that provide proactive issue notification to the fabric controller.
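To illustrate how a hierarchical lookup table can cover such a large endpoint space, the following Python sketch routes on a 16-bit destination ID (65536 possible values) using two small tables instead of one flat table. It is a hypothetical model, not the Tsi578's actual table layout or register interface: the class name, the split of the ID into an upper "domain" byte and a lower "local" byte, and the default-port fallback are all assumptions made for illustration.

```python
from typing import Dict, Optional


class HierarchicalRouter:
    """Two-level routing sketch: upper byte selects a domain, lower byte a local port."""

    def __init__(self, local_domain: int, default_port: Optional[int] = None) -> None:
        self.local_domain = local_domain          # domain this switch belongs to (assumed)
        self.default_port = default_port          # used when no table entry matches
        self.domain_table: Dict[int, int] = {}    # upper byte -> output port
        self.local_table: Dict[int, int] = {}     # lower byte -> output port

    def route(self, dest_id: int) -> Optional[int]:
        """Return the output port for a 16-bit destination ID."""
        if not 0 <= dest_id <= 0xFFFF:
            raise ValueError("destination ID must fit in 16 bits")
        domain, local = dest_id >> 8, dest_id & 0xFF
        if domain == self.local_domain:
            # Destination is in this switch's own domain: use the fine-grained table.
            return self.local_table.get(local, self.default_port)
        # Otherwise forward towards the domain as a whole.
        return self.domain_table.get(domain, self.default_port)


router = HierarchicalRouter(local_domain=0x12, default_port=15)
router.local_table[0x34] = 3     # endpoint 0x1234 is reached through port 3
router.domain_table[0x20] = 7    # every endpoint 0x20xx is reached through port 7

print(router.route(0x1234))  # 3
print(router.route(0x2001))  # 7
print(router.route(0x9999))  # 15
```

With this split, two tables of at most 256 entries each stand in for a flat table of 65536 entries, which is the main appeal of a hierarchical scheme.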
B. Processors
At present, the mainstream signal processors for embedded applications are GPP/RISC processors, DSPs, and FPGAs. The main concerns in choosing the right processor are computational capability, ease of programming, power consumption, communication bandwidth, and well-defined interfaces. Taking all of these factors into account, a heterogeneous architecture consisting of a DSP and an FPGA is a good trade-off. Based on their features, the TMS320C6455 is selected as the main processor and the XC5VFX100T as the coprocessor.
The TMS320C6455 DSP is the newest high-performance fixed-point DSP in the TMS320C6000 DSP platform from Texas Instruments (TI). The C6455 device is based on the third-generation, high-performance, advanced VelociTI very-long-instruction-word (VLIW) architecture. The C6455 provides 9600 million instructions per second (MIPS) at a 1.2 GHz clock rate. The device includes one 1p4x SRIO port with 20 Gbps throughput, one Gigabit Ethernet Media Access Controller (EMAC) for long-distance communication, and one DDR2-533 SDRAM controller with 512 MB of total addressable external memory space.
The XC5VFX100T is the newest and most powerful FPGA in the Virtex-5 family from Xilinx. It provides a maximum of 680 user-available I/Os, 16 RocketIO GTP transceivers, and 20 RocketIO GTX transceivers. Both the GTP and GTX transceivers are highly configurable and tightly integrated with the programmable logic resources of the FPGA, and they offer high data rates that allow physical layer support for SRIO. The XC5VFX100T integrates two embedded IBM PowerPC 440 RISC CPUs and 256 XtremeDSP slices, which provide 128,000 MMACS running at 500 MHz.
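The peak figures quoted for the two processors follow from multiplying the clock rate by the number of operations issued per cycle. The short Python sketch below reproduces them; the function names are illustrative, and the per-cycle figures are assumptions: eight instructions per cycle for the VLIW DSP core (one per functional unit) and one multiply-accumulate per XtremeDSP slice per cycle.

```python
def peak_mips(clock_mhz: int, instructions_per_cycle: int) -> int:
    """Peak million instructions per second for a given clock and issue width."""
    return clock_mhz * instructions_per_cycle


def peak_mmacs(clock_mhz: int, mac_units: int) -> int:
    """Peak million multiply-accumulates per second, one MAC per unit per cycle."""
    return clock_mhz * mac_units


# DSP: 1.2 GHz clock, assumed 8 instructions issued per cycle.
print(peak_mips(1200, 8))      # 9600

# FPGA: 256 DSP slices at 500 MHz, assumed 1 MAC per slice per cycle.
print(peak_mmacs(500, 256))    # 128000
```

These are upper bounds; sustained throughput depends on how well the algorithm keeps every unit busy and on memory and interconnect bandwidth.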
C. The WTI6455 AMC Signal Processing Module
The data plane and control plane of the WTI6455 AMC signal processing module are designed separately. The data plane is implemented by SRIO switches and SRIO nodes, which transmit, receive, and process application data. The data plane also provides a redundant RocketIO data path, which offers 3.75 Gbps per LVDS link. The purpose of the control plane is to control, monitor, and assure proper operation of the AMC module. The control plane watches over the basic health of the module, reports anomalies, and takes corrective action when needed. The control plane is provided at two levels: low-level hardware management services and high-speed management services. The high-speed management services provide TCP/IP-based management services such as remote booting, SNMP management, remote disk services, and other IP-based services. This high-level management system is composed of Ethernet switches and Ethernet nodes.
The block diagram of the WTI6455 signal processing module is shown in Fig. 3.
The data plane of the WTI6455 consists of one TMS320C6455 and one XC5VFX100T, connected through the SRIO interface of the Tsi578, which offers 10 Gbps of bandwidth when configured in 1p4x mode. Data signals on the WTI6455 module can be routed off the board through the AMC fabric interface, using SRIO on the Tsi578. The WTI6455 also provides the RocketIO transceivers integrated in the FPGA for communication with other modules.
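The 10 Gbps figure for 1p4x mode can be reproduced from the three SRIO frequency points listed in the Introduction. The Python sketch below is a rough estimate under two assumptions: that those frequency points are per-lane line rates, and that the serial physical layer uses 8b/10b encoding, so only 8 of every 10 transmitted bits carry data. Packet header and control-symbol overhead are ignored, and the function name is illustrative.

```python
# Per-lane line rates in Gbps, taken from the frequency points in the Introduction.
SRIO_LINE_RATES_GBPS = (1.25, 2.5, 3.125)

# Assumption: 8b/10b line coding, 8 data bits per 10 transmitted bits.
ENCODING_EFFICIENCY = 8 / 10


def srio_data_rate_gbps(lanes: int, line_rate_gbps: float) -> float:
    """Unidirectional data rate of an SRIO port after line-coding overhead."""
    if lanes < 1:
        raise ValueError("a port needs at least one lane")
    return lanes * line_rate_gbps * ENCODING_EFFICIENCY


for rate in SRIO_LINE_RATES_GBPS:
    one_lane = srio_data_rate_gbps(1, rate)
    four_lanes = srio_data_rate_gbps(4, rate)
    print(f"{rate:.3f} Gbps line rate: 1x = {one_lane:.2f} Gbps, 4x = {four_lanes:.2f} Gbps")
```

At the highest line rate, four lanes give 10 Gbps in each direction, which matches the bandwidth stated for the 1p4x configuration.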
All MMC functions in the low-level hardware management services are carried out by the LPC2468. The LPC2468 microcontroller, designed by NXP Semiconductors, contains a 16-bit/32-bit ARM7TDMI-S CPU core with a real-time debug interface. Because it incorporates a variety of interfaces, the LPC2468 is well suited to multi-purpose communication applications. The integrated I2C bus on the LPC2468 is used as the IPMB. The LPC2468, TMS320C6455, and XC5VFX100T, connected by the BCM5396, make up the high-speed management system. Gigabit Ethernet signals can also be routed out of this module through the AMC fabric interface.
The computation, memory, and interconnect capabilities of the WTI6455 are summarized in Table I. The module conforms to the PICMG AMC.0 R2.0 specification in the full-size, double-width form factor. A photograph is shown in Fig. 4.
D. The WTI6455 ATCA System
The WTI6455 ATCA signal processing system is scalable in resources, topology, and application. Our system adopts a COTS ATCA shelf that can accommodate 14 ATCA front boards. The ATCA architecture defines the ability to build a flexible (full mesh) backplane that can support a variety of fabric topologies, depending on the types of boards installed. Dual star and full mesh topologies are supported in our system. Fig. 5a shows the dual star topology. A dual star topology requires two dedicated slots (hub slots) into which hub boards are inserted. Each hub slot has a channel connection to each node slot in the backplane. The two hub slots are also connected to each other in the backplane by one channel. The WTI6455 ATCA dual star system comprises two SRIO hub boards and 12 node boards, each board accommodating one WTI6455 signal processing module. Each hub board provides a 1p4x SRIO connection to every node board and to the other hub board in the shelf, using 13 available 1p4x connections. Each node board supports two 1p4x SRIO connections. Identically configured hub boards can be used in both hub slots to establish redundant 1p4x SRIO switching support for each node board connected to the backplane.
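The connection counts of the dual star arrangement can be checked by enumerating its channels. The Python sketch below builds the link set for two hubs and 12 nodes and counts the connections per board; the board names and helper functions are hypothetical and serve only to illustrate the topology.

```python
from collections import Counter
from typing import List, Set, Tuple

Link = Tuple[str, str]


def dual_star_links(hubs: List[str], nodes: List[str]) -> Set[Link]:
    """Every hub connects to every node, and the two hubs connect to each other."""
    if len(hubs) != 2:
        raise ValueError("a dual star needs exactly two hubs")
    links = {(hub, node) for hub in hubs for node in nodes}
    links.add((hubs[0], hubs[1]))
    return links


def connections_per_board(links: Set[Link]) -> Counter:
    """Count how many channels terminate at each board."""
    counts: Counter = Counter()
    for a, b in links:
        counts[a] += 1
        counts[b] += 1
    return counts


hubs = ["hub0", "hub1"]
nodes = [f"node{i}" for i in range(12)]
links = dual_star_links(hubs, nodes)
counts = connections_per_board(links)

print(len(links))        # 25 channels in total
print(counts["hub0"])    # 13 connections per hub board
print(counts["node0"])   # 2 connections per node board
```

Each node keeps a path through either hub, which is what gives the dual star its redundancy at a modest channel count.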
In a full mesh backplane, every board has a direct channel connection to every other board in the backplane, as shown in Fig. 5b. Since a dual star topology is a subset of the full mesh, the full mesh backplane offers the highest degree of flexibility. Full mesh configurations do not use a central switch fabric, so all slots can be used for data forwarding and processing resources, which makes maximum use of the physical system capacity. In the WTI6455 ATCA system, there are 13 RocketIO channels from each slot to all other slots. That is 78 channels in total. A full mesh backplane requires a larger number of backplane trace routes and connector pins per slot than the dual star configuration but offers several advantages, such as system scalability, system redundancy, and physical efficiency. It is suitable for embedded applications with lower bandwidth but higher reliability requirements.
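Channel counts in a full mesh follow from simple combinatorics: with n slots, each slot has n - 1 direct channels and the backplane carries n(n - 1)/2 channels in total. The Python sketch below evaluates both quantities. The function names are illustrative, and the slot counts passed in are assumptions: a total of 78 corresponds to 13 meshed slots, whereas meshing all 14 slots of the shelf, with 13 channels per slot, would give 91.

```python
def channels_per_slot(slots: int) -> int:
    """Direct channels leaving one slot in a full mesh."""
    if slots < 1:
        raise ValueError("need at least one slot")
    return slots - 1


def full_mesh_channels(slots: int) -> int:
    """Total point-to-point channels in a full mesh of the given size."""
    # Each of the n slots has n - 1 channels, and every channel is shared by two slots.
    return slots * channels_per_slot(slots) // 2


for n in (13, 14):
    print(n, channels_per_slot(n), full_mesh_channels(n))
# 13 12 78
# 14 13 91
```

The quadratic growth of the total explains why a full mesh needs many more backplane traces and connector pins than a dual star of the same size.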
The WTI6455 ATCA system is topology independent owing to the flexibility and scalability of the WTI6455 module. Different system scales and structures can be implemented by designing a custom secondary backplane and using an SRIO switch module. For high efficiency, the WTI6455 ATCA system architecture must map well onto the parallel algorithm and processing flow. The two most common options for partitioning the processing are process farming and data farming (or a combination of both) [17]. Process farming allocates a subset of the required functionality to each processor and acts on the input data in pipeline fashion. Data farming allocates a subset of the input data to each processor while applying the complete required functionality. This prototype is designed as a combination of process farming and data farming.
V. THE WTI6455 ATCA SYSTEM IN MIMO-OFDM APPLICATION
The WTI6455 ATCA system is applied in the project "Researches on the key technologies in Gbps wireless communication system", which requires 10 Gbps of throughput in the baseband processing system [18]. The application is intensive in both computation and throughput, and it fully exploits the computational capability and high bandwidth of the WTI6455 system. The MIMO-OFDM system consists of a front end, an ADC module, a digital signal processing subsystem, and a DAC module. Both the ADC module and the DAC module are data-intensive and timing-critical, and they do not support the SRIO network; therefore, the RocketIO links between FPGAs are used. The hardware block diagram of the MIMO-OFDM system is shown in Fig. 6.
The MIMO-OFDM system hardware platform has been constructed and is used for software development. A photograph of the testbed is shown in Fig. 7. The maximum data rate between two WTI6455 AMC modules in one ATCA system can reach 7.5 Gbps. The peak rate between two ATCA shelves across the wireless interfaces reaches almost 0.8 Gbps in our MIMO-OFDM system.
VI. CONCLUSIONS
This paper presents a new kind of baseband signal processing module. The module is based on the AMC specification and provides three kinds of interconnect: Gigabit Ethernet, Serial RapidIO, and RocketIO links. It therefore meets the demands of high-bandwidth, low-latency wireless multimedia systems and can also be adopted as a hardware platform in various embedded systems, such as radar and other digital signal processing systems.
