We present a tier 2 Software-Defined Radio (SDR) platform built around the SB3500, the latest multithreaded Digital Signal Processor (DSP) from Sandbridge Technologies, and describe the major design steps taken to ensure the best radio link and computational performance. This SDR platform is capable of executing 4G wireless communication standards such as WiMAX Wave 2, WLAN 802.11g, and LTE. Performance results for WiMAX are presented in the conclusion section.
Introduction
SDR is a collection of hardware and software technologies that enable reconfigurable system architectures for wireless networks and user terminals. The SDR should provide an efficient and comparatively inexpensive solution to the problem of building multimode, multiband, multifunctional wireless devices that can be enhanced using software upgrades [1].
Tier 2 Software-Defined Radios provide software control of a variety of modulation techniques, wide-band or narrowband operation, communication security functions, and waveform requirements of current and evolving standards over a broad frequency range. The frequency bands covered may still be constrained at the front-end requiring a switch in the antenna system.
The platform we present in this paper is a tier 2 SDR MIMO-capable ExpressCard, based on the latest DSP from Sandbridge Technologies, the SB3500. It supports both PCI Express and USB 2.0 connectivity, and it can also be used standalone, powered from a 3.3 V/1.5 A wall wart supply. In standalone mode, data connectivity is provided through a standard USB 2.0 interface and a small adaptor board.
Due to its small form factor (54×110 mm in frame) and low power consumption, the platform can easily be transitioned to mobile applications such as smart phones and PDAs.
There are numerous frameworks for SDR platforms [2] [3] [4] targeting both research and development for 4G wireless systems, with cost ranging from a few hundred to tens of thousands of dollars. A large variety of SDR platforms bearing multiple processors and/or expensive FPGAs are currently available. To our knowledge, the platform we present in this paper is one of the most cost/performance-effective small form factor designs currently available.
Small form factor design is a challenging task. It must meet several conflicting criteria, such as low power and very high processing speed, multiple frequency bands spread over a large spectrum with reasonable antenna gain in each band, good signal separation and integrity with densely packed components, and low-cost manufacturing. One of the most challenging limitations in the design process is the thickness of the circuit board. The coupling between signals grows with the circuit board thickness, resulting in higher noise and signal interference. At the same time, the cost of the circuit board increases as the thickness decreases. A good compromise between circuit board cost and signal integrity makes a material difference in the overall performance of the device. High-density Ball Grid Array (BGA) devices also drive the cost of the circuit board. Finally, at GHz range frequencies, the consistency of the circuit board physical properties needs to be tightly controlled, since the RF design often requires better than 10% accuracy for trace impedances.
This paper is structured as follows. Section 2 is dedicated to the high-level description of the hardware platform. Section 3 is dedicated to the SB3500 processor, with a brief description of the Instruction Set Architecture (ISA). Section 4 describes the power supply, the Analog Front End (AFE), and the Radio Frequency (RF) front-end. Printed circuit board design issues and the noise minimization methods applied are in Section 5. Measurements on the SDR board are presented in Section 6. The conclusions are provided in Section 7.
The Hardware Platform
The hardware block diagram is illustrated in Figure 1. As shown, the SDR platform includes two symmetrical Zero Intermediate Frequency (ZIF) RF transceivers, capable of Full Frequency Duplexing (FFD) in a one-receiver, one-transmitter configuration or Time Division Duplexing (TDD) in a two-receiver, one-transmitter configuration (2 × 1 Multiple Input Multiple Output (MIMO)). Both ZIF channels (A and B in the figure) are digitized by two high-speed 10-bit data converters connected to the System on Chip (SoC) through parallel busses. The I input to the A2D is sampled on every rising edge of the sampling clock CLK AD, while the Q input is sampled on the falling edge. Both RF transceivers and data converters are controlled through the SPI interface with multiple chip selects (CSA0, CSB0, CSC0, and CSD0) from the SoC. If latency is an issue, the transceivers' gain control can also be done through two separate parallel buses (not shown in the figure) connected to one of the General Purpose Input Output (GPIO) ports of the SoC. While transceiver A can be configured as both a receiver and a transmitter, transceiver B is fixed in receive-only mode. There are three antennas in the system: one transmit antenna fed through a Power Amplifier (PA) and two receive antennas connected to the RF transceivers through Band Pass Filters (BPF). Because of low phase noise requirements (discussed in more detail in the following sections), the RF reference oscillator is separated from the system clock.
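Since I is sampled on the rising clock edge and Q on the falling edge, one plausible way for software to consume a converter's output is as a stream of alternating I and Q words. The sketch below illustrates that idea only; the word order (I first), the offset-binary coding, and the function name `deinterleave_iq` are assumptions for illustration, not details given for this platform.

```python
def deinterleave_iq(words, n_bits=10):
    """Split an alternating I/Q word stream into complex baseband samples.

    Assumptions (hypothetical, not from the platform documentation):
    - words arrive as I, Q, I, Q, ... (I captured first)
    - each word is an unsigned offset-binary code of n_bits
    """
    if len(words) % 2 != 0:
        raise ValueError("expected an even number of words (I/Q pairs)")
    midscale = 1 << (n_bits - 1)      # 512 for a 10-bit converter
    i_words = words[0::2]             # even positions: I samples
    q_words = words[1::2]             # odd positions: Q samples
    # Remove the offset so that mid-scale maps to zero
    return [complex(i - midscale, q - midscale)
            for i, q in zip(i_words, q_words)]


samples = deinterleave_iq([512, 512, 1023, 0])
print(samples)
```

With a two's-complement converter the offset subtraction would be replaced by sign extension; the de-interleaving step itself would stay the same.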
On the digital side, the SoC is interfaced to the Multiple Chip Package (MCP) through FLASH and DDR memory busses. A separate USB controller, memory mapped in the flash memory area, is used to communicate with the host. Since the FLASH memory is used mostly for the booting sequence, the FLASH bus is less busy.
The whole system is powered using a management scheme compliant with the Point Of Load (POL) architecture, adapted for the size and supply constraints defined 
The SB3500 Processor
At the center of the design is the Sandbridge system on chip, the SB3500 [5]. The simplified block diagram of the SB3500 is illustrated in Figure 2. It consists of three nodes containing the SBX cores with one DigRF interface, and an ARM926 subsystem with Direct Memory Access (DMA), a Serial Data Input Output interface (SDIO), a Universal Serial Bus interface (USB), LCD and camera interfaces, a Dynamic Memory Controller (DMC), a Static Memory Controller (SMC), a DPMU, and various peripherals such as timers, General Purpose Input-Output (GPIO), audio codec, PS2 interface, SPI interface, smart card interface, I2C interface, and UART/IRDA interface. The SBX Sandblaster nodes are connected together in a ring topology on a High Speed Synchronous Network (HSN), which is 64 bits wide and runs at a maximum of 300 MHz with programmable bus frequency. The peak bus rates are up to 19.2 Gbps.
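The quoted peak rate is consistent with the bus width and maximum clock. A minimal check, assuming one full-width transfer per clock cycle (an assumption; the text gives only the width, the maximum frequency, and the peak rate):

```python
def hsn_peak_rate_gbps(bus_width_bits, clock_hz):
    """Raw peak bus rate in Gbps, assuming one full-width transfer per clock."""
    return bus_width_bits * clock_hz / 1e9


peak = hsn_peak_rate_gbps(64, 300e6)
print(f"HSN peak rate at 300 MHz: {peak:.1f} Gbps")

# Because the bus frequency is programmable, the peak rate scales linearly
# with the selected clock (the 150 MHz value is only an illustrative setting).
print(f"HSN peak rate at 150 MHz: {hsn_peak_rate_gbps(64, 150e6):.1f} Gbps")
```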
The lower-speed busses are connected to the HSN using an Advanced Microcontroller Bus Architecture (AMBA) bridge (HAB bridge), forming the fourth node of the HSN. The HAB bridge drives the memory controller on an Advanced eXtensible Interface (AXI). AXI supports separate address/control and data busses, unaligned data transfers, and burst-based transactions. The ARM core communicates with the HAB either as a master or as a slave, on the Advanced High performance Bus (AHB), using a bus protocol with a fixed pipeline between address/control and data phases. The ARM can also access the peripherals or program the DPMU using an Advanced Peripheral Bus (APB) through the AMBA-AHB to AMBA-APB bridge. The APB bus uses a simple protocol for general purpose peripherals. All peripherals are available to either the Sandblaster cores or the ARM926EJ-S, via their base addresses. A programmable PLL generates the clock, referenced from an external Temperature Compensated Crystal Oscillator (TCXO) source (10 MHz to 50 MHz). The clock generation block of the DPMU distributes several programmable internal clocks to various subsystems. The DPMU directly controls the power domains for the three Sandblaster cores and for the ARM926EJ-S processor. It is also capable of controlling, via an external power management IC (using I2C, SPI, or GPIO), all other power domains for the DigRF, DMC, SoC, and IO interfaces separately. Debugging and programming the SoC is possible using separate JTAG interfaces for the SBX and ARM subsystems.
A detailed structure of the SBX node is presented in Figure 3. The Sandblaster core is a multithreaded processor with four independent threads. It contains a 32 KB instruction cache and a 256 KB internal data memory accessible to all hardware threads as well as from external sources via the HSN. Every core has two Parallel Streaming Data (PSD) interfaces for baseband I and Q, 9 × 24-bit wide Multipurpose Timers (MPTs), an SPI interface which can address up to four devices, and an I2C interface. On some nodes, SPI input-outputs are shared with PSD or I2C. The PSDs are used to control input/output data flow between the SB3500 device and a fast external device (e.g., an A2D/D2A front end converter); the data direction is set either externally via PSD DIR pins or programmed internally in software.
The Instruction Set Architecture for this processor is simple and orthogonal, with Single Instruction Multiple Data (SIMD) for the processing unit. Each cycle, a thread can execute three instructions. The SB3500 Sandblaster core architecture 2.0 was developed to allow the software implementation of the physical layer of the 4G standards. The major change in the 2.0 architecture is the introduction of 16-wide vector operations and instructions specialized for efficient execution of 4G kernels. These operations implement FFTs with 4 complex multiplies per cycle, polynomial multiply, multiply-reduce, and multiply-and-add, and compute the polynomial modulus (Galois field arithmetic support). Viterbi decoding is possible with 16 Viterbi butterflies in parallel. Turbo decoding is supported for convolutional codes with a constraint length of 3. Vector operations are also available to rearrange data in registers (packing/unpacking 8-bit to 16-bit data and 16-bit to 32-bit data, shuffling the elements of a pair of registers, rotating register pairs, copying or shifting the accumulator into a register). Digital signal processing typically uses fixed-point arithmetic. All the vector operations that perform addition, subtraction, multiplication, and left shift have a fixed-point version.
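To make the notion of a fixed-point version of a vector operation concrete, the sketch below emulates a 16-wide saturating multiply-accumulate in Python. It is illustrative only: the Q15 format, the rounding mode, and all function names are assumptions and do not describe the actual SB3500 instruction semantics.

```python
Q15_MAX = 2**15 - 1    #  32767, largest Q15 value (just under +1.0)
Q15_MIN = -2**15       # -32768, smallest Q15 value (-1.0)


def saturate_q15(x):
    """Clamp an integer to the signed 16-bit range instead of wrapping."""
    return max(Q15_MIN, min(Q15_MAX, x))


def to_q15(value):
    """Convert a float in [-1, 1) to Q15, saturating out-of-range inputs."""
    return saturate_q15(int(round(value * 2**15)))


def q15_mul(a, b):
    """Fractional multiply: the product carries 30 fractional bits, so add a
    rounding constant and shift right by 15 to return to Q15."""
    return saturate_q15((a * b + (1 << 14)) >> 15)


def q15_add(a, b):
    """Saturating addition of two Q15 values."""
    return saturate_q15(a + b)


def q15_vec_mac(acc, xs, ys):
    """Element-wise multiply-and-add over equal-length vectors."""
    return [q15_add(c, q15_mul(x, y)) for c, x, y in zip(acc, xs, ys)]


# 16-wide example: 0.5 * 0.5 accumulated onto zero in every lane
lanes = q15_vec_mac([0] * 16, [to_q15(0.5)] * 16, [to_q15(0.5)] * 16)
print(lanes[0], lanes[0] / 2**15)
```

Saturation matters here because, without it, an overflow such as (-1.0) × (-1.0) would wrap around to a large negative value rather than clamp just below +1.0.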
Hardware Design Considerations
The maximum power available from the ExpressCard slot is around 2.2 W on 3.3 V, 1.2 W on 3.3 VAUX, and 0.6 W on 1.5 V. Thus the maximum usable power for the SDR board from the ExpressCard interface is 3.4 W, which corresponds to the sum of the two 3.3 V rails. The power supply, illustrated in Figure 4, uses a PMIC and a triple Low Drop Out (LDO) regulator for power domains lower than 3.3 V and a MOSFET switch for the 3.3 V power domains. The PMIC power enable and system enable inputs are used to implement the sequence required by the SB3500, as shown in Figure 5. Once the SB3500 is running, the DPMU may be programmed via the ARM to turn off the unused power domains. The PMIC may also be programmed through the PM I2C interface (PM GPIO2 and PM GPIO3) to modify all the output voltages or to shut off the unused domains. For handheld applications, this versatile scheme conserves battery power when the firmware uses the SoC hardware resources only partially. The DPMU gets its clock from a low-frequency TCXO which keeps running as long as the card is powered. This way, the DPMU can run in sleep mode with the ARM powered down. After a complete power supply sequence, two resets are generated by the supervisor ICs: an asynchronous power-on reset (nPOR), which initializes the power management, PLL control, and debug interface blocks, and an asynchronous master reset (NRST) for the rest of the subsystems.
Some analog design criteria are described in the following. A careful analog front-end design results in less noise and signal interference and, as a consequence, less processing power is needed to meet the minimum performance requirements.
The analog section uses an ultra-low-power mixed-signal Analog Front End (AFE) which integrates a dual 10-bit, 52 dB for the RX and 57 dB for the TX. The maximum theoretical signal-to-noise ratio (SNR) for an N = 10 bit A2D, measured in dB, is given by a well-known equation:
$$\mathrm{SNR(dB)} = 6.02\,N + 1.76. \tag{1}$$
In practice, the quantization noise is added to the A2D internal noise and harmonic distortion, resulting in a smaller number of usable bits, as follows:
$$\mathrm{SINAD(dB)} = 6.02 \cdot \mathrm{ENOB} + 1.76, \tag{2}$$
where ENOB is the effective number of usable bits and SINAD is the signal to noise and distortion ratio. The accuracy of the A2D highly depends on the quality of the clock being used. A clean, low-jitter clock translates to an ENOB value closer to the theoretically computed value. The SNR as a function of clock jitter is given by the following equation:
$$\mathrm{SNR(dB)} = -20 \log_{10}\left(2\pi f_{\mathrm{analog}}\, t_{\mathrm{jitter}}\right), \tag{3}$$
where $f_{\mathrm{analog}}$, expressed in Hz, is the sampling rate and $t_{\mathrm{jitter}}$, in seconds, is the RMS value of the jitter. From the previous equation it follows that
$$t_{\mathrm{jitter}} = \frac{10^{-\mathrm{SNR}/20}}{2\pi f_{\mathrm{analog}}}. \tag{4}$$
For a 10-bit analog-to-digital converter [6], at 22 MHz sampling rate, the above equations lead to the following.
(i) Maximum theoretical SNR is 61.96 dB.
(ii) Theoretical SINAD with 8.77 ENOB is 54.6 dB.
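The two figures in the list can be reproduced with a few lines of arithmetic, using the standard relations SNR = 6.02 N + 1.76 dB for an ideal N-bit converter and SINAD = 6.02 ENOB + 1.76 dB. The function names below are chosen for illustration only.

```python
def ideal_snr_db(n_bits):
    """Quantization-limited SNR of an ideal N-bit converter, in dB."""
    return 6.02 * n_bits + 1.76


def sinad_db(enob):
    """SINAD corresponding to a given effective number of bits, in dB."""
    return 6.02 * enob + 1.76


def enob_from_sinad(sinad):
    """Inverse relation: effective number of bits for a measured SINAD."""
    return (sinad - 1.76) / 6.02


print(f"Ideal SNR, N = 10 bits: {ideal_snr_db(10):.2f} dB")
print(f"SINAD at ENOB = 8.77:   {sinad_db(8.77):.1f} dB")
print(f"ENOB lost to noise and distortion: {10 - 8.77:.2f} bits")
```

The inverse function is the form most often used in practice, since SINAD is what is measured on the bench and ENOB is then derived from it.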
To achieve the theoretical SNR, the jitter of the sampling clock must be less than 0.5 nanoseconds. The noise external to the A2D also plays a crucial role. Minimizing the switching noise from the power supply, such that the A2D Power Supply Rejection Ratio (PSRR) stays within the ±0.4 LSB given in the A2D specifications, requires special design rules to be taken into consideration. First, we improved the AFE total SNR by employing independent power supplies for the analog side (+3 V) and for the digital side (+2.5 V) of the AFE. Second, we created an analog path for the sampling clock. Separating the analog ground from the digital ground on the PCB in the AFE region usually does not bring the expected results, because it creates large ground loops between the analog and digital sections. We chose a common ground plane split into analog and digital regions instead. However, keeping the ground plane noise free in a small mixed-signal PCB design is a major challenge.
Each ZIF transceiver integrates the Low-Noise Amplifier (LNA), the digital gain control, Voltage Controlled Oscillator (VCO), fast settling Sigma Delta fractional N synthesizer, and the programmable baseband filters. A simplified functional block diagram is illustrated in Figure 6 .
The frequency reference for both transceivers is provided by a single clipped sinusoidal signal, derived from a very low phase noise oscillator.
Jitter and phase noise are different ways of quantifying the same phenomenon: the uncertainty at the output of an oscillator. Jitter is the time-domain measure of the timing accuracy of the oscillator period, and phase noise is a frequency-domain view of the noise spectrum around the oscillation frequency. There is no known correlation between all sources of jitter; thus jitter cannot be predicted in practice [7]. In communication systems, the reference frequency phase noise directly impairs the overall performance by increasing the error vector magnitude (EVM) of the demodulator [8]. A clipped sine wave exhibits less harmonic content; thus the noise induced in the analog section is lower. To keep the noise as low as possible, the PLL circuits are supplied separately from an ultra-low-noise LDO with a PSRR of about 54 dB at 10 kHz. For the analog clock distribution we chose 50 Ohm impedance traces with simple DC blocking capacitors between the TCXO output and the transceivers' PLL clock inputs. Running a strip line for the clock trace between two adjacent ground planes significantly lowers the clock noise, interference, and reflection (the load impedance is matched to the trace). Careful design of the clock distribution network is required to minimize the phase noise. In our experience, any clock distribution IC adds extra phase noise, and the degradation can be dramatic. For instance, a TCXO with specified phase noise better than −145 dBc/Hz at 10 kHz may lose more than 20 dB if the signal is buffered, even with very low jitter and skew buffers. The reason is that the dominant noise type is flicker of phase, with a slope of 10 dB/decade, at around 10 kHz offset from the carrier (see Figure 7), which is amplified by the buffer's jitter component. 
To avoid the extra noise added by the clock distribution network, we chose to use two separate oscillators, one for the RF front end frequency reference and the other for the sampling clock.
Circuit Board Design and Noise Minimisation
Circuit board design complexity increases with increased component density and smaller board size. The component with the finest pitch, in other words the largest ball density, determines the layer stackup configuration that gives the best trace escape solution under the BGA packages.
International Journal of Digital Multimedia Broadcasting
Even though standard layer stackup configurations are available, trace density on small designs often requires nonstandard solutions for the printed circuit board design. Our board is designed on a 12-layer stack as presented in Figure 8, using 0.008 in (0.20 mm) mechanical buried vias, 0.010 in (0.25 mm) through hole vias, and 0.004 in (0.10 mm) laser drilled and filled microvias. The overall board thickness is about 0.040 in (1 mm). The critical routing component is the SB3500 SoC, with 529 balls distributed on an 11 × 11 mm array with a 0.5 mm pitch. The primary escape layers for the SB3500 signals are L1, L2, L3, and L4. Supply layers are L4-L5 and L8-L9, which use a buried capacitance (ZBC2000) prepreg between them. Layers L7 and L8 are used for clock and differential routes. Layers L10 and L11 carry signals, while layers L1 and L12 are used for component placement and ground plane. All signal layers also carry ground planes to increase the noise immunity between signal routes [5]. For the SB3500 escaping signals, triple stacked microvias are necessary on layers L1-L2, L2-L3, and L3-L4. The board must be built symmetrically to equalize interlayer stresses, as a manufacturing process requirement (to prevent warpage), hence the existence of stacked microvias on L9-L10, L10-L11, and L11-L12. Buried vias are used for transferring signals from L2, L3, and L4 to the lower signal layers (L10, L11), but also to create shorter paths for the filtering capacitors placed on the bottom layer L12, below the BGA packages placed on L1. All power supplies use planes for routing, distributed on the supply layers. The small power supplies use signal layers L3 and L10 for routing. The PCB component placement for this SDR design is shown in Figure 9.
In digital systems, filtering capacitors are used to suppress the noise generated by the switching clock at least up to the third or fifth harmonic. The high-frequency noise component must be suppressed as close as possible to the source. Sometimes this is impossible because the noise source, for instance the supply ball of a PLL in a BGA, can be reached only with a trace, which then becomes an RF emitter.
Figure 11: QPSK WiMAX TX spectrum at +21 dBm output power.
The equivalent impedance of a real filtering capacitor, modeled as a series combination of its capacitance, parasitic inductance, and equivalent series resistance (ESR), is
$$|Z| = \sqrt{\mathrm{ESR}^2 + \left(2\pi f L - \frac{1}{2\pi f C}\right)^2}, \tag{5}$$
with minimum value at the resonant frequency:
$$f_{\mathrm{res}} = \frac{1}{2\pi\sqrt{LC}}, \tag{6}$$
where C is the capacitor value and L is the parasitic inductance (ESL) of the capacitor as given in the specifications. From (5), the best filtering capacitor will have ESR = 0 and the lowest possible ESL. The filtering capacitor suppresses noise best at the frequency which minimizes the equivalent impedance. Lower equivalent capacitor inductance and resistance are achieved through short connection traces between the capacitor terminals and the BGA balls. Unfortunately, routing the BGA power balls to the ground and power planes is done using traces and vias, which are both inductive and resistive. Physically, a capacitor can be installed on the same layer as the BGA balls, near the BGA package, or on the opposite side, below the package. In both situations the capacitor requires at least two vias to connect its terminals to the power planes. Equation (6) and Figure 10 tell us that suppressing noise over a wide frequency range requires two or three standard capacitors with different parameters connected in parallel, which increases the number of capacitors to about 500 for a quite small board and makes low-noise routing almost impossible, even without taking the added cost into consideration.
One solution to this problem is the use of reversed geometry low ESL filtering capacitors. Figure 10 shows the difference between using three standard filtering capacitors mounted in parallel (330 pF, 10 nF, and 0.15 uF) versus two parallel 10 nF and 0.1 uF reversed geometry low ESL capacitors. As seen, the smaller impedance in the frequency range 25 MHz-1 GHz is achieved by capacitors 4 and 5 connected in parallel. This way, the total number of filtering capacitors was reduced to about 2/3 compared with standard capacitors. Using high-value capacitors (1 uF-2.2 uF) for the lower-frequency range is still necessary, but the number of such capacitors is small and they are equally distributed on the printed circuit board.
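A comparison of this kind can be approximated numerically by modeling each capacitor as a series R-L-C branch and combining the branches in parallel. The sketch below uses the capacitor values quoted above, but the ESL and ESR figures are hypothetical placeholders, not values from the paper or from any datasheet, so the printed numbers illustrate the method only.

```python
import math


def cap_impedance(f_hz, c_farad, esl_henry, esr_ohm):
    """Complex impedance of a capacitor modeled as a series R-L-C branch."""
    omega = 2.0 * math.pi * f_hz
    return complex(esr_ohm, omega * esl_henry - 1.0 / (omega * c_farad))


def resonant_frequency(c_farad, esl_henry):
    """Self-resonant frequency, where the branch impedance reduces to ESR."""
    return 1.0 / (2.0 * math.pi * math.sqrt(esl_henry * c_farad))


def parallel_impedance(f_hz, caps):
    """Impedance of several (C, ESL, ESR) branches connected in parallel."""
    return 1.0 / sum(1.0 / cap_impedance(f_hz, *cap) for cap in caps)


# Hypothetical parasitics (assumptions for illustration only)
STANDARD_ESL = 1.0e-9   # 1 nH for a standard geometry part
LOW_ESL = 0.2e-9        # 0.2 nH for a reversed geometry part
ESR = 0.05              # 50 mOhm for every part

standard_bank = [(330e-12, STANDARD_ESL, ESR),
                 (10e-9, STANDARD_ESL, ESR),
                 (0.15e-6, STANDARD_ESL, ESR)]
low_esl_bank = [(10e-9, LOW_ESL, ESR),
                (0.1e-6, LOW_ESL, ESR)]

for f_mhz in (25, 100, 400, 1000):
    f = f_mhz * 1e6
    z_std = abs(parallel_impedance(f, standard_bank))
    z_low = abs(parallel_impedance(f, low_esl_bank))
    print(f"{f_mhz:5d} MHz  standard: {z_std:7.3f} Ohm  low ESL: {z_low:7.3f} Ohm")
```

A fuller model would also add the inductance of the mounting vias and traces in series with each branch, since the text notes that these parasitics are significant.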
Platform Validation
Design validation was performed against the WiMAX Forum Mobile Radio Conformance Tests (MRCT) [10] for a Category 1 mobile station. Next, we reproduce the most critical measurements defined by the standard, such as Error Vector Magnitude (EVM) at the maximum transmit power, spectral mask emission, and maximum receive sensitivity. The measurement results are illustrated in Table 1. The maximum EVM required by the MRCT specification is −24 dB (6% RMS), while we measured −27.95 dB at 21 dBm transmit power. Figure 12 illustrates the Vector Signal Analyzer (VSA) screen capture for a QPSK waveform at 21 dBm transmit power. As shown in Figure 11, at 6 MHz from the central carrier we measured an attenuation of 19 dB, compared to 13 dB required by the WiMAX standard.
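The EVM figures are quoted in dB while the requirement is also given as a percentage. The conversion between the two, using the usual amplitude-ratio definition EVM(%) = 100 · 10^(EVM(dB)/20), and the resulting margins can be checked as follows (function names are illustrative):

```python
import math


def evm_db_to_percent(evm_db):
    """Convert EVM from dB to percent RMS (amplitude ratio, hence /20)."""
    return 100.0 * 10.0 ** (evm_db / 20.0)


def evm_percent_to_db(evm_percent):
    """Convert EVM from percent RMS to dB."""
    return 20.0 * math.log10(evm_percent / 100.0)


required_db, measured_db = -24.0, -27.95
print(f"Required EVM: {required_db} dB = {evm_db_to_percent(required_db):.1f}% RMS")
print(f"Measured EVM: {measured_db} dB = {evm_db_to_percent(measured_db):.1f}% RMS")
print(f"EVM margin:   {required_db - measured_db:.2f} dB")

# Spectral mask at 6 MHz offset: measured attenuation versus requirement
mask_required_db, mask_measured_db = 13.0, 19.0
print(f"Mask margin:  {mask_measured_db - mask_required_db:.1f} dB")
```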
For WiMAX Wave 2, the total processor utilization is around 75% with all cores running at 600 MHz.
Conclusions
We presented a 4G low-cost SDR platform based on the SB3500 DSP from Sandbridge Technologies. Practical design considerations as well as physical measurements and performance data were described throughout the paper. As far as we are aware, this is the only existing low-power, low-cost, positive gain-MIMO antenna based- [11, 12] 
