Recently, a number of companies have announced that they are developing new devices for signal processing based on novel architectures. picoChip have recently sampled just such a platform, together with a complete toolchain & systems library. This product is based on a massively parallel array of heterogeneous processors and delivers extraordinary computational power, eliminating many of the constraints imposed by more traditional architectures. This paper will discuss how parallel architectures can be applied to demanding signal processing systems, and what trade-offs such an approach entails. It will discuss how such devices can be programmed & how they fit into a traditional development environment. The discussion will be based around the case study of a real design exercise for a 64 channel 3G (WCDMA) basestation.
Introduction
Everywhere in communication systems, increasingly sophisticated algorithms are being used to support higher data rates and richer services. This is true in all application areas, but perhaps most visibly in mobile, where the move to third generation is driving significant changes in component design for telecoms equipment. In addition to basic voice and messaging, UMTS paves the way for telecom operators to offer sophisticated data oriented services that industry analysts predict are essential for revenue growth over the next decade.
As people strive for higher data rates or longer reach over fixed channels, data rates get ever-closer to Shannon's limit and more sophisticated algorithms are required. Indeed, the requirement for signal processing is rising ten to a hundred times faster than Moore's Law can deliver increased performance. Estimation and detection algorithms in today's communication systems require the number of operations per second to grow by a factor of ten every four years; that compares to the increase in processor speed from Moore's law of a factor of ten every six years [1] .
Worse, while Moore's law holds well for general purpose processors and memory, the difficulty of integrating ever bigger systems means that the growth curve for complex System-on-a-chip ("SoC") ASICs is significantly slower (the "design gap"), with a CAGR of 22%.
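The growth figures quoted above can be put on a common annual basis. The following is a minimal sketch using only the numbers stated in the text (tenfold every four years for algorithmic demand, tenfold every six years for processor speed, and a 22% CAGR for complex SoCs); the function names and the ten-year horizon are illustrative assumptions, not part of the original analysis.

```python
def annual_growth(factor: float, years: float) -> float:
    """Convert 'factor every N years' into an equivalent annual growth rate."""
    return factor ** (1.0 / years) - 1.0


def growth_over(rate: float, years: float) -> float:
    """Total multiplicative growth after compounding 'rate' for 'years'."""
    return (1.0 + rate) ** years


# Figures quoted in the text.
demand_rate = annual_growth(10, 4)   # algorithmic demand: 10x every four years
moore_rate = annual_growth(10, 6)    # processor speed: 10x every six years
soc_rate = 0.22                      # "design gap" CAGR for complex SoC ASICs

HORIZON_YEARS = 10  # illustrative assumption, not taken from the text

if __name__ == "__main__":
    for name, rate in [("demand", demand_rate), ("Moore", moore_rate), ("SoC", soc_rate)]:
        total = growth_over(rate, HORIZON_YEARS)
        print(f"{name:>6}: {rate:6.1%} per year, x{total:.1f} in {HORIZON_YEARS} years")
    gap = growth_over(demand_rate, HORIZON_YEARS) / growth_over(moore_rate, HORIZON_YEARS)
    print(f"demand outgrows processor speed by x{gap:.1f} over {HORIZON_YEARS} years")
```

Expressed this way, the three curves can be compared directly as compound annual rates rather than as "factor per period" statements with different periods.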
Not only must equipment deliver improved performance, but design times are under pressure and budgets are stressed, often in an environment where standards are shifting.
As previous articles have discussed, a fundamental change of approach is required, and there is a growing awareness of the attractiveness of reconfigurable DSP, flexible architectures or other software defined radio (SDR) systems. Makimoto's wave [2] would suggest such a transition is overdue.
Most discussions of these techniques include qualitative terms such as "efficient", "optimal" or "cost effective" [3] , [4] . Unfortunately, without analytic justification these statements are often little more than optimistic beliefs. This paper provides a structure that considers all the various "cost drivers" and integrates them to allow a quantitative model to be constructed reflecting the Total Cost of a given development.
The analysis is based on a standard 3G (WCDMA) basestation, and focuses on the baseband signal processing, including chip rate, symbol rate and baseband control functionality. This has been chosen partly because of its commercial importance but, more relevantly, because its complexity probably represents the "high water mark" of signal processing complexity.
While this paper considers development from the perspective of a manufacturer, an analogous model has been developed for operators, reflecting the strategic value of flexibility & software upgrades to them.
Components of Total Cost
There are a number of elements in a development that feed into the overall cost of that development. Crucially, nearly all of these are determined at a very early stage in the design cycle; some estimates [5] suggest that 90% of the cost structure of a product is determined within the initial specification. In the context of an SDR we would wish to consider explicitly the impact of including "reconfigurable" in that specification.
Direct Cost
This is the simplest and most obvious element in the cost of a system. For a silicon device, the primary driver of the cost is area, both because with a larger die more wafers are required, and because yield falls.
In general, it is here that the benefits of a reconfigurable solution extract their penalty: there is a trade-off between flexibility and price such that (for equivalent performance) a flexible part will cost significantly more than the equivalent fixed function part. For example, on a simplistic basis, the unit cost of a 'hardcoded' ASIC may be 100 times less than an equivalent FPGA.
Amortised Non-Recurring Expense: Direct or Product
The second major cost is direct NRE: the "set-up" or "tooling" costs that are directly associated with a specific project. Most obviously, for a dedicated ASIC this includes mask-making and fabrication. This is increasingly expensive. Recently, Aart de Geus, CEO of Synopsys, was quoted as saying "The cost of designing a complex SoC in copper on 0.13u is $10.9million, and we are reaching the point where testing the chip may be more expensive than fabricating it" [6].
While flexible approaches have a penalty in higher unit cost, they benefit from requiring no significant specific tooling or set-up charges. There is a straightforward "Break Even" volume at which the NRE and unit costs balance; above this point costs are amortized and an ASIC is economic, but for smaller runs an FPGA is better. Importantly, as technology advances and process geometries shrink, this breakeven volume increases dramatically (perhaps a 10X increase between 0.25u and 0.13u).
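The break-even volume described above follows from equating the total cost of the two options. The sketch below is illustrative only: the NRE and unit-cost figures are hypothetical placeholders chosen to show the shape of the relationship, and the function name is an assumption.

```python
def break_even_volume(nre: float, unit_cost_fixed: float, unit_cost_flexible: float) -> float:
    """Volume at which total ASIC cost equals total flexible-device cost.

    ASIC total:     nre + volume * unit_cost_fixed
    Flexible total: volume * unit_cost_flexible   (no significant NRE assumed)
    """
    saving_per_unit = unit_cost_flexible - unit_cost_fixed
    if saving_per_unit <= 0:
        raise ValueError("the flexible part must cost more per unit for a break-even to exist")
    return nre / saving_per_unit


if __name__ == "__main__":
    # Hypothetical figures, NOT taken from the paper.
    older_process = break_even_volume(nre=1_000_000, unit_cost_fixed=10, unit_cost_flexible=100)
    newer_process = break_even_volume(nre=10_000_000, unit_cost_fixed=10, unit_cost_flexible=100)
    print(f"break-even, lower NRE:  {older_process:,.0f} units")
    print(f"break-even, higher NRE: {newer_process:,.0f} units")
```

With the per-unit saving held constant, the break-even volume scales linearly with NRE, so a tenfold rise in tooling cost implies a tenfold longer production run before the ASIC becomes economic.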
Allowing for risk, this implies a very long production run is required, and that leads into another worry: a significant concern in "leading edge" environments is that the design life must be long enough to reach this economic volume. In the context of 3G, standards are evolving so fast (Release 99 to R4 to R5) that, while costs are incurred, the design may be obsolete before production...
Amortised Non-Recurring Expense: Indirect or Global
A related, but more subtle cost relates to the development environment. This may be more mental/indirect than the costs of mask making, but is potentially more important. This is familiar from the loyalty developers have to processors or architectures (eg the ARM processor for mobile handsets, or the PowerPC in networked communications), where the investment in tools, trusted code modules and "mental furniture" is so huge that the cost of changing is too high to even be considered.
This may, in the long term, be a more significant benefit from a flexible architecture: not only can costs be spread across a product, but they can be spread across a whole family of different products. The core remains the same, but the unit can be reprogrammed to adapt to new needs. A reconfigurable part (or perhaps, a family of parts) that can be used in many products over different segments can be more useful than one that is overly focussed.
Power
Power dissipation has a significant impact on the total cost of a system. As power consumption increases, not only are larger supplies required but the cost of cooling increases dramatically (passive vs fans or chillers). Significantly, dissipation can limit the number of channels or cards which can be installed, so the unit cost rises rapidly as the fixed costs (controller, PSU, rack, floor-space) are spread over fewer units. Additionally, increased power dissipation and increased temperature accelerate device failures, with significant cost implications.
Time
Another thing to consider is development time. "Time is money" may be a cliché, but it is still true. Obviously, development time costs money, but potentially more significant is "Time To Market" (TTM). Many studies have shown how a delay in launch can have a major impact on sales volume, and an even more dramatic effect on lifetime profit; some studies suggest that launching six months early triples profits relative to the "expected" schedule, while a six-month delay results in little more than breakeven.
Value of Flexibility
There is a tension between optimisation and reconfigurability. A stable product, with high volumes, will suit dedicated/optimised ASIC components. Conversely, flexible and reconfigurable devices have value when the application is confused, the exact technology requirements are in a state of flux, or the volume is insufficient or uncertain. Figure 1 illustrates this. As demand for a product grows, peaks, and decays the value of having a flexible architecture shifts. Considering the progress of a technology in a market, successful suppliers will adopt different strategies at different phases in market acceptance [7] .
In the early stages of a technology's adoption, it is very important to be flexible, since the precise standards and requirements are unclear; the successful developer will thus seek to exploit technological options to differentiate themselves and define the product that best meets customers' needs. This phase ends after the market has transitioned to a mass market (i.e. "crossed the chasm" [8]). In the next phase, there is little value in flexibility or variation; instead the emphasis is on operations, to keep up with the explosive growth in demand and to reduce prices. The emphasis is on cost leadership. In this phase, flexibility is of very little value (but note that a flexible architecture may still be very valuable as a tool to reduce cost). Finally, as the market matures, cost tends to stabilise. In this phase, the important thing is to identify particular niche requirements, and target products to those; in other words, focus is key.
The application described here is a 3G (WCDMA) basestation, but the analysis would be very similar for other complex technologies (cdma2000, TD-SCDMA, next-generation DSL, broadband wireless).
Historically, basestations used ASICs to provide dedicated processing power, but the fluidity of 3G standards is driving development towards reconfigurable systems.
To develop a complex ASIC for a base station in contemporary silicon technology requires a large team of engineers: perhaps $100m development cost and a 36-month design gestation. If there is a design problem, the re-spin of the device will increase cost & impact time-to-market. Given the pace of standards evolution, and the realities of the market this is an unacceptable prospect.
[Figure 1 labels. Differentiation: early adopters; immature technology; flexibility required to develop successful product. Cost Leadership: mass market; emphasis on deploying in volume; optimise cost. Focus: late market; requirements understood; flexibility used to deliver special features to add value to niche opportunities.]
The ASICs typically handle the chip-rate processing, which requires speed but is bounded enough to be stable. The DSPs handle the symbol-rate processing of the complex protocols and decode the data in the stream, offering flexibility, while FPGAs are used to deliver both flexibility and speed. These clusters are typically controlled and managed by a powerful RISC processor such as a PowerPC.
FPGAs now offer million-gate densities and gigabit-per-second interface speeds, and high-end devices can now be used for chip-rate processing. They are extremely versatile and universal: the same device may be designed in to a router for layer 3 MAC processing, or DSP PHY in a Node B. However, this universal applicability implies a trade-off with optimisation for a particular role, while the low-level granularity they offer is ill-suited to efficiently and quickly implementing complex tasks.
DSPs clearly have flexibility, but traditionally did not deliver the sheer computational horsepower required in these systems. In order to achieve performance, complex pipelines and multiple instruction execution units are employed, requiring a sophisticated compiler to exploit them; and while contemporary DSP compilers are more efficient than previous generations, there is a performance impact which makes it hard to realise the theoretical MIPS. Algorithm-specific hardware accelerates computationally intensive functions, but even at 600MHz the performance is insufficient to replace an FPGA or ASIC. The solution is to use multiple devices, and some processors are expressly optimised for such multi-processor systems. The drawback (apart from the sheer cost of using many expensive processors and the power dissipation) is in the difficulty of coordination, programming and verification; partitioning algorithms across multiple devices and ensuring they are in synchronisation is a challenging task. Indeed, it is usually impossible to give deterministic performance metrics, and massive testing is required to give statistical assurance.
Both FPGAs and DSPs are very general-purpose, and this "universal" flexibility has an expensive price tag. Integrating a large number of these devices is challenging & test/verification is a major problem. Worse still, although the individual parts are programmable in isolation, the delicate integration task often results in a "house of cards" where the whole system is so delicately balanced that upgrades are not possible.
Reconfigurable Architectures
Recognising the above issues, a growing number of companies have developed specialised reconfigurable baseband devices, striking a different balance between wide applicability & optimisation, to deliver efficient solutions to just this problem-set. To wildly simplify, there are three approaches:
The first, which might be called "FPGA+", is to add a number of higher-level or higher complexity functional blocks to a general purpose device to optimise it for a specific purpose such as wireless. Examples of this may be Chameleon, Elixent or the newer devices from FPGA suppliers. These may include some very rich function blocks or architectures. While supporting functionality at a higher level than conventional FPGA, and with a great degree of versatility, these devices still share the flexibility of that path and allow OEMs to include their own IP. However, as general purpose devices there is a trade-off of universality versus suitability for particular applications.
A second approach is to develop a reconfigurable system based around a "Programmable Application Specific Standard Product" (P-ASSP), which consists of a general purpose core supplemented with a number of optimised coprocessors or kernels (e.g. for the multi-path searcher or equaliser). A good example is the 'Wireless Systems Processor' of Morphics (described in [1]), where kernels are targeted to support a class of operations found in a set of. Each kernel implements both the dataflow and control tasks associated with a task, contains sufficient memory for that role and implements local communications through a configurable interconnect.
This has clear attractions in terms of power dissipation and cost, in that some large and complex blocks will have been optimised to perform specific operations. In addition, the use of multiple devices is well aligned to the parallel nature of the application.
However, an obvious concern is that the flexibility and reconfigurability of the system is totally limited to the functionality coded into these blocks. Given the pace of change of systems and architectures, it is hard to be confident that any hard-coded kernel will indeed be suitable (or even usable) when the requirement for an update arrives. Additionally, in many cases the core IP and expertise resides in the system manufacturer, which is then difficult to include in the device.
A third approach, based around a parallel array of processors, is used by picoChip, QuickSilver and PACT. Instead of the legacy DSP approach, with a small cluster of very powerful discrete CPUs exploiting instruction parallelism, these have a very large number of "appropriately sized" devices on a single die interconnected by a very fast on-chip fabric. In such a computing fabric, tasks can be mapped directly onto CPUs almost as easily as drawing a block diagram. An attraction of this approach for signal processing (as opposed to more general computation) is that it matches the inherent parallelism both within the DSP algorithms and across them for multiple data streams. In contrast, a conventional processor wastes effort converting parallel structures (across users, across DSP algorithms) to a serial instruction stream, executing it, and then reversing the process. In the 3G context this process incurs latency that directly impacts capacity in inner loop power control.
Briefly, PACT has an array of 32-bit/floating point processors supplemented by RISC cores, with its own programming paradigm. QuickSilver is based on an interconnected array of configurable processors. Indeed, the processing elements themselves are configurable arrays of interconnected execution units that QuickSilver describes as "a fractal architecture". It employs four types of execution nodes: scalar (actually a MIPS RISC core), arithmetic, bit manipulation and finite state machine nodes. The picoChip device features a deterministic high-speed switching matrix linking a heterogeneous array [9] of many hundred programmable 16-bit processors to handle the high-speed chip-rate functions, the lower-speed symbol-rate processing and the control functions in a single structure.
In a parallel device, it is important that the granularity of these elements is well aligned to the tasks within a communications system, striking a balance between the very fine granularity of a universal FPGA and the "big chunks" of a powerful DSP. In general, there are two distinct classes of operations:
• Dataflow: where operations will be regular and predictable (whether stream or block) and may be fast (e.g. chip rate processing). This will typically require many elements "clumped" together, and it is important that interconnect arrangements are both fast and deterministic. There is a large degree of parallelism (both within algorithms and across multiple instances). These are the operations typically done in an FPGA (if high speed is required) or in a DSP.
• Control: operations which are "diffused" across the entire system and must interact with many individual blocks. Typically these tasks are individually quite simple, but can be aggregated together. This code will be serial, and will need to support many different options or switches for specific cases or modes. Conventionally, these operations may be implemented in a DSP or in a powerful microcontroller, and will likely be coded in C to facilitate complicated modes and options.
[Figure: array structure diagram. Labels: Switch Matrix; Processing Element (P); Inter-picoArray Interface (I).]
The mismatch between these tasks and the disparate implementations is often what makes integration a nightmare for large systems. Not only must hardware & software be brought together but so must three different types of hardware (FPGA, DSP & uC), coded in fundamentally different languages/philosophies (VHDL, DSP assembler, some form of "C for DSP", RISC assembler and RISC C, perhaps with an RTOS too) with no integration or support for unified debugging. In contrast, in the picoChip solution, all of these tasks are implemented in the same device, using consistent tools and a completely unified development environment.
First, the structure of an array and the arrangement of the different element types across the chip should reflect the balance of requirements of a wireless system. Secondly, the characteristics and instruction set of elements should include support for specialist operations such as spread/de-spread or compare-add-select, or larger memory complements for control type operations.
One final issue to discuss is the definition of "reconfigurability", which is an elastic term, used in somewhat ambiguous ways. Some applications require intermittent or rare changes: for example, updating a major piece of network functionality for a significant algorithm improvement will be an infrequent occurrence requiring significant testing. This is better termed "reprogrammable" or "field programmable" by analogy with an FPGA. Other applications may change more often, perhaps between different discrete applications (for example, Elixent discuss using a reconfigurable fabric embedded in a consumer device to change between an MP3 player and image compression for a camera). At an even more extreme case, QuickSilver have discussed changing functionality at a 60kHz rate, allowing different functions within the same system to be switched in "on the fly" as required. However, as discussed below, verification is a problem faced by all system designers; adding this element of dynamic flexibility may not help the situation.
The picoChip architecture is reprogrammable, and may be used in the following typical tasks:
• Updating a basestation to a new revision of the standard (eg from Release 4 to Release 5)
• Implementing algorithm improvements or feature upgrades
• Fixing bugs or solving interoperability issues in the field
• Dynamic updates of base-station characteristics, perhaps matching processing resources to traffic patterns to optimise revenue.
Programming and Verification
The system developer must guarantee reliability under all circumstances, even when contention for resource arises (for example, a new subscriber entering the cell and demanding bandwidth for a high speed data call), and that no information is ever lost or corrupted.
The conventional approach (mixing ASIC, FPGA and DSP) complicates design as resources are distributed across multiple architectures, with different tool-chains & different modes of behaviour. Given that the availability of resources at any given time cannot be guaranteed, the only way to ensure correct operation under every interrupt or contention scenario is exhaustive testing. This process typically has a severe impact on development time, implementation schedules and ultimately costs.
Several commentators have identified development & verification as a critical challenge: "while reconfigurable devices have attractive hardware features, unless they can deliver a 'comfortable' programming environment for engineers and (critically) a powerful and trusted simulation/verification regime, they will not be used".
The goal is to achieve a system where the designer can accurately predict the final performance from cycle-accurate deterministic (rather than traditional statistical) simulations.
The parallel-array devices should have the potential to do this. First, their architecture reflects the inherently parallel nature of the task, making it easier to align tasks with functional blocks. Secondly, the granularity of structure aligns well to the mix of data and control functions, allowing control functionality to be "bundled" closely to an associated dataflow region without needing complex communications protocols or inter-chip signalling. Finally, because they are intended to address the entire system task, with a single tool-chain, the system integration problems of mixing different devices/architectures should be eliminated.
For example, the picoChip toolchain for both control tasks & signal processing blocks is based on standard C & tightly integrates design, compilation & verification. Deterministic performance was a major focus, and resources are all allocated at compile time (not at runtime, i.e. there is no scheduling or arbitration) to guarantee a simulated result that is both bit-accurate & cycle-accurate.
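The general idea of allocating a shared resource before anything runs can be illustrated with a small sketch. This is not the picoChip toolchain or its algorithm; it is a generic static slot allocator, and the transfer names, periods and frame length are all hypothetical. Each transfer is given fixed slots in a repeating frame of bus cycles, and over-subscription is reported at build time rather than discovered as contention at run time.

```python
from typing import Dict, List, Optional


def build_schedule(transfers: Dict[str, int], frame_len: int) -> List[Optional[str]]:
    """Assign each transfer fixed slots in a repeating frame of bus cycles.

    'transfers' maps a (hypothetical) transfer name to its period in cycles;
    each period must divide 'frame_len'. Allocation fails here, before
    anything runs, if the bus is over-subscribed.
    """
    frame: List[Optional[str]] = [None] * frame_len
    # Place the most frequent transfers first: they are the hardest to fit.
    for name, period in sorted(transfers.items(), key=lambda kv: kv[1]):
        if frame_len % period != 0:
            raise ValueError(f"{name}: period {period} does not divide frame length {frame_len}")
        for offset in range(period):
            slots = range(offset, frame_len, period)
            if all(frame[s] is None for s in slots):
                for s in slots:
                    frame[s] = name
                break
        else:
            raise ValueError(f"{name}: no free slot, bus is over-subscribed")
    return frame


if __name__ == "__main__":
    # Hypothetical transfers: a fast chip-rate stream, a slower symbol-rate
    # stream and an occasional control message.
    schedule = build_schedule({"chip_in": 2, "symbol_out": 4, "control": 8}, frame_len=8)
    for cycle, owner in enumerate(schedule):
        print(cycle, owner or "idle")
```

Because the table is fixed in advance, the cycle on which each transfer occurs is known exactly, which is the property that makes a cycle-accurate simulation meaningful.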
Figure 4: A Comparison of Development Times
One of the major impacts from verification & test is on project time, which in turn drives development cost.
Comparison Of Different Approaches
A simple summary of performance comparing the PC101 with a number of other devices is shown in Table 3 . While it is well known that MOPS or MMACs are a simplistic measure of performance and should be treated with care (especially when comparing dissimilar architectures), the results are striking. This kind of measurement usually overestimates actual performance, as a substantial portion of processor resource is devoted to "housekeeping" and resource management (packing and unpacking data, switching between tasks, managing contexts & the like). However, the picoArray (and some of the other parallel approaches) do not suffer from this; tasks are allocated across dedicated processors, each of which can operate continuously & in a well-optimised way. As such the "true" performance is much higher.
A more detailed benchmarking of the different approaches is described in [10]. This analysis focuses on the underlying efficiency of each, considering the computational density and power consumption for two representative algorithms, an FFT and a Viterbi decoder.
Figure 5: (a) Computational Density of FFT; (b) Energy Efficiency of FFT; (c) Computational Density Comparison of Viterbi Decoder; (d) Energy Efficiency of Viterbi Decoder
It is interesting to compare the differences within these benchmarks. As expected, it is clear that the ASIC delivers significantly better efficiencies than the more general purpose devices. However, the differences in performance among the reconfigurable processors are noteworthy, with about a ten-fold difference between best & worst across both energy efficiency & computational density. It is also noteworthy how performance can vary across benchmarks (say, with different size FFTs). However, the trends as to which architectures are most efficient (in area or power) are clear.
In terms of computational density, the picoArray is clearly the most efficient, with between a 3X and a 10X advantage over the next best reconfigurable technology, and as much as two orders of magnitude over a legacy DSP. In terms of power efficiency, the picoArray is second to a device optimised for handsets, but better than any of the other reconfigurable parts. (It is not surprising that a fixed function ASIC is best in both dimensions, but this analysis does not consider the costs of inflexibility or NRE.)
It is worth noting that some of these devices do support some further optimisations that are not reflected in these results. For example, the picoArray includes "spread/despread" instructions that can replace 50 conventional instructions in a CDMA receiver.
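To show the kind of work such an instruction targets, the sketch below spreads and despreads a short symbol sequence in plain Python. It is a simplified, real-valued illustration only: the 8-chip code is a hypothetical example, the function names are assumptions, and nothing here reflects the actual picoArray instruction set or the complex-valued codes used in WCDMA.

```python
def spread(symbols, code):
    """Spread each +/-1 symbol into len(code) chips."""
    return [s * c for s in symbols for c in code]


def despread(chips, code):
    """Correlate each block of len(code) chips against the code.

    Returns one soft value per symbol. The inner multiply-accumulate loop
    is the kind of work a dedicated instruction could collapse.
    """
    sf = len(code)  # spreading factor
    if len(chips) % sf != 0:
        raise ValueError("chip count must be a multiple of the spreading factor")
    return [
        sum(chip * c for chip, c in zip(chips[i:i + sf], code))
        for i in range(0, len(chips), sf)
    ]


if __name__ == "__main__":
    code = [1, -1, 1, 1, -1, 1, -1, -1]   # hypothetical 8-chip code
    symbols = [1, -1, 1]
    chips = spread(symbols, code)
    soft = despread(chips, code)
    decisions = [1 if v >= 0 else -1 for v in soft]
    print(soft, decisions)
```

In software on a conventional processor, each output symbol costs one multiply and one add per chip plus loop overhead, repeated for every finger and every user, which is why collapsing the loop into a single operation matters at chip rate.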
While these results are important & illuminating, it is worth noting that architectural comparisons should not be linked only to discrete computational kernels, but to the efficiency of a representative complex system. The instantiation of a large basestation, with associated control, diagnostics and "in circuit" measurements implies a significant requirement for scalability. While devices in isolation may have impressive performance, this is only significant if it can be sustained in such a context, allowing for intraelement signalling & measurement. In this application the parallel approach has further advantages allowing close integration of control & data functions across a system.
Results of Implementation of a Node B
Figure 6: the structure of a 3G base station
The design implemented is a fully featured "production quality" design, not merely a benchmark exercise, with 3 sectors, 64 channels per cell, 30km cell radius, 2 antennas per sector and full diversity on both transmit & receive (every antenna can be connected to every channel). The primary blocks to be considered include multi-path searcher, RACH preamble detect, Rake (four fingers per antenna), TX and RX filters. This includes a high performance RACH detector, with all 16 signatures in every access slot, multi-bit resolution and support for high mobility. As well as data path & control functions, diagnostics & measurements to TS 25.215 are supported.
Current metrics suggest that the cost of a WCDMA Node B is over one hundred dollars per user channel; the design goal is to reduce that five-fold. To implement this in a conventional way might require nine large FPGAs, three significant ASICs (say 80mm² in 0.18u geometry) and 42 powerful DSPs.
A completely "flexible" architecture, with no hardwired ASIC at all, but exploiting general purpose DSPs would require nearer twenty FPGAs and one hundred and twenty DSPs. In contrast, the reconfigurable approach currently supports 64 channels with just 19 devices and a PowerPC. This required roughly half the development time, no NRE or tooling, and dissipates substantially less power.
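The device counts quoted above can be tallied as follows. This is a minimal sketch using only figures stated in the text; treating "nearer twenty" and "one hundred and twenty" as exact values is an assumption, as are the dictionary keys and function names. Device count is only a rough proxy for cost, since the parts differ widely in price.

```python
# Device counts as quoted in the text ("nearer twenty" and "one hundred and
# twenty" are treated as exact here, which is an assumption).
APPROACHES = {
    "conventional": {"FPGA": 9, "ASIC": 3, "DSP": 42},
    "fully flexible": {"FPGA": 20, "DSP": 120},
    "reconfigurable": {"array device": 19, "PowerPC": 1},
}

CURRENT_COST_PER_CHANNEL = 100.0  # dollars; the text says "over" this figure
TARGET_REDUCTION = 5              # "reduce that five-fold"


def device_totals(approaches):
    """Total number of packaged devices for each approach."""
    return {name: sum(parts.values()) for name, parts in approaches.items()}


def target_cost_per_channel(current, reduction):
    """Per-channel cost implied by the stated design goal."""
    return current / reduction


if __name__ == "__main__":
    totals = device_totals(APPROACHES)
    baseline = totals["reconfigurable"]
    for name, count in totals.items():
        print(f"{name:>15}: {count:3d} devices ({count / baseline:.1f}x the reconfigurable count)")
    target = target_cost_per_channel(CURRENT_COST_PER_CHANNEL, TARGET_REDUCTION)
    print(f"design goal: about ${target:.0f} per channel")
```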
Total Cost
To consolidate all of these different items is difficult and very situation specific. However, Figure 7 gives an illustration.
• In this case, only the pure ASIC (if there is high volume) & one of the reconfigurable systems could achieve the design goal, although another reconfigurable approach might do so in a best case analysis; none of the "traditional" approaches using FPGA and/or DSP would meet the cost requirements.
As important as the absolute cost is its variability, driven by spreads in schedule & production volume forecasts. Development time estimates reflect uncertainties in both design and test durations (eg statistical testing of DSPs). It is noteworthy that the reconfigurable approaches ensure more controlled development times & hence more predictable costs. In information theory, "surprise" is proportional to information value, but in business surprise is not viewed positively. Indeed, in financial theory, variation is explicitly proportional to risk, which then commands a risk premium.
The second element is the impact of production volumes, particularly for the ASIC approach with its high NRE & its longer design time. In the best case, with a long, stable production run across which to amortise costs, it is unmatched. But in the worst case the combination of high NRE from development and a limited volume (perhaps attributable to commercial problems following a late launch) results in cripplingly expensive unit costs.
This analysis used only quantitative development data & excluded the "qualitative" or strategic value of reconfigurability. To a degree that omission could be said to have penalised an SDR approach, but even with this omission the economic benefits of flexibility can be compelling.
Summary
There is growing interest in the use of reconfigurable baseband processing, fuelled by the tension between ever-growing complexity of communications systems and implementation pressures.
As communications systems become ever more complex, but must be developed to strict budgets & time pressures, new architectures are increasingly attractive. This is amplified by the growing desire for reconfigurability to support new standards, multiple modes, or allow algorithm updates.
This paper considered the cost drivers of a 3G basestation and analysed different architectural approaches, based on actual benchmarks. Under a range of scenarios, reconfigurable solutions were the most economic and delivered the fastest development time, with the lowest risk.
Furthermore, once reconfigurable equipment is deployed in the field, systems tuning, functional changes and standards migration (which otherwise require expensive hardware modifications) can be accommodated purely in software. This scalability, cost-effectiveness and ease of development are vital for the successful rollout of 3G services around the world.
There are many different approaches to addressing design problems, and it is naïve to claim one is optimum in all cases. However, the use of heterogeneous massively parallel devices with deterministic performance has many attractions. In particular, the ease of programming and verification across a variety of different functions has been shown to accelerate development, while the extremely high computational density reduces power and unit price.
