Calisto integrates the components of a multichannel gateway into a single-chip, low-power communications platform that uses multiprocessor parallelism supported by extensive on-chip memory systems. Its power efficiency and programmability enable high channel density and flexibility, achieving 30 times the density of prior designs within existing power and volume envelopes. Telecom providers are deploying two generations of Calisto architecture, the Broadcom BCM1500 and BCM1510, in enterprise and carrier-class communications systems.
Multichannel communications gateway
We designed the Calisto platform to meet the requirements of multiple gateway applications, including
• VoIP, voice over asynchronous transfer mode (ATM), voice over digital subscriber line (DSL), and voice over cable;
• media gateways and servers;
• trunking, 3G wireless, and cable and DSL gateways;
• remote access servers;
• IP class 4 switches;
• IP PBX systems; and
• enterprise voice and video servers.
Circuit-switched telecom networks carry each voice call on a full-duplex, 64-kbps digital signal 0 (DS0) channel using 8-bit pulse code modulation (PCM) voice samples at 8 kHz. Carrier-class gateways must handle thousands of DS0 channels within a severely limited power budget. Thus, increasing channel density by a factor of 30 requires decreasing the power dissipated per channel by a factor of 30.
We aimed to provide 60 heavy-service DS0 channels or 240 light-service DS0 channels per chip, while dissipating less than 1.5 W. Such efficiency would enable an optical-carrier 3 (OC-3) gateway, equivalent to 2,016 DS0 channels, on a single blade (printed circuit board). We expect to scale this architecture to an OC-12 blade in subsequent generations.
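The design targets above reduce to simple arithmetic, sketched below. The constants come from the text (8-bit samples at 8 kHz, 60 or 240 channels per chip, a 1.5 W ceiling, 2,016 DS0 channels per OC-3); the function names are illustrative, and the per-channel power values are upper bounds because the chip target is "less than 1.5 W".

```python
import math

# Figures quoted in the text.
DS0_BITS_PER_SAMPLE = 8
DS0_SAMPLE_RATE_HZ = 8_000
OC3_DS0_CHANNELS = 2_016
CHIP_POWER_BUDGET_W = 1.5  # upper bound per chip


def ds0_bit_rate_bps():
    """Bit rate of one direction of a DS0 channel."""
    return DS0_BITS_PER_SAMPLE * DS0_SAMPLE_RATE_HZ


def power_per_channel_mw(channels_per_chip, chip_power_w=CHIP_POWER_BUDGET_W):
    """Upper bound on power per channel, in milliwatts."""
    return chip_power_w * 1000.0 / channels_per_chip


def chips_for_oc3(channels_per_chip):
    """Minimum number of chips needed to cover an OC-3's DS0 channels."""
    return math.ceil(OC3_DS0_CHANNELS / channels_per_chip)


if __name__ == "__main__":
    print(ds0_bit_rate_bps())            # bits per second
    print(power_per_channel_mw(60))      # heavy-service bound, mW
    print(power_per_channel_mw(240))     # light-service bound, mW
    print(chips_for_oc3(240), chips_for_oc3(60))
```

At 240 light-service channels per chip, the minimum comes to nine chips for an OC-3, which leaves some headroom in a ten-chip blade.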
Gateway blade
Early VoP gateways were low density, typically handling only 24 to 48 DS0 voice channels per blade, and dissipating several watts per channel.4 They used discrete digital signal processor (DSP) "farms" to implement voice coding and modem signal-processing functions; additional DSPs or application-specific integrated circuits (ASICs) for echo cancellation; memory chips; and reduced-instruction-set computing (RISC) microprocessors for packet protocol processing and supervision. Their channel density was limited primarily by thermal power dissipation constraints. Complex multiprocessor software coordinated the components and implemented signal and packet processing.
Figure 1 shows a Calisto-based gateway blade architecture, which includes a simple array of Calisto chips, a packet switch or aggregation bridge chip, and backplane interfaces. Generally, telecom chassis backplane interconnects form a proprietary cell-switched ATM or packet-switched IP network and a circuit-switched, PCM, time-division-multiplexed (TDM) bus.
Multiservice gateway stacks
Software service stacks for communication gateways are application specific and require a real-time mix of digital signal and network protocol processing. To minimize hand-coded assembly language programming, we wanted to implement and maintain these complex service suites as compiled C software. Each DS0 channel's gateway services (voice coding, adaptive echo cancellation, and fax relay, for example) are independent of other channels and can vary dynamically and on demand during a call. Figure 2 shows an example gateway service stack for one full-duplex DS0 channel. Packets flow downward from the packet-switched network, and the gateway service stack processes them into frames of PCM samples, then outputs these frames to the circuit-switched network. Concurrently, frames of samples flow upward from the circuit-switched network, and the gateway service stack processes them into packets and outputs these packets to the packet-switched network. Selected service modules in the stack process the upward and downward data flows.
To determine Calisto platform requirements, we developed detailed requirement [...]
Each service also requires sufficient memory to maintain its channel state between frame- (or packet-) processing intervals, and enough working memory to process frames. The interframe memory required ranges from 4 to 20 Kbytes per DS0 channel. Because either the available processor cycles (in millions of cycles per second, MCPS) or the available memory can limit the platform channel density, we needed to balance the platform performance and memory capacity to use power and area efficiently.
A multichannel gateway must provide any configured service on any channel at any time. Thus, it must always be able to concurrently execute any of the service modules, which accumulates to a total active program image of 200 to 700 Kbytes. To minimize the area and power of such a large program memory, Calisto processors share one program memory.
Multiprocessor architecture
Calisto's multiprocessor architecture efficiently handles multiple channels. For each DS0 channel in a gateway application, the platform periodically executes a per-channel task that processes one frame of channel data traveling upward and one frame traveling downward. Figure 3 shows a periodic task timeline for a full-duplex DS0 channel with a 5-ms frame period, which is equivalent to 40 PCM voice samples.
In the figure, frame 1 arrives from the circuit-switched network during the first 5-ms frame period, while packet 1 arrives from the packet-switched network and waits in the jitter buffer. A task that processes upward frame 1 and downward packet 1 executes as early as possible after the ingress frame arrives to minimize latency, and schedules an output frame and an output packet at the lowest processing latency the gateway platform and software can maintain without jitter.
The frame period depends on the selected voice coder; typical periods are 5, 6, 10, 20, and 30 ms. Short frame periods are the most challenging. For example, at a 166-MHz DSP clock rate, using 11 to 40 MCPS workloads per DS0 channel, a 5-ms frame period implies 800 to 3,000 tasks per second per DSP, four to 15 tasks per frame period, and DSP task runtimes of 333 to 1,250 µs.
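The figures in this example follow from the clock rate, the per-channel workload, and the frame period. The sketch below reproduces that arithmetic; the function name and the assumption that each DSP is fully loaded with whole channels are illustrative, not taken from the Calisto software.

```python
def task_budget(clock_mhz, mcps_per_channel, frame_period_ms):
    """Return (tasks per frame period, tasks per second, task runtime in microseconds).

    Assumes one task per channel per frame period and a DSP loaded with as
    many whole channels as its clock rate allows.
    """
    tasks_per_frame = int(clock_mhz // mcps_per_channel)
    frames_per_second = 1000.0 / frame_period_ms
    tasks_per_second = tasks_per_frame * frames_per_second
    runtime_us = frame_period_ms * 1000.0 / tasks_per_frame
    return tasks_per_frame, tasks_per_second, runtime_us


if __name__ == "__main__":
    # Light (11 MCPS) and heavy (40 MCPS) workloads at 166 MHz, 5-ms frames.
    for mcps in (11, 40):
        print(mcps, task_budget(166, mcps, 5))
```

With a 5-ms frame period, the 11-MCPS case gives 15 tasks per frame period and the 40-MCPS case gives four, matching the range quoted above.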
To eliminate voice quality artifacts, gateway VoP processing must have low processing latency, and the delay at the output must have low jitter. The permitted incremental processing latency ranges from 1 to 2 ms, with output packet jitter under 250 µs, and zero TDM jitter. The challenge for a multichannel platform is to consistently achieve these figures for all channels and all services.
Because the DSP, packet, and OS work leaves little time for task-switching overhead, we designed hybrid DSP/RISC processors to process all services. This is preferable to separating DSP and packet processors, which frequently need to interact and synchronize. We can readily separate supervisory OS work from DSP and packet processing, however, so Calisto implements supervisory processors to isolate OS overhead. Table 1 summarizes how we mapped the gateway requirements and application attributes to key platform architecture features.
Programmable platform
The Calisto BCM1510 platform consists of four identical clusters and shared platform resources, as Figure 4 shows. Replicated processor and memory clusters exploit multichannel application parallelism, reduce interconnect distance and power, and simplify design and verification. Calisto central resources include an SDRAM controller, a main processor, a platform synchronization hub, and a 768-Kbyte shared memory (described later). Peripheral resources include packet and TDM I/O interfaces.
The SDRAM controller supports an optional, external 32-bit SDRAM memory chip for applications having large data sets or programs, such as remote-access modem servers. However, the large on-chip memory capacity lets most media gateway applications avoid using external SDRAM, which is critical because using an SDRAM chip nearly doubles the power and area per channel.
The main processor is an instance of the cluster processor core. It runs the same distributed OS, supervises the platform, assigns cluster workloads, and interacts with external processors via the packet interface. The OS coordinates the clusters using semaphore locks and interprocessor interrupts in the platform synchronization hub.
Calisto's multiprocessor architecture lets it execute several independent tasks in parallel, exploiting the multichannel parallelism available in telecom applications. The architecture is scalable in several dimensions: the number of clusters, cluster memory capacity, number of SpiceEngines, cluster processor and SpiceEngine performance, and clock rate. A single clock domain for all but the I/O pin logic simplifies design and performance modeling.
Cluster architecture
Figure 5 gives a detailed view of a cluster. The cluster processor, four SpiceEngine DSPs, memory bridge, and I/O bridge share the 256-Kbyte cluster memory. Cluster memory holds channel task data, I/O buffers, processor stack space, and OS data. The distributed real-time OS runs on the cluster processor and manages the cluster resources.
Processors and bridges use the same byte address space to access cluster memory data, letting the OS communicate with SpiceEngine tasks via memory. Any processor in the cluster can access channel data without wasting power moving the data. Processors and I/O engines manage chained I/O buffer structures in cluster memory.
Because per-channel tasks are independent and don't share data, the cluster memory system omits a cache coherency mechanism. The OS instead uses uncached accesses for shared data structures, and it flushes processor write buffers.
We partition cluster memory into eight 32-Kbyte memory banks interleaved on 32-bit words. This organization lets block transfers and sequential accesses such as SpiceEngine vector loads access the banks cyclically, minimizing contention.
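A minimal sketch of word interleaving follows. The bank count, bank size, and 32-bit word width come from the text; the specific mapping (low-order word-address bits select the bank) is an assumption about how such interleaving is usually done, not a documented detail of the chip.

```python
NUM_BANKS = 8            # from the text
WORD_BYTES = 4           # 32-bit interleave granularity, from the text
BANK_BYTES = 32 * 1024   # from the text


def bank_and_offset(byte_addr):
    """Map a cluster-memory byte address to (bank, word index within bank).

    Assumed mapping: consecutive 32-bit words rotate through the banks.
    """
    word = byte_addr // WORD_BYTES
    return word % NUM_BANKS, word // NUM_BANKS


def banks_touched(start_addr, n_words, stride_words=1):
    """Banks visited by n_words accesses with a fixed word stride."""
    return [
        bank_and_offset(start_addr + i * stride_words * WORD_BYTES)[0]
        for i in range(n_words)
    ]


if __name__ == "__main__":
    print(banks_touched(0, 8))                  # unit stride cycles through all banks
    print(banks_touched(0, 4, stride_words=8))  # stride of 8 words stays in one bank
```

Under this mapping, a unit-stride vector load visits each bank once every eight words, whereas a stride equal to the bank count would hit a single bank repeatedly.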
A pipelined crossbar switch connects the seven processor and bridge requesters with eight cluster memory bank arbiters, providing bandwidth up to 224 bits per cycle and allowing the processors to load or store data on every cycle with a three-cycle latency. Four banks would have sufficed, because we found a fully loaded gateway incurs cluster memory bank conflicts on fewer than 3 percent of accesses. Low-latency switch arbitration, memory access, and transit times give each SpiceEngine a four-cycle load-to-use latency without contention. Every cycle, the SpiceEngine pipelines one memory access, a scalar load/store, or one element of a vector load/store; this allows full bandwidth use of its 32-bit memory port.
The I/O bridge supports background I/O transfers without impeding the processors. A block transfer engine in the memory bridge supports background block transfers between cluster memory, shared memory, and external SDRAM. The OS uses block transfers to swap channel data between cluster memory and available shared memory or SDRAM.
The cluster processor is a Tensilica RISC core 5 with a 4-Kbyte instruction cache and a 4-Kbyte data cache, both direct mapped. The cluster processor's bus interface fills the caches from the 128-bit shared-memory port or the 32-bit cluster memory port. It runs the distributed real-time OS, supervises the cluster workload, and schedules signal- and packet-processing tasks on the SpiceEngine pool. The cluster processor handles all interrupts and offloads overhead from the SpiceEngines, freeing them to focus on signal and packet processing.
The cluster synchronization hub provides semaphore locks used by the OS to share data structures among the SpiceEngines and the cluster processor. The cluster processor similarly uses the chip synchronization hub in Figure 4 to share data with other cluster processors.
The four SpiceEngine vector signal processors 6 each have a 1-Kbyte instruction cache and a 1-Kbyte vector register file. The memory bridge cache-fill engine fills SpiceEngine instruction cache misses via the 128-bit shared-memory port.
Shared memory
To minimize chip area and power, the shared memory holds a single program image of the distributed OS and application service software. All processors fill their instruction caches from the single program memory, making Calisto more efficient than DSP farms that replicate program images. The cluster and main processors can also access and share data in the shared memory. Figure 6 shows the shared memory, the shared-memory crossbar switch, and the instruction cache-fill paths to the clusters. This organization partitions shared memory into four banks and interleaves the banks on the 128-bit bank width.
The pipelined crossbar switch provides an aggregate bandwidth of 512 bits per cycle from shared memory to the four cluster requesters, the main processor, and the I/O bus bridge.
Round-trip latency through the shared-memory switch is seven cycles. With a fully loaded gateway, the shared-memory bank usage per cycle ranges from 50 to 80 percent, depending on the application software. Heavy DSP services access shared memory infrequently because the active loops hit in the instruction caches. Lightweight services, because they execute mostly sequential code and frequently miss the small instruction cache, use memory more frequently and experience shared-memory contention with high channel counts. Because the shared-memory switch also connects to the SDRAM controller, processors can execute large program images and access data in an optional, external SDRAM.
As Figure 5 shows, a cluster's SpiceEngines share the memory bridge cache-fill engine, which transfers 192-byte cache lines from shared memory to the requesting SpiceEngine cache. The large line size amortizes the miss penalty, for an average miss cost of 0.1 to 0.2 cycles per instruction (CPI).

MARCH-APRIL 2003
Multichannel packet and TDM interfaces
The Calisto BCM1510 I/O system is explicitly multichannel: packet and TDM I/O interfaces handle up to 512 full-duplex logical I/O channels. The OS associates a logical I/O channel with a DS0 channel or with control. The I/O ports classify and scatter input data to the associated ingress ring buffer and gather output data from the associated egress buffer.
[...] interface to enable the switch or bridge chip to aggregate the communications from several Calisto chips. The OS manages logical packet I/O channel tables that associate an ingress and egress buffer in cluster memory with each channel. When an input packet arrives, the port classifies the packet header's logical channel number, using its entry in the table to write the packet to its ingress buffer and update the buffer pointer. The port assembles egress packets from chained buffer pointers, polling a head pointer in each cluster memory. Software monitors packet transfers by checking updated buffer progress markers.
The TDM I/O ports support full-duplex serial data up to 32 Mbps, or 512 TDM DS0 time slots. The ports connect to circuit-switched and ATM adaptation layer 1 (AAL1) networks. The OS manages TDM logical I/O channel tables, which associate an ingress and egress ring buffer in cluster memory with each TDM channel. When an input byte arrives, the port uses the TDM time slot entry in the table to write the byte to its input ring buffer and update the buffer pointer. A similar process handles TDM egress. Software synchronizes with TDM transfers by scheduling channel tasks with the 8-kHz DS0 sample frequency and by monitoring updated buffer progress markers.
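The TDM ingress path described above can be pictured as a table lookup followed by a ring-buffer write. The sketch below is a software model only: the class names, the dictionary-based table, and the buffer sizes are hypothetical, and the real port performs these steps in hardware against tables the OS sets up in memory.

```python
class RingBuffer:
    """Per-channel ingress ring buffer with a write pointer as progress marker."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.write_index = 0

    def put(self, byte):
        self.data[self.write_index] = byte
        self.write_index = (self.write_index + 1) % len(self.data)


class TdmIngressPort:
    """Hypothetical model: scatter each time slot's byte to its mapped buffer."""

    def __init__(self):
        self.table = {}  # time slot -> RingBuffer (stands in for the OS-managed table)

    def map_slot(self, time_slot, ring):
        self.table[time_slot] = ring

    def receive_frame(self, frame_bytes):
        # One TDM frame carries one byte per time slot (every 125 us at 8 kHz).
        for slot, byte in enumerate(frame_bytes):
            ring = self.table.get(slot)
            if ring is not None:  # unmapped slots are ignored in this model
                ring.put(byte)


if __name__ == "__main__":
    port = TdmIngressPort()
    channel = RingBuffer(size=4)
    port.map_slot(1, channel)
    port.receive_frame([10, 11, 12, 13])
    port.receive_frame([20, 21, 22, 23])
    print(list(channel.data), channel.write_index)
```

A channel task would then read the samples behind the write pointer once per frame period, which is the role the text assigns to the buffer progress markers.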
SpiceEngine vector DSPs
The Calisto OS assigns all signal- and packet-processing tasks to the pool of 16 SpiceEngines, which we designed to
• minimize power and area,
• consume few shared platform resources,
• efficiently perform VoP DSP tasks,
• execute compiled C code in real time, and
• be simple enough to implement rapidly.
The SpiceEngine vector signal processor 6 has vector registers akin to those of vector supercomputers 7 and vector microprocessors 8 to hold vector and array data. Vector registers improve performance, decouple the processor pipeline from memory latency, reduce accesses to cluster memory, and reduce power dissipation. Figure 8 is a SpiceEngine block diagram. The 192-bit instruction-fill interface requests cache lines from the shared memory via the cluster's memory bridge and fills the 1-Kbyte, six-way-associative instruction cache. The SpiceEngine also uses the fill interface to load its 128-bit configuration registers.
The SpiceEngine has a 1-Kbyte vector register file, sixteen 10-bit vector address registers, 16 scalar 32-bit registers, two 40-bit accumulator registers, and a 33-bit multiplier product register.
[...] configurable 40/32/16-bit arithmetic logic unit (ALU). The data path also includes a configurable 40/32/16-bit barrel shifter and exponent unit, a bit field insert/extract unit coupled with the shifter, and a 32-bit ALU for simple instructions. Programs can configure the multiplier, ALU, and shifter for ITU saturation arithmetic. The load/store and vector load/store units execute load and store instructions that access scalar and vector data in cluster memory. The OS uses the memory management unit (MMU) to restrict stores to selected memory segments, protecting platform data from software faults in one channel.
Instructions and configurations
To reduce code size and area, the SpiceEngine has 24-bit instructions and issues one instruction per cycle. It has a RISC-like load/store instruction architecture for both scalar and vector registers. Unlike vector instructions that operate on the contents of an entire multielement vector register, however, SpiceEngine instructions operate on individual vector elements. Programs use per-element instructions in a loop iterating over successive vector elements, thus simplifying the processor and increasing the compiler's flexibility to software-pipeline multiple operations in a loop body.
Several SpiceEngine instructions select a 128-bit configuration register that augments the 24-bit instruction with parallel instruction control fields for each execution unit, similar to very long instruction word (VLIW) fields. Such instructions select one of seven configuration registers, and can perform up to 10 parallel operations per cycle, providing wide 128-bit instruction-level parallelism while fetching narrow 24-bit instructions at low power.
To perform common vector operations, a single-instruction loop repeatedly executes one configuration register, applying it to successive vector elements on each loop iteration. For example, one configuration register can compute a vector dot product in a single-instruction software-pipelined loop that multiplies two vector elements, adds the prior product to an accumulator, advances to the next pair of vector elements, decrements a vector length counter, and exits the loop when done.
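The dot-product loop described above can be modeled in a few lines. This is a behavioral sketch only: Python executes the steps sequentially, whereas the text describes them as parallel operations of one configuration register, and the variable names are illustrative.

```python
def dot_product_pipelined(a, b):
    """Software-pipelined dot product: each iteration accumulates the
    previous product while forming the current one."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0        # accumulator register
    product = 0    # multiplier product register (prior iteration's product)
    i = 0          # stands in for the vector address registers
    remaining = len(a)  # vector length counter
    while remaining > 0:
        acc += product          # add the prior product to the accumulator
        product = a[i] * b[i]   # multiply the current pair of elements
        i += 1                  # advance to the next pair
        remaining -= 1          # decrement the length counter
    acc += product              # drain the last product after the loop exits
    return acc


if __name__ == "__main__":
    print(dot_product_pipelined([1, 2, 3], [4, 5, 6]))
```

The one-iteration lag between multiply and accumulate is what lets both units stay busy every cycle; the cost is a single drain step after the loop.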
The compiler or programmer explicitly defines and loads configuration registers from instruction memory prior to use, and hoists the load-configuration instruction above the usage point to overlap load latency with useful work. The SpiceEngine executes most instructions in a short pipeline with four stages: fetch; decode and read registers; execute; and write registers. Two additional stages handle pipelined loads from memory. The short pipeline avoids stalls on delayed branches. It does not implement interrupts because the cluster processor offloads any need for them. The few causes of pipeline interlocks or stalls (besides instruction cache misses) simplify verification, simulation, and performance modeling, but require correct instruction scheduling and the use of instruction delay slots.
Vector registers
Vector registers provide fast access to data structured as vectors, arrays, tables, and lists. The compiler or programmer uses vector load instructions to explicitly load vectors from memory into the vector registers, operates on the vector registers, and stores vector register results to memory. To make them a simpler compiler target, the vector registers are byte addressable (like memory) and allow any vector length. We found that 1,024 bytes of vector registers accommodate nearly all VoP algorithms without strip mining, which processes large arrays in strips that fit in the registers. The extra cycles needed to partition the exceptions were modest.
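For the exceptions that do not fit, strip mining splits the array into register-sized pieces. The sketch below illustrates the idea with a simple gain operation; it assumes, for simplicity, that the whole 1,024-byte register file holds a single vector of 16-bit elements, which a real program would rarely do.

```python
VECTOR_REGISTER_BYTES = 1024  # from the text


def strip_mined_scale(samples, gain, element_bytes=2,
                      register_bytes=VECTOR_REGISTER_BYTES):
    """Scale samples by gain, one register-sized strip at a time.

    Returns (scaled samples, number of strips processed).
    """
    strip_len = register_bytes // element_bytes
    out = []
    strips = 0
    for start in range(0, len(samples), strip_len):
        strip = samples[start:start + strip_len]  # models a vector load
        strip = [x * gain for x in strip]         # operate on register contents
        out.extend(strip)                         # models a vector store
        strips += 1
    return out, strips


if __name__ == "__main__":
    _, n = strip_mined_scale(list(range(40)), 2)    # one 5-ms frame of samples
    print(n)
    _, n = strip_mined_scale(list(range(1200)), 2)  # larger than the registers
    print(n)
```

A 40-sample frame fits in one strip, which is consistent with the observation that most VoP algorithms need no strip mining at all.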
To provide zero operand access time, the multiported vector registers supply the pipeline with two operands per cycle, accept one result per cycle, and support a concurrent vector load/store transfer. Four vector address units and 16 vector address registers provide zero-overhead addressing for successive vector elements with signed strides.
The compiler hoists vector load instructions above the vector register usage point, serving as a software prefetch that overlaps the load latency with useful work. The vector load/store instructions transfer multielement vectors between memory and the vector registers concurrently with instruction execution. Interlocking the use of vector elements with the completion of their load reduces start-up latency. Vector registers decouple the pipeline from memory latency and eliminate loop unrolling, which some VLIW DSP architectures use to span the load-to-use latency.
Calisto's SpiceEngine vector registers provide significant power efficiency, delivering consistent DSP performance while using fewer memory accesses and less power than conventional memory-based DSPs. Filter kernels access the same data points and filter coefficients repeatedly, so placing them in vector registers improves performance while reducing power dissipation arising from memory accesses.
For example, the vectorizing C compiler uses vector registers, vector load/store instructions, and loops executing configuration registers to generate an N-point, T-tap finite-impulse response filter code that makes O(N + T) memory accesses rather than a memory-based DSP's O(NT) accesses. It sustains a performance of one cycle per multiply-accumulate operation using minimal memory bandwidth, letting Calisto share a cluster memory among several SpiceEngines.
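The difference in memory traffic can be made concrete by counting accesses in two models of the same filter. This is an illustrative accounting, not output of the SpiceEngine compiler: the register-based model charges one access per element loaded or stored, and the memory-based model charges two operand fetches per multiply-accumulate.

```python
def fir_register_based(x, h):
    """FIR with inputs and coefficients held in vector registers.

    x holds N + T - 1 samples (history included), h holds T taps.
    Returns (N outputs, memory access count).
    """
    taps = len(h)
    n_out = len(x) - taps + 1
    regs_x, regs_h = list(x), list(h)      # one vector load for each array
    accesses = len(regs_x) + len(regs_h)
    y = []
    for n in range(n_out):
        acc = 0
        for k in range(taps):
            acc += regs_h[k] * regs_x[n + taps - 1 - k]  # operands from registers
        y.append(acc)
    accesses += n_out                      # one vector store of the outputs
    return y, accesses


def fir_memory_based(x, h):
    """Same filter, fetching both operands from memory on every MAC."""
    taps = len(h)
    n_out = len(x) - taps + 1
    accesses = 0
    y = []
    for n in range(n_out):
        acc = 0
        for k in range(taps):
            acc += h[k] * x[n + taps - 1 - k]
            accesses += 2                  # sample fetch plus coefficient fetch
        y.append(acc)
        accesses += 1                      # output store
    return y, accesses


if __name__ == "__main__":
    x = list(range(71))    # N = 40 outputs with T = 32 taps
    h = [1] * 32
    print(fir_register_based(x, h)[1], fir_memory_based(x, h)[1])
```

For a 40-point, 32-tap filter, the register-based model makes 143 accesses against 2,600 for the memory-based one, which is the O(N + T) versus O(NT) gap in concrete terms.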
Vectorizing C compiler and tools
The SpiceEngine's vectorizing C compiler and tools leverage the vector and configuration registers from compiled C applications, minimizing the need for assembler code. Table 2 lists the compiler optimizations that helped obtain high-efficiency signal and packet processing from compiled C code.
The SpiceEngine C compiler's performance is within twice the cycle count of hand-optimized assembler code on out-of-the-box DSP C code, such as the ITU global system for mobile (GSM) adaptive multirate (AMR) reference voice encoder. This performance enables the implementation of complete real-time service stacks, including DSP and packet processing, in compiled C. For application software development, cycle-accurate simulators and tools for multichannel debugging, profiling, and performance tuning augment the compiler.
Although executing primarily compiled C code, a SpiceEngine requires fewer cycles than comparable single-multiply-accumulate DSPs. SpiceEngine's instruction cache, configuration registers, and vector register architecture let multiple processors efficiently share Calisto resources, minimizing power dissipation and area.
[...] extensive memory redundancy and laser-cut fuses that enable laser repairs during manufacture to yield production volumes. Broadcom began producing the BCM1510 in the first quarter of 2002. Table 3 lists the implemented components.
Extensive tests verified the SpiceEngine processor against a cycle-accurate C++ model and a model based on the Vera verification language and tool. We used Verilog and gate-level simulation to verify the full chip design, running the OS and software stacks on all 21 processors. We thoroughly verified the physical design and obtained completely functional Calisto chips on first silicon.
Low-power implementation
We used several approaches to reduce power dissipation in the BCM1510 while delivering the target performance. Besides focusing on low-power CMOS design, we used extensive clock gating and techniques to minimize transitions on long, pipelined wires. We selected a modest 166-MHz clock rate to limit power while meeting the performance target, allow short processor pipelines, and permit rapid design using synthesized logic. The use of minimum-size logic transistors and high-threshold-voltage RAM transistors also reduced power dissipation. We designed custom blocks for some timing-critical paths and large RAMs. The custom RAMs supplied memory repair redundancy.
To improve efficiency, Calisto uses many small, simple DSPs rather than a few large VLIW DSPs. Using many small processors reduces capacitance and wire length, allowing the use of minimum-size transistors and buffers. Moreover, multiple processors support the fine-grained task parallelism required by multichannel applications. Programmers write most VoP C software to execute sequentially on a simple, single multiply-accumulate DSP, and the compiler can exploit a narrow single-multiply-accumulate data path more efficiently than a wide, multiunit, VLIW data path.
The shared program memory reduces area and power to below that used by comparable DSP farms and multicore DSP chips, both of which replicate program memory. The shared-memory crossbar switch more than earns its area and power, and the instruction caches reduce memory access power. Sharing a cluster memory among the SpiceEngine DSPs and the cluster processor lets the OS schedule the SpiceEngine DSPs as a pool without wasting DSP cycles. The SpiceEngine vector registers substantially reduce memory accesses and power, while reducing memory contention. Thus, several SpiceEngines can efficiently share each cluster memory.
The Calisto OS runs periodic per-frame tasks to completion, eliminating task preemption and context-switching overhead cycles. We balanced Calisto's performance with its memory capacity for most gateway applications; that is, available cycles and bytes equally limit channel density, wasting few watts.
System performance
We can measure the performance of a multichannel communications platform by its channel density and power efficiency under a specified service workload. With hundreds of channels contending for platform resources each frame, minimizing the switching overhead among hundreds of tasks in real time challenges the OS and architecture.
Calisto operating system
The Calisto OS manages Calisto's distributed and shared resources. It presents a single-[...] Figure 2) at runtime. It also maps each channel to a task that runs once per frame period, as Figure 3 shows. The developer need not be aware that multiple instances of a service are running. Rather, the OS allocates resources to each channel for its service stack. The OS and the SpiceEngine memory management unit restrict channel task access to their assigned channel memory, protecting other memory and rapidly isolating software faults.
Calisto OS schedules tasks on a cluster-wide basis; tasks can run on any available SpiceEngine in the cluster. Tasks and processors need no affinity because the SpiceEngines share local cluster memory. The OS uses rate-monotonic scheduling 9 to prioritize channels: the smaller the frame period, the higher the priority. It assigns an available processor to the highest-priority task. To minimize wait time, the OS staggers the start times of channels assigned to a cluster across the frame period.
CALISTO COMMUNICATIONS PLATFORM
IEEE MICRO
To minimize periodic task overhead, the OS combines up and down data processing into a single task. Each task runs to completion, eliminating context-switching overhead. To limit a misbehaving channel's impact, the OS terminates tasks that fail to complete within their expected worst-case time window.
As Figure 3 shows, the OS releases each channel's output data at a fixed time offset from its window start time, even though the channel task finishes execution earlier. These actions can thus hold output data jitter under the scheduling granularity of 125 µs.
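The scheduling policy in this section (rate-monotonic priority, staggered start times, and a 125-µs scheduling granularity) can be sketched as follows. The data structure, the tie-break by channel number, and the even-spread stagger rule are assumptions made for illustration; the text does not say how the Calisto OS chooses the offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    channel_id: int
    frame_period_us: int


def rate_monotonic_order(channels):
    """Highest priority first: the smaller the frame period, the higher the priority."""
    return sorted(channels, key=lambda c: (c.frame_period_us, c.channel_id))


def staggered_start_offsets(channels, granularity_us=125):
    """Spread channels that share a frame period evenly across that period.

    Offsets are multiples of the scheduling granularity (assumed policy).
    Returns {channel_id: start offset in microseconds}.
    """
    by_period = {}
    for c in channels:
        by_period.setdefault(c.frame_period_us, []).append(c)
    offsets = {}
    for period, group in by_period.items():
        slots = period // granularity_us
        for i, c in enumerate(group):
            offsets[c.channel_id] = (i * slots // len(group)) * granularity_us
    return offsets


if __name__ == "__main__":
    chans = [Channel(0, 20_000)] + [Channel(i, 5_000) for i in range(1, 5)]
    print([c.channel_id for c in rate_monotonic_order(chans)])
    print(staggered_start_offsets(chans))
```

Spreading start times this way means the four 5-ms channels in the example become ready 1.25 ms apart rather than all at once, so fewer tasks wait for a free SpiceEngine.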
Calisto OS is distributed across clusters. Each cluster processor executes the same OS image from shared memory using its own data and manages its local resources. Cluster processors communicate with each other using shared memory, interprocessor interrupts, and semaphore locks in the platform hub.
Calisto BCM1510 chip
The BCM1510 supports up to 240 channels of packet voice gateway services with echo cancellation, depending on the DSP and memory demands of the configured service stack. The Calisto Gateway xChange software suite 10 is a complete carrier-class VoP and media gateway solution with extensive services, including ITU bit-exact voice coders, adaptive echo cancellation, and fax relay. Table 4 shows Calisto BCM1510 breakthrough channel density and power per channel figures for several complete carrier-class Gateway xChange VoP service stacks.
By combining high-channel density and low power, a simple array of 10 Calisto BCM1510 chips can implement a 2,016-channel OC-3 packet voice/media gateway on a single carrier-class telecom blade, as Figure 10 shows. A packet switch/bridge chip and its DRAM packet buffer chips complete the gateway logic.
Multiprocessors have generally dissipated too much power and cost too much area to be effective in power-constrained embedded systems. By using appropriate parallelism and extensive integration, Calisto shows that a mul-
