Abstract: Parallel machines have the potential to satisfy the large computational demands of real-time applications. These applications require a predictable communication network, where time-constrained traffic requires bounds on throughput and latency while good average performance suffices for best-effort packets. This paper presents a new router architecture that tailors low-level routing, switching, arbitration, flow-control, and deadlock-avoidance policies to the conflicting demands of each traffic class. The router implements bandwidth regulation and deadline-based scheduling, with packet switching and table-driven multicast routing, to bound end-to-end delay and buffer requirements for time-constrained traffic, while allowing best-effort traffic to capitalize on the low-latency routing and switching schemes common in modern parallel machines. To limit the cost of servicing time-constrained traffic, the router includes a novel packet scheduler that shares link-scheduling logic across the multiple output ports, while masking the effects of clock rollover on the representation of packet eligibility times and deadlines. Using the Verilog hardware description language and the Epoch silicon compiler, we demonstrate that the router design meets the performance goals of both traffic classes in a single-chip solution. Verilog simulation experiments on a detailed timing model of the chip show how the implementation and performance properties of the packet scheduler scale over a range of architectural parameters.
I. Introduction
Real-time applications, such as avionics, industrial process control, and automated manufacturing, impose strict timing requirements on the underlying computing system. As these applications grow in size and complexity, parallel processing plays an important role in satisfying the large computational demands. Real-time parallel computing hinges on effective policies for placing and scheduling communicating tasks in the system to ensure that critical operations complete by their deadlines. Ultimately, a parallel or distributed real-time system relies on an interconnection network that can provide throughput and delay guarantees for critical communication between cooperating tasks; this communication may have diverse performance requirements, depending on the application [1]. However, instead of guaranteeing bounds on worst-case communication latency, most existing multicomputer network designs focus on providing good average network throughput and packet delay. Consequently, recent years have seen increasing interest in developing interconnection networks that provide performance guarantees in parallel machines [2-8].

The work reported in this paper was supported in part by the Na- [...]
Real-time systems employ a variety of network architectures, depending on the application domain and the performance requirements. Although prioritized bus and ring networks are commonly used in small-scale real-time systems [9], larger applications can benefit from the higher bandwidth available in multi-hop topologies. In addition, multi-hop networks often have several disjoint routes between each pair of processing nodes, improving the application's resilience to link and node failures. However, these networks complicate the effort to guarantee end-to-end performance, since the system must bound delay at each link in a packet's route. To deliver predictable communication performance in multi-hop networks, we present a novel router architecture that supports end-to-end delay and throughput guarantees by scheduling packets at each network link. Our prototype implementation is geared toward two-dimensional meshes, as shown in Figure 1; such topologies have been widely used as the interconnection network for a variety of commercial parallel machines. The design directly extends to a broad set of topologies, including the class of k-ary n-cube networks; with some changes in the routing of best-effort traffic, the proposed architecture applies to arbitrary point-to-point topologies.
Communication predictability can be improved by assigning priority to time-constrained traffic or to packets that have experienced large delays earlier in their routes [10]. Ultimately, though, bounding worst-case communication latency requires prior reservation of link and buffer resources, based on the application's anticipated traffic load. Under this traffic contract, the network can provide end-to-end performance guarantees through effective link-scheduling and buffer-allocation policies. To handle a wide range of bandwidth and delay requirements, the real-time router implements the real-time channel [11-13] abstraction for packet scheduling, as described in Section II. Conceptually, a real-time channel is a unidirectional virtual connection between two processing nodes, with a source traffic specification and an end-to-end delay bound. Separate parameters for bandwidth and delay permit the model to accommodate a wider range and larger number of connections than other service disciplines [14-16], at the expense of increased implementation complexity.
The real-time channel model guarantees end-to-end performance through a combination of bandwidth regulation and deadline-based scheduling at each link. Implementing packet scheduling in software would impose a significant burden on the processing resources at each node and would prove too slow to serve multiple high-speed links. This software would have to rank packets by deadline for each outgoing link, in addition to scheduling and executing application tasks. With high-speed links and tight timing constraints, real-time parallel machines require hardware support for communication scheduling. An efficient, low-cost solution requires a design that integrates this run-time scheduling with packet transmission. Hence, we present a chip-level router design that handles bandwidth regulation and deadline-based scheduling, while relegating non-real-time operations (such as admission control and route selection) to the network protocol software.

To communicate with another node, a processor injects a packet into its router; then, the packet traverses one or more links before reaching the reception port of the router at the destination node.
Although deadline-based scheduling bounds the worst-case latency for time-constrained traffic, real-time applications also include best-effort packets that do not have stringent performance requirements [10, 11, 15, 17]; for example, good average delay may suffice for some status and monitoring information, as well as the protocol for establishing real-time channels. Best-effort traffic should be able to capitalize on the low-latency communication techniques available in modern parallel machines without jeopardizing the performance guarantees of time-constrained packets. Section III describes how our design tailors network routing, switching, arbitration, flow-control, and deadlock-avoidance policies to the conflicting performance requirements of these two traffic classes. Time-constrained traffic employs packet switching and small, fixed-sized packets to bound worst-case performance, while best-effort packets employ wormhole switching [18] to reduce average latency and minimize buffer space requirements, even for large packets. The router implements deadlock-free, dimension-ordered routing for best-effort packets, while permitting the protocol software to select arbitrary multicast routes for the time-constrained traffic; together, flexible routing and multicast packet forwarding provide efficient group communication between cooperating real-time tasks.
Section IV describes how the network can reserve buffer and link resources in establishing time-constrained connections. In addition to managing the packet memory and connection data structures, the real-time router effectively handles the effects of clock rollover in computing scheduling keys for each packet. The router overlaps communication scheduling with packet transmission to maximize utilization of the network links. To reduce hardware complexity, the architecture shares packet buffers and sorting logic amongst the router's multiple output links, as discussed in Section V; a hybrid of serial and parallel comparison operations enables the scheduler to trade space for time to further reduce implementation complexity. Section VI describes the router implementation, using the Verilog hardware description language and the Epoch silicon compiler. The Epoch implementation demonstrates that the router can satisfy the performance goals of both traffic classes in an affordable, single-chip solution. Verilog simulation experiments on a detailed timing model of the chip show the correctness of the design and investigate the scaling properties of the packet scheduler across a range of architectural parameters. Section VII discusses related work on real-time multicomputer networks, while Section VIII concludes the paper with a summary of the research contributions and future directions.
II. Real-Time Channels
Real-time communication requires advance reservation of bandwidth and buffer resources, coupled with run-time scheduling at the network links. The real-time channel model [11] provides a useful abstraction for bounding end-to-end network delay, under certain application traffic characteristics.
Traffic parameters: A real-time channel is a unidirectional virtual connection that traverses one or more network links. In most real-time systems, application tasks exchange messages on a periodic, or nearly periodic, basis. As a result, the real-time channel model characterizes each connection by its minimum spacing between messages ($I_{\min}$ time units) and maximum message size ($S_{\max}$ bytes), resulting in a maximum transfer rate of $S_{\max}/I_{\min}$ bytes per unit time. To permit some variation from purely periodic traffic, a connection can generate a burst of up to $B_{\max}$ messages in excess of the periodic restriction $I_{\min}$. Together, these three parameters form a linear bounded arrival process [19] that governs a connection's traffic generation at the source node.
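The three traffic parameters can be illustrated with a small conformance check. The sketch below is not taken from the paper: the function names are hypothetical, times are integers in arbitrary units, and the bound itself (any group of consecutive messages may exceed the purely periodic allowance by at most $B_{\max}$ messages) is one plausible reading of the linear bounded arrival process described above.

```python
from fractions import Fraction


def max_rate(s_max_bytes, i_min):
    """Maximum transfer rate S_max / I_min, in bytes per unit time."""
    return Fraction(s_max_bytes, i_min)


def conforms(generation_times, i_min, b_max):
    """Check message generation times against the assumed arrival bound.

    Assumption (not from the paper): for any messages i <= j, the count
    j - i + 1 may not exceed 1 + floor((t_j - t_i) / i_min) + b_max,
    i.e. the periodic allowance plus a burst of b_max extra messages.
    """
    times = sorted(generation_times)
    for i in range(len(times)):
        for j in range(i, len(times)):
            allowed = 1 + (times[j] - times[i]) // i_min + b_max
            if j - i + 1 > allowed:
                return False
    return True
```

For example, purely periodic traffic with spacing `i_min` conforms with `b_max = 0`, while two messages generated at the same instant need `b_max >= 1` under this reading.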
End-to-end delay bound: In addition to these traffic parameters, a connection has a bound $D$ on end-to-end message delay, based on the minimum message spacing $I_{\min}$. At the source node, a message $m_i$ generated at time $t_i$ has a logical arrival time $\ell$ [...] Table I. If Queue 1 is empty, the link services best-effort traffic from Queue 2, ahead of any early time-constrained messages (i.e., $\ell_j(m_i) > t$). This improves the average performance of best-effort traffic without violating the delay requirements of time-constrained communication. Queue 3 holds early time-constrained traffic, effectively absorbing variations in delay at the previous node. Upon reaching its logical arrival time, a message moves from Queue 3 to Queue 1.
Link horizon parameter: By delaying the transmission of early time-constrained messages, the link scheduler can avoid overloading the buffer space at the downstream node [11, 15, 16]. Still, the scheduler could potentially improve link utilization and average latency by transmitting early messages from Queue 3 when the other two scheduling queues are empty. To balance this trade-off between buffer requirements and average performance, the link can transmit an early time-constrained message from Queue 3, as long as the message is within a small horizon $h \ge 0$ of its logical arrival time (i.e., $\ell_j(m_i) \le t + h$). Larger values of $h$ permit the link to transmit more early time-constrained traffic, at the expense of increased memory requirements at the downstream node. Although each connection could conceivably have its own $h$ value, employing a single horizon parameter allows the link to transmit early traffic directly from the head of Queue 3, without any per-connection data structures.
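The interaction between the three queues and the horizon can be sketched in software. This is an illustrative model only, not the router's hardware: the class and method names are hypothetical, Queue 1 is assumed to be ordered by deadline and Queue 3 by logical arrival time, best-effort traffic is treated as a simple FIFO, and clock rollover is ignored.

```python
import heapq
from collections import deque


class LinkScheduler:
    """Software model of one outgoing link with a horizon parameter."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.on_time = []            # Queue 1: (deadline, seq, packet)
        self.best_effort = deque()   # Queue 2: FIFO of best-effort packets
        self.early = []              # Queue 3: (arrival, seq, deadline, packet)
        self._seq = 0                # tie-breaker so packets are never compared

    def add_time_constrained(self, packet, logical_arrival, deadline, now):
        self._seq += 1
        if logical_arrival <= now:
            heapq.heappush(self.on_time, (deadline, self._seq, packet))
        else:
            heapq.heappush(
                self.early, (logical_arrival, self._seq, deadline, packet))

    def add_best_effort(self, packet):
        self.best_effort.append(packet)

    def select(self, now):
        """Pick the next packet to transmit at time `now`, or None."""
        # Move packets that have reached their logical arrival time
        # from Queue 3 to Queue 1.
        while self.early and self.early[0][0] <= now:
            _, seq, deadline, packet = heapq.heappop(self.early)
            heapq.heappush(self.on_time, (deadline, seq, packet))
        if self.on_time:
            return heapq.heappop(self.on_time)[2]
        if self.best_effort:
            return self.best_effort.popleft()
        # Early traffic may go out only if it lies within the horizon.
        if self.early and self.early[0][0] <= now + self.horizon:
            return heapq.heappop(self.early)[3]
        return None
```

With `horizon = 0` the model never transmits a packet before its logical arrival time; larger values let the head of Queue 3 go out early when the link would otherwise sit idle.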
Buffer requirements: To avoid buffer overflow or message loss, a connection must reserve sufficient memory for storing traffic at each node in its route. The required buffer space at node $j$ depends on the connection's local delay bound $d_j$, as well as the horizon parameter $h_{j-1}$ for the incoming link. In particular, node $j$ can receive a mes- [...] $I_{\min}$ messages from this connection at the same time. By reserving buffer and bandwidth resources in advance, the real-time channel model guarantees that every message arrives at its destination node by its deadline, independent of other best-effort and time-constrained traffic in the network.
III. Mixing Best-Effort and Time-Constrained Traffic
Although the real-time channel model bounds the worst-case performance of time-constrained messages, the scheduling model in Table I can impose undue restrictions on the packet size and flow-control schemes for best-effort traffic. To overcome these limitations, we propose a router architecture that tailors its low-level communication policies to the unique demands of the two traffic classes. Fine-grain, priority-based arbitration at the network links permits the best-effort traffic to capitalize on the low-latency techniques in modern multicomputer networks without sacrificing the performance guarantees of the time-constrained connections. Figure 2 shows the high-level architecture of the real-time router, with separate control and data paths for the two traffic classes. The router includes a packet memory, connection routing table, and scheduling logic to support delay and bandwidth guarantees for time-constrained traffic. To connect to the local processor, the router exports a control interface, a reception port, and separate injection ports for each traffic class.
A. Complementary Switching Schemes
To ensure that time-constrained connections meet their delay requirements, the router must have control over bandwidth and memory allocation. For example, suppose that a time-constrained message arrives with a tight deadline (i.e., $\ell(m_i) + d - t$ is small), while the outgoing link is busy transmitting other traffic. To satisfy this tight timing requirement, the outgoing link must stop servicing any lower-priority messages within a small, bounded amount of time. This introduces a direct relationship between connection admissibility and the maximum packet size of the time-constrained and best-effort traffic sharing the link. In most real-time systems, time-constrained communication consists of 10-20 byte exchanges of command or status information [9]. Consequently, the real-time router restricts time-constrained traffic to small, fixed-size packets that can support a distributed memory read or write operation. This bounds link access latency and buffering delay while simplifying memory allocation in the router.
To ensure predictable consumption of link and buffer resources, time-constrained traffic employs store-and-forward packet switching. By buffering packets at each node, packet switching allows each router to independently schedule packet transmissions to satisfy per-hop delay requirements. To improve average performance, the time-constrained traffic could conceivably employ virtual cut-through switching [22] to allow an incoming packet to proceed directly to an idle outgoing link. However, in contrast to traditional virtual cut-through switching of best-effort traffic, the real-time router cannot forward a time-constrained packet without first assessing its logical arrival time (to ensure that the downstream router has sufficient buffer space for the packet) and computing the packet deadline (which serves as the logical arrival time at the downstream router). To avoid this extra complexity and overhead, the initial design of the real-time router implements store-and-forward packet switching, which has the same worst-case performance guarantees as virtual cut-through switching. A future implementation could employ virtual cut-through switching to reduce the average latency of the time-constrained traffic.
Although packet switching delivers good, predictable performance to small, time-constrained packets, this approach would significantly degrade the average latency of long, best-effort packets. Even in a lightly-loaded network, end-to-end latency under packet switching is proportional to the product of packet size and the length of the route. Instead, the best-effort traffic can employ wormhole switching [18] for lower latency and reduced buffer space requirements. Similar to virtual cut-through switching, wormhole switching permits an arriving packet to proceed directly to the next node in its route. However, when the outgoing link is not available, the packet stalls in the network instead of buffering entirely within the router.
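The difference between the two switching schemes can be seen in a first-order latency estimate. The functions below are a textbook-style approximation under stated assumptions, not a model of this router: the network is idle, each link moves one byte per `byte_time`, router-internal delays are ignored, and a wormhole packet is delayed by one flit per intermediate router. The function names and the default five-byte flit are illustrative choices.

```python
def store_and_forward_latency(packet_bytes, hops, byte_time=1):
    """Each of the `hops` links transmits the whole packet in turn."""
    return hops * packet_bytes * byte_time


def wormhole_latency(packet_bytes, hops, flit_bytes=5, byte_time=1):
    """The packet is pipelined across links; each intermediate router
    adds roughly one flit of delay before forwarding begins."""
    return (packet_bytes + (hops - 1) * flit_bytes) * byte_time
```

For a 100-byte packet over four links, this estimate gives 400 byte times for store-and-forward switching against 115 for wormhole switching, which is why the product of packet size and route length matters only for the former.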
In effect, wormhole switching converts the best-effort scheduling "queue" in Table I into a logical queue that spans multiple nodes. The router simply includes small five-byte flit (flow control unit) buffers [23] to hold a few bytes of a packet from each input link. When an incoming packet fills these buffers, inter-node flow control halts further transmission from the previous node until more space is available; once the five-byte chunk proceeds to a buffer at the outgoing link, the router transmits an acknowledgment bit to signal the upstream router to start sending the next flit. This fine-grain, per-hop flow control permits best-effort traffic to use large variable-sized packets, reducing or even avoiding packetization overheads, without increasing buffer complexity in the router. The combination of wormhole and packet switching, with best-effort traffic consuming small flit buffers and time-constrained connections reserving packet buffers, results in an effective partitioning of router resources.
B. Separate Logical Resources
Even though wormhole and packet switching exercise complementary buffer resources, best-effort and time-constrained traffic still share access to the same network links. To provide tight delay guarantees for time-constrained connections, the router must bound the time that the variable-sized, wormhole packets can stall the forward progress of on-time, time-constrained traffic. However, a blocked wormhole packet can hold link resources at a chain of consecutive routers in the network, indirectly delaying the advancement of other traffic that does not even use the same links. This complicates the effort to provision the network to bound worst-case end-to-end latency, as discussed in the treatment of related work in Section VII. In order to control the interaction between the two traffic classes, the real-time router divides each link into two virtual channels [23]. A single bit on each link differentiates between time-constrained and best-effort packets, as shown in Figure 3; each link also includes an acknowledgment bit for flow control on the best-effort virtual channel.
Each wormhole virtual channel performs round-robin arbitration on the input links to select an incoming best-effort packet for service, while the packet-switched virtual channel transmits time-constrained packets based on their deadlines and logical arrival times. Priority arbitration between the two virtual channels tightly regulates the intrusion of best-effort traffic on time-constrained packets on each outgoing link. This effectively provides flit-level preemption of best-effort traffic whenever an on-time time-constrained packet awaits service, while permitting wormhole flits to consume any excess link bandwidth. In a separate simulation study, we have demonstrated the effectiveness of using flit-level priority arbitration policies to mix best-effort wormhole traffic and time-constrained packet-switched traffic [24-26].
While the real-time router gives preferential treatment to time-constrained traffic, the outgoing links transmit best-effort flits ahead of any early time-constrained packets, consistent with the policies in Table I. Although this arbitration mechanism ensures effective scheduling of the traffic on the outgoing links and the reception port, the best-effort and time-constrained packets could still contend for resources at the injection port at the source node. The local processor could solve this problem by negotiating between best-effort and time-constrained traffic at the injection port, but this would require the processor to perform flit-level arbitration. Instead, the real-time router includes a dedicated injection port for each traffic class. The two injection ports, coupled with the low-level arbitration on the outgoing links, ensure that time-constrained traffic has fine-grain preemption over the best-effort packets across the entire path through the network, while allowing best-effort packets to capitalize on any remaining link bandwidth.
C. Buffering and Packet Forwarding
To support the multiple incoming and outgoing ports, the real-time router design requires high throughput for receiving, storing, and transmitting packets. Internally, the router isolates the best-effort and time-constrained traffic on separate buses to increase the throughput and reduce the complexity of the arbitration logic. Each incoming and outgoing port includes nominal buffer space to avoid stalling the flow of data while waiting for access to the bus. The best-effort bus is one flit wide and performs round-robin arbitration among the flit buffers at the incoming ports. Running at the same speed as the byte-wide input ports, this five-byte bus has sufficient throughput to accommodate a peak load of best-effort traffic. Transferring best-effort packets in five-byte chunks incurs a small initial transmission delay at each router, which could be reduced by using a crossbar switch; however, we employ a shared bus for the sake of simplicity. Other recent multicomputer router architectures have used a wide bus for flit transfer [27, 28].
The structure and placement of packet buffers plays a large role in the router's ability to accommodate the performance requirements of time-constrained connections. The simplest solution places a separate queue at each input link. However, input queuing has throughput limitations [29], since a packet may have to wait behind other traffic destined for a different outgoing link. In addition, queuing packets at the incoming links complicates the effort to schedule outgoing traffic based on delay and throughput requirements. Instead, the real-time router queues time-constrained packets at the output ports; the router shares a single packet memory among the multiple output ports to maximize the network's ability to accommodate time-constrained connections with diverse buffer requirements. To accommodate the aggregate memory bandwidth of the five input and five output ports, the router stores packets in 10-byte chunks, with demand-driven round-robin arbitration amongst the ports.
Since time-constrained traffic is not served in a first-in first-out order, the real-time router must have a data structure that records the idle memory locations in the packet buffer. Similar to many shared-memory switches in high-speed networks, the real-time router maintains an idle-address pool [29], implemented as a stack. This stack consists of a small memory, which stores the address of each free location in the packet buffer, and a pointer to the first entry. Initially, the stack includes the address of each location in the packet memory. An incoming packet retrieves an address from the top of the stack and increments the stack pointer to point to the next available entry. Upon packet departure, the router decrements this pointer and returns the free location to the top of stack. The idle-address stack always has at least one free address when a new packet arrives, since the real-time channel model never permits the time-constrained traffic to overallocate the buffer resources.
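The idle-address stack described above maps directly onto a small array and a pointer. The following is a minimal software sketch with hypothetical names; the explicit error checks are an addition for illustration, since the text notes that admission control keeps the hardware stack from ever running empty.

```python
class IdleAddressPool:
    """Stack of free packet-buffer addresses with a top-of-stack pointer."""

    def __init__(self, num_slots):
        # Initially every location in the packet memory is idle.
        self.stack = list(range(num_slots))
        self.top = 0  # index of the first free entry

    def allocate(self):
        """Take the address at the top of the stack for an arriving packet."""
        if self.top == len(self.stack):
            raise RuntimeError("no idle packet buffer available")
        address = self.stack[self.top]
        self.top += 1  # point to the next available entry
        return address

    def release(self, address):
        """Return a departing packet's address to the top of the stack."""
        if self.top == 0:
            raise RuntimeError("pool is already full")
        self.top -= 1
        self.stack[self.top] = address
```

Both operations touch a single stack entry, so allocation and release take constant time regardless of the order in which packets leave the router.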
D. Routing and Deadlock-Avoidance
Although wormhole switching reduces the buffer requirements and average latency for best-effort traffic, the low-level inter-node flow control could potentially introduce cyclic dependencies between stalled best-effort packets. To avoid these cycles, the real-time router implements dimension-ordered routing, a shortest-path scheme that completely routes a packet in the x-direction before proceeding in the y-direction to the destination, as shown by the shaded nodes in Figure 1. Dimension-ordered routing avoids packet deadlock in a square mesh [30] and also facilitates an efficient implementation based on x and y offsets in the packet header, as shown in Figure 4(a); the offsets reach zero when the packet has arrived at its destination node. To improve the performance of best-effort traffic, an enhanced version of the router could support adaptive wormhole routing and additional virtual channels, at the expense of increased implementation complexity [31, 32]. In particular, non-minimal adaptive routing would enable best-effort packets to circumvent links with a heavy load of time-constrained traffic.
Although routing is closely tied with deadlock-avoidance for best-effort packets, the real-time router need not dictate a particular routing scheme for the time-constrained traffic. Instead, each time-constrained connection has a fixed path through the network, based on a table in each router; this table is indexed by the connection identifier field in the header of each time-constrained packet, as shown in Figure 4(b). As part of establishing a real-time channel, the network protocol software can select a fixed path from the source to the destination(s), based on the available bandwidth and buffer resources at the routers. The protocol software can employ a variety of algorithms for selecting unicast and multicast routes based on the resources available in the network [33]. Once the connection establishment protocol reserves buffer and bandwidth resources for a real-time channel, the combination of bandwidth regulation and packet scheduling prevents packet deadlock for time-constrained traffic. Table II summarizes how the real-time router employs these and other policies to accommodate the conflicting performance requirements of the two traffic classes.
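Offset-based dimension-ordered routing can be sketched as a per-hop decision. The sketch assumes signed integer offsets and hypothetical port names; the actual header encoding of Figure 4(a) is not reproduced here.

```python
def next_hop(x_offset, y_offset):
    """Return (output port, updated x offset, updated y offset).

    The x offset is driven to zero before the y offset is touched.
    """
    if x_offset > 0:
        return "+x", x_offset - 1, y_offset
    if x_offset < 0:
        return "-x", x_offset + 1, y_offset
    if y_offset > 0:
        return "+y", x_offset, y_offset - 1
    if y_offset < 0:
        return "-y", x_offset, y_offset + 1
    # Both offsets are zero: the packet has reached its destination.
    return "reception", 0, 0


def route(x_offset, y_offset):
    """List the ports a packet takes from source to destination."""
    ports = []
    while True:
        port, x_offset, y_offset = next_hop(x_offset, y_offset)
        ports.append(port)
        if port == "reception":
            return ports
```

Because a packet never turns from the y-direction back into the x-direction, the routes produced this way cannot form the cyclic link dependencies that lead to deadlock.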
IV. Managing Time-Constrained Connections
A real-time multicomputer network must have effective mechanisms for establishing connections and scheduling packets, based on the delay and throughput requirements of the time-constrained traffic. To permit a single-chip implementation, the real-time router offloads non-real-time operations, such as route selection and admission control, to the network protocol software. At run-time, the router coordinates access to buffer and link resources by managing the packet memory and the connection data structures. In addition, the router architecture introduces efficient techniques for bounding the range of logical arrival times and deadlines, to limit scheduler delay and implementation complexity.
A. Route Selection and Admission Control
Establishing a real-time channel requires the application to specify the traffic parameters and performance requirements for the new connection. Admitting a new connection, and selecting a multi-hop route with suitable local delay parameters, is a computationally-intensive procedure [10, 11, 20]. Fortunately, channel establishment typically does not impose tight timing constraints, in contrast to the actual data transfer which requires explicit guarantees on minimum throughput and worst-case delay. In fact, in most cases, the network can establish the required time-constrained connections before the application commences. To permit a single-chip solution, the real-time router relegates these non-real-time operations to the protocol software. The network could select routes and admit new connections through a centralized server or a distributed protocol. In either case, this protocol software can use the best-effort virtual network, or even a set of dedicated time-constrained connections, to exchange information to select a route and provision resources for each new connection.
The route selected for a connection depends on the traffic characteristics and performance requirements, as well as the available buffer and bandwidth resources in the network. As part of establishing a new real-time channel, the protocol software assigns a unique connection identifier at each hop in the route. Then, each node in the route writes control information into the router's connection table, as shown in Table III. At run-time, this table is indexed by the connection identifier field of each incoming time-constrained packet, as shown in Figure 4(b). To minimize the number of pins on the router chip, the controlling processor updates this table as a sequence of four one-byte operations that specify the incoming connection identifier and the three fields in the table. After closing a connection, the network protocol software can reuse the connection identifier by overwriting the entry in the routing table. The processor uses the same control interface to set the horizon parameters $h$ for each of the five outgoing ports. As shown in Table III, the routing table stores the connection's identifier at the next node, the local delay bound $d$, and a bit mask for directing traffic to the appropriate outgoing port(s). When a packet arrives, the router indexes the table with the incoming connection identifier and replaces the header field with the new identifier for the downstream router. At the same time, the router computes the packet's deadline from the logical arrival time in the packet header and the local delay bound in the connection table. Finally, the bit mask permits the router to forward an incoming packet to multiple outgoing ports, allowing the network protocol software to establish multicast real-time channels. This facilitates efficient, timely communication between a set of cooperating nodes. To simplify the design, the real-time router requires a multicast connection to use the same value of $d$ for each of its outgoing ports at a single node.
Then, based on the bit mask in the routing table, the router queues the updated packet for transmission on the appropriate outgoing port(s).
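The table lookup performed for each arriving time-constrained packet can be sketched as follows. The field names, the eight-bit clock width, and the port numbering are illustrative assumptions, not values taken from Table III.

```python
from dataclasses import dataclass


@dataclass
class ConnectionEntry:
    next_id: int      # connection identifier at the downstream router
    delay_bound: int  # local delay bound d, in clock ticks
    port_mask: int    # bit p set means: forward on outgoing port p


def forward(table, conn_id, logical_arrival, clock_bits=8, num_ports=5):
    """Return (new identifier, deadline, list of outgoing ports).

    The deadline is the logical arrival time plus the local delay bound,
    reduced modulo the clock range (clock_bits is a hypothetical width).
    """
    entry = table[conn_id]
    deadline = (logical_arrival + entry.delay_bound) % (1 << clock_bits)
    ports = [p for p in range(num_ports) if entry.port_mask & (1 << p)]
    return entry.next_id, deadline, ports
```

A mask with several bits set yields a multicast forward from a single table entry, with one shared delay bound for all selected ports, matching the simplification described above.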
By implementing a shared packet memory, the real-time router can store a single copy of each multicast packet, removing the packet only after it has been transmitted by each output port selected in the bit mask. The shared packet memory also permits the network protocol software to employ a wide variety of buffer allocation policies. On the one extreme, the route selection and admission control protocols could allocate packet buffers to any new connection, independent of its outgoing link. However, this could allow a single link to consume the bulk of the memory locations, reducing the chance of establishing time-constrained connections on the other outgoing links. Instead, the admission control protocol should bound the amount of buffer space available to each of the five outgoing ports. Similarly, the network could limit the size of the link horizon parameters $h$ to reduce the amount of memory required by each connection. In particular, at run-time, a higher-level protocol could reduce the $h$ values of a router's incoming links when the node does not have sufficient buffer space to admit new connections.
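One way to keep a single copy of a multicast packet is to track, per buffer address, which output ports have yet to transmit it. The sketch below is an assumption about how such bookkeeping could look in software; the class and method names are hypothetical and the paper does not describe the hardware mechanism at this level.

```python
class SharedPacketMemory:
    """Tracks which output ports still need to send each stored packet."""

    def __init__(self):
        self.pending = {}  # buffer address -> mask of ports yet to transmit

    def store(self, address, port_mask):
        self.pending[address] = port_mask

    def transmitted(self, address, port):
        """Record a transmission; return True once the slot can be freed."""
        remaining = self.pending[address] & ~(1 << port)
        if remaining == 0:
            del self.pending[address]
            return True
        self.pending[address] = remaining
        return False
```

When `transmitted` returns True, the address can be handed back to the idle-address pool, so a multicast packet occupies exactly one buffer slot for its whole stay in the router.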
B. Handling a Clock with Finite Range
The packet deadline at one node serves as the logical arrival time at the downstream node in the route. Carrying these logical arrival times in the packet header, as shown in Figure 4(b), implicitly assumes that the network routers have a common notion of time, within some bounded clock skew. Although this is not appropriate in a wide-area network context, the tight coupling in parallel machines minimizes the effects of clock skew. Alternatively, the router could store additional information in the connection table to compute $\ell_j(m_i)$ from a packet's actual arrival time and the logical arrival time of the connection's previous packet [34]; however, this approach would require the router to periodically refresh this connection state to correctly handle the effects of clock rollover. Instead, the real-time router avoids this overhead by capitalizing on the tight coupling between nodes to assume synchronized clocks.
Even with synchronized clocks, the real-time router cannot completely ignore the effects of clock rollover. To schedule time-constrained traffic, the router architecture includes a real-time clock, implemented as a counter that increments once per packet transmission time. For a practical implementation, the router must limit the number of bits $b$ used to represent the logical arrival times and deadlines of time-constrained packets. Since logical arrival times continually increase, the design must use modulo arithmetic to compute packet deadlines and schedule traffic for transmission. As a result, the network must restrict the logical arrival times that can exist in a router at the same time; otherwise, the router cannot correctly distinguish between different packets awaiting access to the outgoing link.
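A common way to compare times held in a $b$-bit counter is to measure each value relative to the current clock. The sketch below illustrates that general technique and is not necessarily the scheme the router uses; it assumes that every logical arrival time and deadline present in the router lies within half the clock range of the current time, and the function names are hypothetical.

```python
def signed_offset(value, now, bits):
    """Distance from `now` to `value` on a `bits`-bit clock.

    Assumes the true distance lies in [-2**(bits-1), 2**(bits-1)).
    """
    modulus = 1 << bits
    half = modulus >> 1
    diff = (value - now) % modulus
    return diff - modulus if diff >= half else diff


def is_early(logical_arrival, now, bits):
    """True if the packet has not yet reached its logical arrival time."""
    return signed_offset(logical_arrival, now, bits) > 0


def earlier_deadline(a, b, now, bits):
    """Return whichever of two wrapped deadlines comes first."""
    if signed_offset(a, now, bits) <= signed_offset(b, now, bits):
        return a
    return b
```

With an eight-bit clock at time 250, a deadline stored as 3 is correctly treated as later than one stored as 254, even though 3 is numerically smaller. The comparison breaks down if live values span more than half the clock range, which is one way to see why the range of times in a router must be restricted.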
Selecting a value for b introduces a fundamental trade-off between connection admissibility and scheduler complexity. To select a packet for transmission, the scheduler must compare the deadlines and logical arrival times of the time-constrained packets; for example, the data structures in Table I

To satisfy connection delay, throughput, and buffer requirements, each outgoing port must schedule time-constrained packets based on their logical arrival times and deadlines, as well as the horizon parameter. The real-time router reduces implementation complexity by sharing a single scheduler amongst the early and on-time traffic on each of the five output ports. Extensions to the scheduler architecture further reduce the implementation cost by trading space for time.
A. Integrating Early and On-Time Packets
To maximize link utilization and channel admissibility, each outgoing port should overlap packet scheduling operations with packet transmission. As a result, packet size determines the acceptable worst-case scheduling delay. Scheduling time-constrained traffic, based on delay or throughput parameters, typically requires a priority queue to rank the outgoing packets. Priority queue architectures introduce considerable hardware complexity [35-39], particularly when the link must handle a wide range of packet priorities or deadlines. For example, most high-speed solutions require O(n) hardware complexity to rank n packets, using a systolic array or shift register consisting of n comparators [35, 40, 41]. Additional technical challenges arise in trying to integrate packet scheduling with bandwidth regulation [42], since the link cannot transmit a packet unless it has reached its logical arrival time.
To perform bandwidth regulation and deadline-based scheduling, the real-time router could include two priority queues for each of its five outgoing ports, as suggested by Table I. However, this approach would be extremely expensive and would require additional logic to transfer packets from the "early" queue to the "on-time" queue; this is particularly complicated when multiple packets reach their eligibility times simultaneously. In the worst case, an outgoing port could have to dequeue a packet from Queue 1 or Queue 3, enqueue several arriving packets to Queue 1 and/or Queue 3, and move a large number of packets from Queue 3 to Queue 1, all during a single packet transmission time. To avoid this complexity, the real-time router does not attempt to store the time-constrained packets in sorted order. Instead, the router selects the packet with the smallest key via a comparator tree, as shown in Figure 6. Like the systolic and shift register approaches, the tree architecture introduces O(n) hardware complexity. For the moderate size of n in a single-chip router, the comparator tree can overlap the O(lg n) stages of delay with packet transmission.
To avoid this excessive complexity, the real-time router integrates early and on-time packets into a single data structure. Each link schedules time-constrained packets based on sorting keys, as shown in Figure 7, where smaller keys have higher priority. A single bit differentiates between early and on-time packets. For on-time traffic, the lower bits of the key represent packet laxity, the time remaining until the local deadline expires, whereas the key for early traffic represents the time left before reaching the packet's logical arrival time. The packet keys are normalized, relative to current time t, to allow the scheduler to perform simple, unsigned comparison operations, even in the presence of clock rollover. Each scheduling operation proceeds independently to locate the packet with the minimum sorting key, permitting dynamic changes in the values of keys. The base of the tree computes a key for each packet, based on the packet state and the current time t, as shown in the right side of Figure 6; the base of the tree stores per-packet state information, whereas the packet memory stores the actual packet contents.
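The key normalization described above can be illustrated with a minimal Python sketch. Several details here are assumptions rather than facts from the text: the clock width `B = 8`, the choice of the top key bit as the early/on-time flag (0 for on-time, so that on-time packets always win), and the rule that a packet counts as on-time when its logical arrival time lies less than half the clock range in the past. The function name `sorting_key` is hypothetical.

```python
B = 8                 # assumed clock width in bits (illustrative only)
MOD = 1 << B          # the clock counts modulo 2**B
HALF = MOD >> 1       # assumed window for telling "past" from "future"

def sorting_key(arrival, deadline, now):
    """Return an unsigned key; smaller keys have higher priority.

    arrival, deadline, and now are B-bit clock values, so all
    differences are taken modulo 2**B to mask clock rollover.
    """
    since_arrival = (now - arrival) % MOD
    if since_arrival < HALF:
        # On-time packet: flag bit 0, low bits hold the laxity.
        laxity = (deadline - now) % MOD
        return laxity
    # Early packet: flag bit 1, low bits hold the time until arrival.
    until_arrival = (arrival - now) % MOD
    return MOD | until_arrival

# Clock is about to roll over (now = 250 on an 8-bit clock).
now = 250
packets = {
    "A": (248, 4),    # on-time, deadline wrapped past zero (260 mod 256)
    "B": (240, 252),  # on-time, deadline before the rollover
    "C": (3, 20),     # early, logical arrival wrapped (259 mod 256)
}
keys = {name: sorting_key(a, d, now) for name, (a, d) in packets.items()}
winner = min(keys, key=keys.get)
print(keys, winner)
```

Because every key is relative to the current time, a plain unsigned minimum picks the packet with the least laxity even when some deadlines have wrapped around zero and others have not.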
B. Sharing the Scheduler Across Output Ports
By using a comparator tree, instead of trying to store the packets in sorted order, the router can allow all five outgoing ports to share access to this scheduling logic, since the tree itself does not store the packet keys. As shown in Figure 6, each leaf in the tree stores a logical arrival time $\ell(m)$, a deadline $\ell(m) + d$, and a bit mask of outgoing ports, assigned at packet arrival based on the connection state. The bit mask determines if the leaf is eligible to compete for access to a particular outgoing port. When a port transmits a selected packet, it clears the corresponding field in the leaf's bit mask; a bit mask of zero indicates an empty packet leaf slot and a corresponding idle slot in the packet memory. The base of the tree also determines if packets are early ($\ell(m) > t$) or on-time ($\ell(m) \le t$) and computes the sorting keys based on the current value of t. At the top of the sorting tree, an additional comparator checks to see if the winner is an early packet that falls within the port's horizon parameter; if so, the port transmits this packet, unless best-effort flits await service.
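The per-port selection described above can be sketched in software as follows. This is a behavioral model only, not the hardware design: it uses unbounded integer times (no clock rollover), replaces the comparator tree with Python's `min`, and treats an early packet as within the horizon when $\ell(m) - t \le h$. The names `Leaf`, `key`, and `schedule_port` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    arrival: int    # logical arrival time l(m)
    deadline: int   # l(m) + d
    port_mask: int  # bit i set: packet still awaits transmission on port i

def key(leaf, now):
    """(early flag, time left); tuples compare like a flag bit plus low bits."""
    if leaf.arrival > now:
        return (1, leaf.arrival - now)   # early: time until logical arrival
    return (0, leaf.deadline - now)      # on-time: laxity

def schedule_port(leaves, port, now, horizon, best_effort_waiting=False):
    """Pick the leaf index to transmit on `port`, or None to yield the link."""
    bit = 1 << port
    candidates = [i for i, leaf in enumerate(leaves) if leaf.port_mask & bit]
    if not candidates:
        return None
    winner = min(candidates, key=lambda i: key(leaves[i], now))
    early, time_left = key(leaves[winner], now)
    # Early winners go out only inside the horizon and only if no
    # best-effort flits are waiting for the link.
    if early and (time_left > horizon or best_effort_waiting):
        return None
    leaves[winner].port_mask &= ~bit     # this port has now served the packet
    return winner

leaves = [
    Leaf(arrival=0, deadline=10, port_mask=0b00101),   # multicast: ports 0, 2
    Leaf(arrival=0, deadline=5, port_mask=0b00001),    # port 0 only
    Leaf(arrival=12, deadline=20, port_mask=0b00100),  # early packet, port 2
]
order = [
    schedule_port(leaves, port=0, now=2, horizon=0),
    schedule_port(leaves, port=0, now=2, horizon=0),
    schedule_port(leaves, port=2, now=2, horizon=0),
    schedule_port(leaves, port=2, now=2, horizon=0),
    schedule_port(leaves, port=2, now=2, horizon=16),
]
print(order)
```

In this model the multicast packet keeps its slot until both of its ports have cleared their mask bits, which mirrors how a zero bit mask marks a free leaf and a free packet-memory slot.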
Still, to share the comparator logic, the scheduler must operate quickly enough to overlap run-time scheduling with packet transmission on each of the outgoing ports. Consequently, the real-time router pipelines access to the comparator tree. With p stages of pipelining, the scheduler has a row of latches at p - 1 levels in the tree, to store the sorting key and buffer location for the winning packet in the subtrees. Every few cycles, another link begins its scheduling operation at the base of the tree. Similarly, every few cycles, another link completes a scheduling operation and can initiate a packet transmission. As a result, the router staggers packet departures on the five outgoing ports. The necessary amount of pipelining depends on the latency of the comparator tree, relative to the packet transmission delay.
C. Balancing Hardware Complexity and Scheduler Latency
The pipelined comparator tree has relatively low hardware cost, compared to alternate approaches that implement separate priority queues for the early and on-time packets on each outgoing port. However, as shown in Section VI, the scheduler logic is still the main source of complexity in the real-time router architecture. To handle n packets, the scheduler in Figure 6 has a total of 2 + lg n stages of logic, including the operations at the base of the tree as well as the comparator for the horizon parameter. In terms of implementation cost, the tree requires n comparators and n leaf nodes, for a total of 2n elements of similar complexity. As n grows, the number of leaf nodes can have a significant influence on the bus loading at the base of the tree. Fortunately, for certain values of n, the comparator tree has low enough latency to avoid the need to fully pipeline the scheduling logic. This suggests that the scheduler could reduce the number of comparators by trading space for time.
Under this approach, the scheduler combines several leaf units into a single module with a small memory (e.g., a register file) to store the deadlines and logical arrival times for k packets, as shown in Figure 8. At the base of the tree, each of the n/k modules can sequentially compare its k sorting keys, using a single comparator, to select the packet with the minimum key; this incurs k stages of delay. Then, a smaller comparator tree finds the smallest key amongst n/k packets. As a result, the scheduler incurs (k + 1) + lg(n/k) stages of delay. Note that, for k = 1, the architecture reduces to the comparator tree in Figure 6, with its 2 + lg n stages of logic. For larger values of k, the scheduler has larger arbitration delay but reduced implementation complexity. The architecture in Figure 8 has 2n/k comparators, as well as a lighter bus loading of n/k elements at the base of the tree. In addition, larger values of k allow the base of the tree to consist of n/k k-element register files, instead of n individual registers, with a reduction in chip complexity. With a careful selection of n and k, the real-time router can have an efficient, single-chip implementation that performs bandwidth regulation and deadline-based scheduling on multiple outgoing ports.
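As a worked illustration of this space-time trade-off, the following Python sketch evaluates the (k + 1) + lg(n/k) stage count, the 2n/k comparator count, and the n/k bus loading quoted in the text. The function name `scheduler_cost` is hypothetical, and the sketch assumes that n/k is a power of two so that lg(n/k) is an integer.

```python
def scheduler_cost(n, k):
    """Return (stages, comparators, bus_load) for n packets grouped k per module."""
    if n % k != 0:
        raise ValueError("k must divide n")
    modules = n // k
    if modules & (modules - 1) != 0:
        raise ValueError("n/k must be a power of two")
    tree_levels = modules.bit_length() - 1   # lg(n/k)
    stages = (k + 1) + tree_levels           # k serial compares, tree, horizon check
    comparators = 2 * modules                # 2n/k, as stated in the text
    return stages, comparators, modules

for k in (1, 4, 16):
    print(k, scheduler_cost(256, k))
```

For n = 256, moving from k = 1 to k = 4 adds a single stage of delay (10 to 11) while cutting the comparator count and the bus loading by a factor of four; at k = 16 the serial comparisons dominate and the delay roughly doubles.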
VI. Performance Evaluation
To demonstrate the feasibility of the real-time router, and study its scaling properties, a prototype chip has been designed using the Verilog hardware description language and the Epoch silicon compiler from Cascade Design Automation. This framework facilitates a detailed evaluation of the implementation and performance properties of the architecture. The Epoch tools compile the structural and behavioral Verilog models to generate a chip layout and an annotated Verilog model for timing simulations. These tools permit extensive testing and performance evaluation without the expense of chip fabrication.
A. Router Complexity
Using a three-metal, 0.5 µm CMOS process, the 123-pin chip has dimensions 8.1 mm × 8.7 mm for an implementation with 256 time-constrained packets and up to 256 connections, as shown in Table IV. The scheduling logic accounts for the majority of the chip area, with the packet memory consuming much of the remaining space, as shown in Table V. Operating at 50 MHz, the chip can transmit or receive a byte of data on each of its ten ports every 20 nsec. This closely matches the access time of the 10-byte-wide, single-ported SRAM for storing time-constrained traffic; the memory access latency is the bottleneck in this realization of the router. Since time-constrained packets are 20 bytes long, the scheduling logic must select a packet for transmission every 400 nsec for each of the five output ports. To match the memory and link throughputs, the comparator tree consists of a two-stage pipeline, where each stage requires approximately 50 nsec.
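The timing budget above can be checked with a short back-of-the-envelope calculation. The sketch assumes a simplified model that is not stated in the text: each port needs exactly one scheduling decision per packet transmission time, and the shared tree can start a new decision once per pipeline stage time. The function name `ports_supported` is hypothetical; the 115 nsec figure is the unpipelined comparator-tree delay reported for k = 1.

```python
def ports_supported(packet_bytes, byte_time_ns, initiation_interval_ns):
    """Scheduling decisions the shared tree can start in one packet time."""
    packet_time_ns = packet_bytes * byte_time_ns
    return packet_time_ns // initiation_interval_ns

PACKET_BYTES = 20   # time-constrained packet size
BYTE_TIME_NS = 20   # one byte per port per cycle at 50 MHz

two_stage = ports_supported(PACKET_BYTES, BYTE_TIME_NS, 50)      # ~50 nsec stages
unpipelined = ports_supported(PACKET_BYTES, BYTE_TIME_NS, 115)   # full tree delay
print(two_stage, unpipelined)
```

Under these assumptions an unpipelined tree could serve only three ports per 400 nsec packet time, whereas the two-stage pipeline leaves headroom beyond the five ports it must serve.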
Although the tree could incorporate up to five pipeline stages, the two-stage design provides sufficient throughput to satisfy the output ports. This suggests that the link scheduler could effectively support a larger number of packets or additional output ports, for a higher-dimensional mesh topology. Alternately, the router design could reduce the hardware cost of the comparator tree by sharing comparator logic between multiple leaves of the tree, as discussed in Section V-C. Figure 9 highlights the cost-performance trade-offs of logic sharing, based on Epoch implementations and Verilog simulation experiments. As k increases, the scheduler complexity decreases in terms of area, transistor count, and power dissipation, with reasonable increases in scheduler latency. The results start with a grouping size of k = 4, since the Epoch library does not support static RAM components with fewer than four lines. (For k = 1, the graphs plot results from the router implementation in Table V, which uses flip-flops to store packet state at the base of the tree. The Epoch silicon compiler generates a better automated layout of these flip-flops than of the small SRAMs, resulting in better area statistics in Table V, despite the larger transistor count. A manual layout would significantly improve the area statistics for k > 1; still, the area graph shows the relative improvement for larger values of k.) These plots can help guide the trade-off between hardware complexity and scheduler latency in the router implementation. For example, a group size of k = 4 reduces the number of transistors by 45% (from 555,025 to 306,829). The number of transistors does not decrease by a factor of four, since the smaller scheduler still has to store the state information for each packet; in addition, the scheduler requires additional logic and registers to serialize access to the shared comparators. Still, logic sharing significantly reduces implementation complexity.
Larger values of k further reduce the number of comparators and improve the density of the memory at the base of the tree. Scheduler latency does not grow significantly for small values of k. For k = 4, delay in the comparator tree increases by just 67% (from 0.115 µsec to 0.192 µsec). The lower bus loading at the base of the tree helps counteract the increased latency from serializing access to the first layer of comparators and significantly reduces power dissipation.
B. Simulation Experiments
Since Verilog simulations of the full chip are extremely memory and CPU intensive, we focus on a modest set of timing experiments, aimed mainly at testing the correctness of the design. A preliminary experiment tests the baseline performance of best-effort wormhole packets. To study a multi-hop configuration, the router connects its links in the x and y directions. The packet proceeds from the injection port to the positive x link, then travels from the negative x input link to the positive y direction; after reentering the router on the negative y link, the packet proceeds to the reception port. In this test, a b-byte wormhole packet incurs an end-to-end latency of 30 + b cycles, where the link transmits one byte in each cycle. This delay is proportional to packet length, with a small overhead for synchronizing the arriving bytes, processing the packet header, and accumulating five-byte chunks for access to the router's internal bus. In contrast, packet switching would introduce additional delay to buffer the packet at each hop in its route.
An additional experiment illustrates how the router schedules time-constrained packets to satisfy delay and throughput guarantees, while allowing best-effort traffic to capitalize on any excess link bandwidth. Figure 10 plots the link bandwidth consumed by best-effort traffic and each of three time-constrained connections with the following parameters, in units of 20-byte slots:
Connection    d    I_min
0             8    9
1             5    7
2             3    4

All three connections compete for access to a single network link with horizon parameter h = 0, where each connection has a continual backlog of traffic. The time-constrained connections receive service in proportion to their throughput requirements, since a packet is not eligible for service until its logical arrival time. Similarly, the link transmits each packet by its deadline, with best-effort flits consuming any remaining link bandwidth.

[Figure caption: As k grows, implementation complexity decreases but scheduler latency increases.]
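The behavior of the three-connection experiment can be approximated with a small slot-level simulation in Python. This is a sketch under stated assumptions, not a reproduction of the Verilog experiment or of Figure 10: it assumes that a backlogged connection's logical arrival times advance by exactly I_min per packet, that the link serves the on-time packet with the earliest deadline in each slot, and that best-effort traffic fills every slot with no on-time packet (h = 0, so early packets are never sent). The function name `simulate_link` is hypothetical.

```python
def simulate_link(connections, slots):
    """Slot-level model of one link with h = 0.

    connections: list of (d, i_min) pairs, in slot units.
    Returns (packets sent per connection, best-effort slots, missed deadlines).
    """
    next_arrival = [0] * len(connections)   # logical arrival time of head packet
    sent = [0] * len(connections)
    best_effort = 0
    missed = 0
    for t in range(slots):
        # Only packets that have reached their logical arrival time compete.
        on_time = [c for c in range(len(connections)) if next_arrival[c] <= t]
        if not on_time:
            best_effort += 1                # idle slot goes to best-effort traffic
            continue
        # Earliest deadline first among the on-time head packets.
        c = min(on_time, key=lambda i: next_arrival[i] + connections[i][0])
        d, i_min = connections[c]
        if t + 1 > next_arrival[c] + d:     # transmission ends after the deadline
            missed += 1
        sent[c] += 1
        next_arrival[c] += i_min            # backlog: next packet is I_min later
    return sent, best_effort, missed

# (d, I_min) for connections 0, 1, and 2, over one 252-slot hyperperiod.
sent, best_effort, missed = simulate_link([(8, 9), (5, 7), (3, 4)], slots=252)
print(sent, best_effort, missed)
```

Over 252 slots (the least common multiple of the three I_min values), each connection in this model sends 252 / I_min packets, none misses its deadline, and best-effort traffic receives the remaining slots, roughly half of the link.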
VII. Related Work
This paper complements recent work on support for real-time communication in parallel machines [2-7]. Several projects have proposed mechanisms to improve predictability in the wormhole-switched networks common in modern multicomputers. In the absence of hardware support for priority-based scheduling, application and operating system software can control end-to-end performance by regulating the rate of packet injection at each source node [7]. However, this approach must limit utilization of the communication network to account for possible contention between packets, even from lower-priority traffic. This is a particularly important issue in wormhole networks, since a stalled packet may indirectly block the advancement of other traffic that does not even use the same links. The underlying router architecture can improve predictability by favoring older packets when assigning virtual channels or arbitrating between channels on the same physical link [23].
Although these mechanisms reduce variability in end-to-end latency, more aggressive techniques are necessary to guarantee performance under high network utilization. A router can support multiple classes of traffic, such as user and system packets, by partitioning traffic onto different virtual channels, with priority-based arbitration for access to the network links [23]. Flit-level preemption of low-priority virtual channels can significantly reduce intrusion on the high-priority packets. Still, these coarse-grain priorities do not differentiate between packets with different latency tolerances. With additional virtual channels, the network has greater flexibility in assigning packet priority, perhaps based on the end-to-end delay requirement, and restricting access to virtual channels reserved for higher-priority traffic [4, 5].
Coupled with restrictions on the source injection rate, these policies can bound end-to-end packet latency by limiting the service and blocking times for higher-priority traffic [3]. Although assigning priorities to virtual channels provides some control over packet scheduling, this ties priority resolution to the number of virtual channels. The router can support fine-grain packet priorities by increasing the number of virtual channels, at the expense of additional implementation complexity; these virtual channels incur the cost of additional flit buffers and larger virtual channel identifiers, as well as more complex switching and arbitration logic [32]. Instead of dedicating virtual channels and flit buffers to each priority level, a router can increase priority resolution by adopting a packet-switched design.
The priority-forwarding router chip [6] follows this approach by employing a 32-bit priority field in small, 8-packet priority queues at each input port. The router incorporates a priority-inheritance protocol to limit the effects of priority inversion when a full input buffer limits the transmission of high-priority packets from the previous node; the input buffer's head packet inherits the priority of the highest-priority packet still waiting at the upstream router. In contrast, the real-time router implements a single, shared output buffer that holds up to 256 time-constrained packets, with a link-scheduling and memory reservation model that implicitly avoids buffer overflow. By dynamically assigning an 8-bit packet priority at each node, the real-time router can satisfy a diverse range of end-to-end delay bounds, while permitting best-effort wormhole traffic to capitalize on any excess link bandwidth.
VIII. Conclusion
Parallel real-time applications impose diverse communication requirements on the underlying interconnection network. The real-time router design supports these emerging applications by bounding packet delay for time-constrained traffic, while ensuring good average performance for best-effort traffic. Low-level control over routing, switching, and flow control, coupled with fine-grain arbitration at the network links, enables the router to effectively mix these two diverse traffic classes. Careful handling of clock rollover enables the router to support connections with diverse delay and throughput parameters with small keys for logical arrival times and deadlines. Sharing scheduling logic and packet buffers amongst the five output ports permits a single-chip solution that handles up to 256 time-constrained packets simultaneously. Experiments with a detailed timing model of the router chip show that the design can operate at 50 MHz with appropriate pipelining of the scheduling logic. Further experiments show that the design can trade space for time to reduce the complexity of the packet scheduler.
As ongoing research, we are considering alternate link-scheduling algorithms that would improve the router's scalability. In this context, we are investigating efficient hardware architectures for integrating bandwidth regulation and packet scheduling [42]; these algorithms include approximate scheduling schemes that balance the trade-off between accuracy and complexity, allowing the router to efficiently handle a larger number of time-constrained packets. We are also exploring the use of the real-time router as a building block for constructing large, high-speed switches that support the quality-of-service requirements of real-time and multimedia applications. The router's delay and throughput guarantees for time-constrained traffic, combined with good best-effort performance and a single-chip implementation, can efficiently support a wide range of modern real-time applications, particularly in the context of tightly-coupled local area networks.
