Abstract
Introduction
There is a recent surge of interest in single-chip parallel processors. As more and more components are integrated in such systems, designing and implementing the on-chip interconnection network becomes a severe challenge. Throughput and latency are among the most important performance characteristics of such a network, which carries memory requests from processors to memory modules and responses from memory modules back to the processors.
Traditional interconnection networks such as hypercube and butterfly have been used in parallel computing systems
Background
In this section, we briefly review the network topology and the arbitration primitives in the MoT network. Additional details on background, network features, and operation are discussed in [5]. The MoT network consists of two main structures: a set of fan-out (routing) trees and a set of fan-in (arbitration) trees. Figure 1 shows the communication paths from processor clusters to memory modules for three memory requests. The paths of memory requests (0,2), (2,1), and (3,2) are highlighted. Empty circles and squares represent routing (Fig. 2.a) and arbitration (Fig. 2.b) primitives, respectively. There is a unique path between each source and each destination. Each memory request travels from the source through a fan-out tree and then a fan-in tree before it reaches the destination. In the fan-out trees, the routing decision follows trivially from the binary representation of the destination address. There is no routing decision in the fan-in trees, since every packet in a fan-in tree has the same destination.
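The routing rule for fan-out trees can be illustrated with a minimal Python sketch. It assumes that the number of ports is a power of two, that the root of the tree inspects the most significant bit of the destination address, and that a 0 bit selects the left child. The bit order and the name `fan_out_route` are illustrative assumptions and are not taken from [5].

```python
def fan_out_route(dest: int, num_ports: int) -> list[str]:
    """Return the left/right decisions a packet takes through a fan-out tree.

    Assumption: tree level k inspects bit k of the destination address,
    counted from the most significant bit, and 0 means "left".
    """
    if num_ports < 2 or num_ports & (num_ports - 1):
        raise ValueError("num_ports must be a power of two >= 2")
    if not 0 <= dest < num_ports:
        raise ValueError("dest out of range")
    levels = num_ports.bit_length() - 1      # log2(num_ports) routing stages
    bits = format(dest, f"0{levels}b")       # destination address in binary
    return ["right" if b == "1" else "left" for b in bits]
```

Under these assumptions, a request to memory module 2 in a 4-port network turns right at the root and then left. No table lookup or path computation is needed at any node.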
Contention could occur when two packets from different sources to different destinations compete for a shared resource. Fan-out trees eliminate competition between packets from different sources, and fan-in trees eliminate competition between packets to different destinations. This separation avoids contention and improves throughput.
Flow control is handled locally, through handshakes between successive switching primitives. A slightly modified version of a relay station [6] is used to prevent data loss when the successor primitive is full. Figure 2 illustrates the switching primitives in our MoT network. Each node in the fan-out and fan-in trees of the network is implemented using the fan-out (Figure 2(a)) or fan-in (Figure 2(b)) primitive. The pipeline primitive (Figure 2(c)) is used to divide long wires into multiple short segments.
The main results of [5] show that MoT sustains high throughput and low latency at high traffic rates. The peak throughput, or network capacity, of a MoT network is 1.0 packets per cycle (ppc) at each port, and the bisection bandwidth is equal to N ppc, where N is the number of ports. For example, in a 64-terminal MoT, the average throughput of all terminals reaches 0.98 ppc under sustained uniform traffic injected at 100% of network capacity. Average packet latency is 14 cycles at a low traffic rate such as 10% of network capacity, and 23 cycles at a high traffic rate such as 90% of network capacity.
Design and Implementation Flow
In this section we first explain the importance of validating the previous results on the MoT network with a cycle-accurate Verilog simulator. We then modify the arbitration primitive to support the store operation. We describe the physical design of the MoT network as a further step towards evaluating its layout-accurate performance. Finally, pipelines are inserted to deal with the long wire delays.
Cycle-Accurate Validation
In [5], the performance model of the MoT network was evaluated using a custom-made simulator written in C++ with SystemC libraries. No earlier study used a cycle-accurate simulator to verify the MoT network model of [5]. To demonstrate accuracy, some butterfly network simulations were compared with the "booksim" simulator of [9]. However, the simulator in [5] is optimized for the MoT network, whereas the simulator in [9] is optimized for traditional networks such as hypercube and butterfly. Therefore, the accuracy of the comparison was limited.
Prior to the current paper, switch primitives had been individually synthesized into generic technology, but the whole MoT network had not been synthesized and verified. Therefore, a realistic hardware model was not available for validation. In this paper we derive a synthesizable Verilog model of the full MoT network using our own high-level synthesis tool. We perform RTL and gate-level netlist simulations, and validate the earlier results.
We assume a uniform traffic pattern, which is expected for the memory architecture described in [16], due to the use of a hashing mechanism [2, 4, 10, 15].
Modified Arbitration
The smallest unit of information flowing in the network is called a flit, or flow control digit [9]. The performance model of [5] is based on exchanging single-flit packets between terminals. In the case of a load operation, the processor sends the address to the memory module, and the memory module responds with the requested data. In this most common mode of operation, each packet consists of a single flit that is wide enough to contain either the address or the data.
In the case of a store operation, the processor sends both the address and the data to the memory module. A flit could be made wide enough to hold both the address and the data; however, this would waste bandwidth when load instructions are sent through the network. Alternatively, a store packet could consist of two flits that are injected consecutively into the network. In this case additional effort is required to relate the address and data flits that belong together and to perform the correct operation. We consider the following two options for handling store operations.
• Both flits of an address-data pair are marked with an identifier tag and sent as individual single-flit packets. The memory commits the operation when the second flit with the matching tag arrives. This method requires computation in the processor and the memory module. The network remains unchanged. This is called fair bandwidth arbitration [9], since the arbitration primitives perform fair arbitration regardless of the type of the packet.
• The second flit is chained to the first one, and they follow each other through the network. The memory receives the pair consecutively. This method requires computation in the network. Specifically, the arbitration primitive must ensure that the second flit immediately follows the first one. This introduces a temporary bias into the arbitration operation. The processor and the memory modules remain unchanged. This method is called winner-take-all arbitration [9]. Extra logic in the arbitration primitive may increase the clock period and, therefore, reduce throughput. On the other hand, this method reduces the average packet latency of multi-flit packets in terms of clock cycles.
We modified the regular arbitration primitive to perform winner-take-all arbitration. We implemented both arbitration primitives and evaluated their performance with a cycle-accurate Verilog simulator. The results show that they both provide similar throughput improvement over the single-flit arbitration used in [5]. The improvement is significant, especially when load operations dominate. See Section 4.1 for details.
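The difference between the two arbitration policies can be seen in a small behavioral model of a single two-input arbitration primitive. The sketch below is written in Python rather than Verilog and is not the primitive used in this work. It assumes that all flits are already waiting at the inputs, that one flit is granted per cycle, and that ties are broken round-robin. The function name `arbitrate` and the `(packet_id, is_last_flit)` flit encoding are hypothetical.

```python
from collections import deque

def arbitrate(left, right, winner_take_all):
    """Merge two flit streams through one 2-input arbitration primitive.

    Each flit is a (packet_id, is_last_flit) tuple; one flit is granted
    per cycle. Returns the packet ids in the order the flits leave.
    """
    queues = [deque(left), deque(right)]
    last_granted = 1      # so that the left input wins the first tie
    locked = None         # input that currently owns the output, if any
    out = []
    while queues[0] or queues[1]:
        if winner_take_all and locked is not None and queues[locked]:
            choice = locked                    # keep serving the same packet
        else:
            preferred = 1 - last_granted       # round-robin between inputs
            choice = preferred if queues[preferred] else 1 - preferred
        packet_id, is_last = queues[choice].popleft()
        out.append(packet_id)
        last_granted = choice
        locked = None if is_last else choice   # release after the last flit
    return out
```

With two competing two-flit store packets A and B, the fair policy in this model interleaves the flits as A, B, A, B, while winner-take-all delivers A, A, B, B, so the first packet completes one cycle earlier.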
Physical Design
For the layout of the MoT network, we start with an RTL-level Verilog description of the switch primitives. Our own high-level synthesizer generates higher-level modules, such as balanced binary trees.
Network Layout
The wire area of the MoT network grows as $O(N^2 \log^2 N)$, and the number of tree nodes grows as $O(N^2)$, where N is the number of terminals [5]. This would imply that the wire area dominates the cell area, and that floorplanning must consider wire area constraints. However, synthesis results with this particular technology and standard cell library show that the cell area is larger than the wire area for practical numbers of terminals. Table 1 shows these results for the network configurations considered in this paper. Wire area grows faster and can exceed cell area for larger numbers of terminals and bits per flit. Therefore, our floorplan and placement strategy in this study is based on the cell area of the network.
In a network with N terminals, we create N/2 partitions in order to improve layout quality during placement and routing. Figure 3 shows a network with 8 terminals that has 4 partitions, marked $P_0$ to $P_3$. An initially square floorplan is divided into partitions, and each partition is individually placed, routed, and optimized. Depending on other geometrical factors, such as the height and width of the terminal modules, two partitions could be separated by a gap.
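To make the two growth rates concrete, the following Python sketch evaluates the asymptotic terms only. It reads the wire-area bound as N^2 log^2 N with base-2 logarithms and omits all technology-dependent constants, so the outputs are not areas and cannot be compared with Table 1. The function name `growth_terms` is hypothetical.

```python
import math

def growth_terms(num_terminals):
    """Return (node_term, wire_term, ratio) for the asymptotic bounds.

    Constants are omitted: node_term = N^2, wire_term = N^2 * log2(N)^2.
    """
    if num_terminals < 2:
        raise ValueError("need at least two terminals")
    node_term = num_terminals ** 2
    wire_term = num_terminals ** 2 * math.log2(num_terminals) ** 2
    return node_term, wire_term, wire_term / node_term
```

Under these assumptions the ratio of the wire term to the node term is log2(N)^2, which is 9 for 8 terminals and 36 for 64 terminals. Whether wire area actually exceeds cell area at a given size depends on the constants of the technology and cell library.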
Terminal Circuits
Ideally, our network would interconnect parallel processors and memory modules. We use a terminal node to replace each pair of cluster and memory module. In order to focus on the interconnection network, these nodes are dummy terminals that generate random requests based on programmable parameters, and record statistics upon receiving a packet.
The terminal modules do not affect the critical delay path of the network modules. However, since they generate packets and record arrivals at each cycle, their critical delay path affects the operating frequency of our taped-out chip. Therefore, we report critical delays for the network module separately.
Pipeline Insertion
Long wires in the MoT network could increase the clock period and reduce the throughput. Inserting pipeline registers on long wires would improve performance [3, 7, 13, 14]. Earlier work [5] proposed using a pipeline primitive to cut long wires into shorter segments. However, the benefits could not be demonstrated without a physical layout.
Pipeline insertion can be automated in several ways. State-of-the-art synthesis tools are capable of inserting repeaters. However, they are usually unaware of final wire lengths. Place and route tools can insert any standard cell or module into an existing netlist and connect it to the rest of the circuit. However, this requires the use of low-level commands of the specific tools, and may not be portable. Furthermore, state changes in the circuit cannot be traced back to the RTL level. This could complicate verification and performance evaluation. Our high-level synthesis tool inserts pipeline registers at the RTL level. As a result, the network has a portable and coherent state machine view through the entire physical design flow.
It is challenging to estimate the optimal wire length to fit in a single pipeline stage. It involves multiple physical design iterations. Furthermore, CAD tools perform several proprietary and heuristic optimizations. Therefore, it is virtually impossible to estimate the exact wire length between two consecutive registers before the layout is finalized.
In this prototyping study, we follow a high-level heuristic approach to determine the amount of pipelining, guided by the wire length between the centers of partition $P_i$ and the second partition $P_{i+2}$, denoted as $L_{pipe}$ in Figure 3. Thus, we allow the signals to pass over one full partition $P_{i+1}$ without being stored in a pipeline register. For lack of space, we only note that following this model, an 8-terminal network would not require pipelining. Furthermore, 16- and 32-terminal networks will ideally operate at the same frequency as the 8-terminal network.
Results and Discussion
In this section, we first present simulation results that validate the claims of [5] and provide the average throughput per cycle. Then, we lay out networks with 4, 8, 16 and 32 terminals, and obtain their clock rates. The combination of both results gives the layout-accurate average throughput for MoT. Finally, we taped out the 8-terminal design for fabrication.
We used IBM CMOS9SF 90nm technology and regular ARM/Artisan SAGE-X standard cells. Typical operating conditions ($V_{DD}$; $T$) for this library are given as 1.2 V; 25 °C. In this paper we report delay estimates for slow-corner (worst-case) operating conditions, such as 1.08 V; 125 °C.
Simulation Results
Latency and throughput characteristics for a 64-terminal network are compared with the results of [5] in Figure 4. Table 2 compares the average throughput at the highest traffic rate, and latency at three traffic levels. Low, High, and Max represent flit generation rates of 10%, 90%, and 100% of network capacity. Throughput is averaged over all terminal ports, and latency is averaged over all recorded packets. Compared to the results of [5], throughput differs by 1% to 2%. Latency results for the 64-terminal MoT network are 17% higher for the low-traffic case and 6.5% lower for the high-traffic case. Such deviations are expected due to the different implementation of the source queue component, as described in Section 3.1.
We simulate the network with different ratios of 1-flit and 2-flit packets, to model a mixture of load and store operations. Traffic rate is adjusted for each run so that average flit injection rate remains constant at the maximum capacity of the network, namely 1 flit per cycle per port. Higher traffic rates would saturate the source queue in the terminal. In that case several packets would be dropped, and the mixture rate could change. For example, a mixture ratio of 30% means that each cycle there is a 77% probability of generating a packet. Additionally, the generated packet has two flits with a probability of 30%, and one flit with a probability of 70%. As a result, the average rate of flit generation is 1.0 per cycle. We simulated fair arbitration and winner-take-all arbitration methods as described in Section 3.2. The variation in latency and throughput for 64-terminal network is shown in Figure 5 . The wide flit case assumes that the flit width is doubled so that any one of load or store operations fits in a single flit. 
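The 77% figure follows from holding the average flit injection rate constant. A minimal Python sketch of that arithmetic is given below; the function name `packet_probability` and its parameters are hypothetical and are not part of the simulator.

```python
def packet_probability(store_ratio, flit_rate=1.0):
    """Per-cycle probability of generating a packet.

    store_ratio: fraction of packets that carry two flits (stores).
    flit_rate:   target average number of flits injected per cycle per port.
    """
    if not 0.0 <= store_ratio <= 1.0:
        raise ValueError("store_ratio must be between 0 and 1")
    # Average packet length: one flit for loads, two flits for stores.
    flits_per_packet = 1.0 * (1.0 - store_ratio) + 2.0 * store_ratio
    return flit_rate / flits_per_packet
```

For a mixture ratio of 30%, the average packet carries 1.3 flits, so a packet must be generated with probability 1/1.3, or about 77%, to inject one flit per cycle on average.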
Figure 5. 64-terminal MoT simulation results for different methods of handling store operations
Simulations show that using multiple flits for store instructions improves throughput for almost all mixture ratios. There is no significant difference between two methods of arbitration. Layout of both arbitration primitives shows that the increase in clock period due to additional logic is negligible. Latency is improved for low amounts of store instructions, but this could also be caused by the source queue implementation. For 64-terminal network, fair arbitration has slightly lower latency. Additional simulations show that for a 4-terminal network, winner-take-all has lower latency. We conclude that the number of flits in a store instruction is not sufficiently high to make a difference in latency. Further studies with more flits per packet would be beneficial to evaluate MoT performance for cases where multiple data words are moved through the network, such as loading or storing long vectors or streams.
Layout Results
Following the standard flow of the Cadence tools, we synthesized, placed and routed networks with different configurations. Table 3 shows the area and performance results.
We extended the 8-terminal configuration with power routing and I/O pads for fabrication. The final layout is shown in Figure 6.
Figure 6. Final layout of 8-terminal chip.
Table 3 shows that the clock frequency decreases as the number of terminals increases. This is mainly caused by longer wires on the critical path. Results of the pipelined configurations 16p and 32p show the benefit of pipelining on frequency and throughput. Average latency increases in the pipelined configurations due to the increased number of stages between some sources and destinations.
Partitioning constraints prevented optimal pipeline placement on long wires. Therefore, the improved frequency did not reach the expected level of an 8-terminal network. Reducing the critical length for pipelining could improve performance. Pipeline circuits would then be placed within the partitions, instead of between them. Such improvements could incur additional area and latency costs. Evaluation of these trade-offs requires further study. Table 3 shows that the cell area of the laid-out networks exceeds the estimates (Table 1), since the layout tool optimizes for performance by inserting repeaters and using larger cells.
The cell area of 32p is larger than that of 32, as expected, due to the additional pipeline stages. In 16p, the area of the added pipeline stages turns out to be comparable to that of the large repeaters on the long wires of 16. Therefore, the area of 16p is approximately equal to the area of 16.
The area of the bounding box is approximately twice the cell area, because of the gaps between partitions and overestimated design margins. We introduced gaps between partitions in order to align the partitions with the terminals (Figure 3). The size of the gaps depends on the area and aspect ratio of the terminal circuits. In an ongoing study, we are investigating the relationship between processor geometry and MoT area and performance. In this prototyping study we did not optimize for area. However, based on Table 1, we expect the actual area to be close to the cell area.
Power consumption has been estimated based on the layout and on simulated switching activity at the highest traffic rate. As expected, the power consumption grows quadratically with the number of terminals, that is, at the same rate as the number of cells. Pipelining increases power consumption both by adding more cells and by increasing the operating frequency. In this study, we did not optimize for power consumption. However, typical approaches such as clock gating could reduce power consumption.
Impact on Single-Chip Parallel Processing
A clear lesson of several decades of parallel computing research is that the issue of parallel programming must be properly resolved. The Parallel Random Access Model (PRAM) is an easy model for parallel algorithmic thinking and for programming, as recognized by Culler and Singh [8] and by at least 3 major standard texts on serial algorithms and data structures. Earlier attempts to support PRAM with a multi-chip multiprocessor (e.g., TERA-MTA [2], SB-PRAM [17], NYU Ultracomputer and the IBM RP3 [1, 10]) were constrained by memory access performance and had limited success.
The "PRAM-on-Chip" project at the University of Maryland seeks to advance the implementation of PRAM in a single-chip parallel processor, using an eXplicit Multi-Threading (XMT) architecture (see Appendix). The XMT architecture eliminates local private caches in order to avoid cache coherence issues, and uses a hashing mechanism to avoid hot spots [16]. This dramatically increases the load on the interconnection network and makes the network traffic reasonably uniform, rendering current interconnection networks ineffective. The MoT network, as described in Section 2, promises high throughput and low latency. The current work brings the concept of the MoT network closer to silicon and thus has significant impact.
Based on a recent design [20], each terminal port of the network could serve up to 16 processors and up to two globally shared cache modules. Our results show that a pipelined 16-terminal network, supporting up to 256 processors, operates at 748 MHz. With 30-bit wide channels, it provides a peak throughput of 359 Gbps, and an average throughput of 334 Gbps under uniform traffic.
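The peak throughput figure can be reproduced from the numbers quoted in this paragraph. The Python sketch below assumes that each port delivers at most one flit per cycle, as stated for network capacity in Section 2; the function name `peak_throughput_gbps` is hypothetical.

```python
def peak_throughput_gbps(ports, bits_per_flit, clock_mhz, flits_per_cycle=1.0):
    """Aggregate throughput in Gbps when every port delivers flits_per_cycle."""
    return ports * bits_per_flit * clock_mhz * flits_per_cycle / 1000.0

peak = peak_throughput_gbps(ports=16, bits_per_flit=30, clock_mhz=748)
# Average per-port rate implied by the reported 334 Gbps average throughput.
implied_ppc = 334.0 / peak
```

Here `peak` evaluates to about 359 Gbps, and the reported average of 334 Gbps corresponds to roughly 0.93 flits per cycle per port.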
Conclusion
We perform cycle-accurate Verilog simulation to validate the earlier results on the MoT network, which has clear advantages in throughput and latency over traditional interconnection networks. For example, for a 64-terminal network, earlier results overestimated throughput by only 2%, and latency by 6.5% at high traffic load. We propose two extensions that further improve the throughput of MoT: support for store operations and avoidance of long wire delays. We conduct the physical design and obtain layouts for MoT networks of various sizes. The layout of the 8-terminal network has recently been accepted by the foundry for fabrication.
While our initial layouts of switch primitives indicate an average throughput of 4.6Tbps in the ideal case for a 64-terminal MoT network, practical constraints led us to defer seeking such rate to future work, perhaps not in a university environment.
A. Explicit Multi-Threading Architecture
The eXplicit Multi-Threading (XMT) on-chip general-purpose computer architecture [16] is aimed at the classic goal of reducing single-task completion time. It is a parallel algorithmic architecture in the sense that it seeks to provide good performance for parallel programs derived from Parallel Random Access Machine/Model (PRAM) algorithms. Ease of parallel programming is now widely recognized as the main stumbling block for extending commodity computer performance growth (e.g., using multi-cores). XMT provides a unique answer to this challenge. The first commitment of XMT to silicon is reported in [20]. A 64-processor, 75 MHz computer based on field-programmable gate array (FPGA) technology was built at the University of Maryland (UMD).
The PRAM virtual model of computation assumes that any number of concurrent accesses to a shared memory take the same time as a single access. In the Arbitrary Concurrent-Read Concurrent-Write (CRCW) PRAM, concurrent accesses to the same memory location for reads or writes are allowed. Reads are resolved before writes, and an arbitrary write, unknown in advance, succeeds. Design of an efficient parallel algorithm for the Arbitrary CRCW PRAM model would seek to optimize the total number of operations the algorithm performs ("work") and its parallel time ("depth"), assuming unlimited hardware. Given such an algorithm, an XMT program is written in XMTC, which is a modest single-program multiple-data (SPMD) multi-threaded extension of C that includes 3 commands: Spawn, Join, and PS (Prefix-Sum, a Fetch-and-Increment-like command). The program seeks to optimize: (i) the length of the (longest) sequence of round trips to memory (LSRTM), (ii) queuing delay to the same shared memory location (known as QRQW), and (iii) work and depth (as per the PRAM model). Optimizing these ingredients is a responsibility shared in a subtle way between the architecture, the compiler, and the programmer/algorithm designer. See also [19]. For example, the XMT memory architecture requires a separate round trip to the first level of the memory hierarchy (MH) over the interconnection network for each and every memory access, unless something (e.g., prefetch) is done to avoid it; our LSRTM metric accounts for that. While we took advantage of Burton Smith's latency-hiding pipelining technique for code providing abundant parallelism, the LSRTM metric guided the design towards good performance from any amount of parallelism, even if it is rather limited. Moving data between MH levels (e.g., main memory to first-level cache) is generally orthogonal and amenable to standard caching approaches.
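As an illustration of how a Fetch-and-Increment-like PS command can coordinate threads without locks, the following serial Python emulation compacts the nonzero entries of an array. It is not XMTC syntax. It assumes that PS returns the previous value of a shared base and adds the increment to it as one indivisible step. The names `PrefixSumBase` and `compact_nonzero` are hypothetical, and the shuffled loop merely stands in for threads of a spawn-join section completing in an arbitrary order.

```python
import random

class PrefixSumBase:
    """Shared base value updated by ps(), emulated serially."""

    def __init__(self, value=0):
        self.value = value

    def ps(self, increment):
        # Return the old base and add the increment as one indivisible step.
        old = self.value
        self.value += increment
        return old

def compact_nonzero(values, seed=0):
    """Copy the nonzero entries of `values` into a dense output list."""
    base = PrefixSumBase(0)
    out = [None] * len(values)
    thread_ids = list(range(len(values)))
    random.Random(seed).shuffle(thread_ids)   # threads may run in any order
    for tid in thread_ids:                    # stands in for spawn ... join
        if values[tid] != 0:
            slot = base.ps(1)                 # each thread gets a distinct slot
            out[slot] = values[tid]
    return out[:base.value]
```

Whatever order the emulated threads run in, every nonzero element receives a distinct slot, so the result always contains the same elements, although their order may differ.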
In addition to XMTC many other application-programming interfaces (APIs) will be possible; e.g., VHDL/Verilog, MATLAB, and OpenGL.
The well-developed PRAM algorithmic theory is second in magnitude only to its serial counterpart, well ahead of any other parallel approach. Circa 1990, popular serial algorithms textbooks already had a big chapter on PRAM algorithms. Theorists (UV included) also claimed for many years that the PRAM theory is useful. However, the PRAM was generally deemed useless (e.g., see the 1993 LogP paper). Since the mid-1990s, PRAM research has been reduced to a trickle, most of its researchers left it, and later book editions dropped their PRAM chapter. The 1998 state of the art is reported in Culler and Singh's parallel computer architecture book: ".. breakthrough may come from architecture if we can truly design a machine that can look to the programmer like a PRAM". In 2007, we are a step closer, as hardware replaces a simulator and the interconnection network is being realized in ASIC. The current paper is part of an overall effort to advance the perception of PRAM implementability from impossible to available. The effort provides freedom and opportunity to pursue PRAM-related research, development and education without waiting for vendors to make the first move. For example, consider [18].
Overview of the XMT Architecture
The XMT processor (see Fig. A.1) includes a master thread control unit (MTCU), processing clusters (C0...Cn in Fig. A.1) each comprising several TCUs, a high-bandwidth low-latency interconnection network, memory modules (MMs) each comprising on-chip cache and off-chip memory, a global register file (GRF) and a prefix-sum unit. Fig. A.1 suppresses the sharing of a memory controller by several MMs. The processor alternates between serial mode, where only the MTCU is active, and parallel mode. The MTCU has a standard private data cache used only in serial mode and a standard instruction cache. The TCUs do not have a write data cache. They and the MTCU all share the MMs. [16] describes: (i) how the XMT apparatus of program counters and stored program extends the standard von Neumann serial apparatus, (ii) how virtual threads coming from an XMTC program (these are not OS threads) are allocated dynamically at run time, for load balancing, to TCUs, (iii) the hardware implementation of the PS operation and its coupling with a global register file (GRF), (iv) the independence of order semantics (IOS), which allows a thread to advance at its own speed without busy-waiting for other concurrent threads, and its tie to Arbitrary CW, and (v) how a more general design ideal, called no-busy-wait finite-state-machines (NBW FSM), guides the overall design of XMT. In principle, the MTCU is an advanced serial microprocessor that can also execute XMT instructions such as spawn and join. A typical program execution flow is shown in Fig. A.2. The MTCU broadcasts the instructions in a parallel section, which starts with a spawn command and ends with a join command, on a bus connecting to all TCU clusters. In parallel mode a TCU can execute one thread at a time. TCUs have their own local registers, and they are simple in-order pipelines including fetch, decode, execute/memory-access and write-back stages.
We aspire to have 1024 TCUs in 64 clusters in the future. A cluster has functional units shared by several TCUs and one load/store port to the interconnection network, shared by all its TCUs. The global memory address space is evenly partitioned into the MMs using a form of hashing. In particular, the cache-coherence problem, a challenge for scalability, is eliminated: in principle, there are no local caches at the TCUs. Within each MM, the order of operations to the same memory location is preserved; a store operation is acknowledged once the cache module accepts the request, regardless of whether it is a cache hit or miss. Some performance enhancements were already incorporated in the XMT computer, seeking to optimize LSRTM and queuing delay:
(i) Broadcast: in case most threads in a spawn-join section need to read a variable, it is broadcast through the instruction broadcasting bus to the TCUs rather than read serially from the shared memory.
(ii) Software prefetch mechanism with hardware support to alleviate the interconnection network round-trip delay. A prefetch instruction brings the data to a prefetch buffer at the TCUs.
(iii) Non-blocking stores, where the program allows a TCU to advance once the interconnection network accepts a store request, without waiting for an acknowledgement.
(iv) Read-only buffer: within a TCU cluster, read requests to the same memory location from multiple TCUs operating concurrently can be replaced by a single request into the interconnection network. This optimization is applied only to addresses that cannot be written into by any concurrent thread.
Conclusion
Using on-chip low-overhead mechanisms, including a high-throughput interconnection network, XMT executes PRAM-like programs efficiently. As XMT evolved from PRAM algorithms, it gives (i) an easy general-purpose parallel programming model, while still providing (ii) good performance with any amount of parallelism provided by the algorithm (up- and down-scalability), and (iii) backwards compatibility on serial code using its powerful MTCU with its local cache. Most other parallel programming approaches need more coarse-grained parallelism, requiring a (painful to program) decomposition step.
