XMT-M: A Scalable Decentralized Processor by Berkovich, Efraim et al.
Efraim Berkovich, Joseph Nuzman, Manoj Franklin, Bruce Jacob, and Uzi Vishkin
Department of Electrical and Computer Engineering, and
University of Maryland Institute for Advanced Computer Studies (UMIACS)
University of Maryland, College Park, MD 20742

Abstract

A defining challenge for research in computer science and engineering has been the ongoing quest for reducing the completion time of a single computation task. Even outside the parallel processing communities, there is little doubt that the key to further progress in this quest is to do parallel processing of some kind. A recently proposed parallel processing framework that spans the entire spectrum from (parallel) algorithms to architecture to implementation is the explicit multi-threading (XMT) framework. This framework provides: (i) simple and natural parallel algorithms for essentially every general-purpose application, including notoriously difficult irregular integer applications, and (ii) a multi-threaded programming model for these algorithms which allows an "independence-of-order" semantics: every thread can proceed at its own speed, independent of other concurrent threads. To the extent possible, the XMT framework uses established ideas in parallel processing.

This paper presents XMT-M, a microarchitecture implementation of the XMT model that is possible with current technology. XMT-M offers an engineering design point that addresses four concerns: buildability, programmability, performance, and scalability. The XMT-M hardware is geared to execute multiple threads in parallel on a single chip: relying on very few new gadgets, it can execute parallel threads without busy-waits! Existing code can be run on XMT-M as a single thread without any modifications, thereby providing backward compatibility for commercial acceptance.
Simulation-based studies of XMT-M demonstrate considerable improvements in performance relative to the best serial processor even for small, and therefore practical, input sizes.

Keywords: Fine-grained SPMD, independence of order semantics, instruction-level parallelism (ILP), no-busy-wait finite state machines, parallel algorithms, prefix-sum, and spawn-join.

1 Introduction

The coming years promise to be exciting ones in the area of computer architecture. Continued scaling of sub-micron technology will give us orders of magnitude increase in on-chip hardware resources. Even by conservative estimates a single chip will have a billion transistors in a few years. Exploiting parallelism in a big way is a natural way to translate this increase in transistor count to completing individual tasks faster.

Parallelism has traditionally been exploited at coarse- and fine-grained levels. Emphasizing the buildable in the short term, traditional techniques targeting coarse-grained parallelism have focused primarily on MPPs (massively parallel processors). Although MPPs provide the strongest available machines for some time-critical applications, they have had very little impact on the mainstream computer market [9].
Most computers today are uniprocessors, and even large servers have only modest numbers of processors. A recent report from the President's Information Technology Advisory Committee (PITAC) [19] has acknowledged the importance and difficulty of achieving scalable application performance on today's parallel machines. According to the report, "there is substantive evidence that current scalable parallel architectures are not well suited for a number of important applications, especially those where the computations are highly irregular or those where huge quantities of data must be transferred from memory to support the calculation".

The commodity microprocessor industry has traditionally looked to fine-grained or instruction level parallelism (ILP) for improving performance, with sophisticated microarchitectural techniques (such as pipelining, branch prediction, out-of-order execution, and superscalar execution) and sophisticated compiler optimizations, but with little help from programmers. Such hardware-centered techniques appear to have scalability problems in the sub-micron technology era, and are already appearing to run out of steam. Compiler-centered techniques also are handicapped, primarily due to the artificial dependencies introduced by serial programming.

On analyzing this scenario, it becomes apparent that the huge investment in serial software has forced programmers to hide most of the parallelism present in an application by expressing the algorithm in a serial form, and delegating it to the compiler and the hardware to re-extract (a part of) that hidden parallelism. The result has been that both hardware complexity and compiler complexity have been increasing monotonically, with a less satisfying improvement in performance! However, we are reaching a point in time when such evolutionary approaches can no longer bear much fruit, because of increasing complexity and fast approaching physical limits.
According to a recent position paper by Dally and Lacy [9], "over the past 20 years, the increased density of VLSI chips was applied to close the gap between microprocessors and high-end CPUs. Today this gap is fully closed and adding devices to uniprocessors is well beyond the point of diminishing returns".

To get significant increases in computing power, a radically different approach may be needed. One such approach is to "set free the crippled programmers", so that they are not forced to suppress the parallelism they observe, and are instead allowed to explicitly specify the parallelism. The books [2], [8], and [18] attest to the many great ideas that the parallel computing field has developed over the years, although some of the ideas were ahead of the implementation technology and are still waiting to be put to practical use. Culler and Singh, in their recent book on Parallel Computer Architecture [8], mention under the title "Potential Breakthroughs" (p. 961): "breakthrough may come from architecture if we can somehow design machines in a cost-effective way that makes it much less important for a programmer to worry about data locality and communication; that is, to truly design a machine that can look to the programmer like a PRAM." The recently proposed explicit multi-threading (XMT) framework [29] was influenced by a hope that this can be done.

We view ILP as the main success story among forms of parallelism thus far, as it was adopted in a big way in the commercial world for reducing the completion time of general purpose applications. XMT aspires to expand the ILP "parallelism bridgehead" with the "ground forces" of algorithm-level parallelism (which is guided by a sound theoretical foundation), by letting programmers express both fine-grained and coarse-grained parallelism in a natural way (see footnote 1).

Footnote 1. A parallel programming methodology designed to reduce data access, communication, and synchronization costs on current multiprocessors is described in Section 2.2 of [8]. There has also been a related evolutionary approach to let programmers express some of the (coarse-grained) parallelism with the use of heavy-weight forks (carried out by the operating system) and light-weight threads (using library functions), to be run on multiprocessors. However, it has not yet been demonstrated that general-purpose applications could benefit much from these techniques; two concrete but pointed examples are breadth-first-search on graphs and searching directed acyclic graphs; more generally, irregular integer applications of the kind taught in standard Computer Science algorithms and data-structure courses.
The XMT framework also permits decentralized and scalable processors, with reduced hardware complexity. Decentralization is very important, because in the future, wire delays will become the dominant factor in chip performance [31]. By wire delays we mean on-chip delay of connections between gates. The Semiconductor Industry Association estimates that, within a decade, only 16% of a chip will be reachable in a clock cycle [31]. Microarchitectures will have to use decentralization techniques to tolerate long on-chip communication latencies, i.e., localize communication so as to make infrequent use of cross-chip signal propagation.

The objective of this paper is to explore a decentralized microarchitecture implementation for the XMT paradigm. The highlights of the investigated microarchitecture, called XMT-M, are: (i) decentralized processing elements that can execute multiple threads in parallel, (ii) independence of order among the concurrent threads, (iii) relaxed memory consistency model, and (iv) buildability (with current technology).

The rest of this paper is organized as follows. Section 2 provides background material on the XMT framework. Section 3 describes XMT-M, a realizable microarchitecture implementation of the XMT paradigm. Section 4 presents an experimental analysis of XMT-M's performance, conducted with a detailed simulator. In particular, it shows that even for small input sizes, the XMT processor's performance is significantly better than that of the best serial processor that has comparable hardware. Section 5 discusses related work and highlights the differences with the XMT approach. Finally, Section 6 presents the major conclusions and directions for future work.

2 Explicit Multi-Threading (XMT)

The XMT framework is grounded in a rather ambitious vision aimed at on-chip general-purpose parallel computing [29]; that is, presenting a competitive alternative to the state-of-the-art serial processors and their successors in the next decade.
Towards that end, the broad XMT framework spans the entire spectrum from algorithms through architecture to implementation. This section provides a brief description of the XMT framework.

2.1 The XMT Programming Model

The programming model underlying the XMT framework is parallel in nature, as opposed to the serial model used in most computers. To be specific, XMT uses an arbitrary CRCW (concurrent read concurrent write) SPMD (single program multiple data) programming model. SPMD implies concurrent threads that execute the same code on different data; it is a more implementable extension of the classical PRAM model [18]. The XMT threads can be moderately long, providing some locality of reference (see footnote 2). Figure 1 illustrates the XMT programming model. The (virtual) threads, initiated by the Spawn and terminated by the Join, have the same code. At run-time, different threads may have different lengths, based on the control flow paths taken through them. The arbitrary CRCW aspect of the model dictates that concurrent writes into the same memory location result in an arbitrary one among these writes succeeding; that is, one thread writes into the memory location while the others bypass to their next instruction. This permits each thread to progress at its own speed from its initiating Spawn to its terminating Join, without ever having to wait for other threads; that is, no thread ever does a busy-wait for another thread. Inter-thread synchronization occurs only at the Joins. We say that the XMT programming model inherits

Footnote 2. It is important to note that traditional multiprocessors exploit coarse-grain parallelism with the use of very long threads, which provide even more locality. However, programmers find it more difficult to reason about coarse-grain parallelism than fine-grain parallelism. Thus, there is a trade-off between thread length (which affects locality of reference) and programmability.

























    int base = 0;    # a global variable, shared by all threads
    Spawn(n)         # spawn threads with ID 0 through n-1
    {
        int t_id;    # thread ID number: between 0 and n-1
                     # local variable, private to each thread
                     # check if array element is non-zero
                     # perform prefix-sum to get new value for e
                     # copy element from array A to B (index e)
                     # implied Join

Figure 2: The array compaction problem. The non-zero values in array A are copied to array B, in an arbitrary order. The code on the right hand side gives an XMT high-level program to solve the array compaction problem.

2.3.1 Register Model

The XMT ISA specifies two types of registers: global and local. The global registers are visible to all threads of a spawn-join pair and the serial thread. The local registers are specific to each virtual thread, and are visible only to the relevant thread. To provide compatibility with existing binaries, assemblers, compilers, development tools, and operating systems, one can simply divide the register space into two partitions so that the lower partition refers to the global registers and the upper partition refers to the local registers. For instance, a reference to register R5 in the MIPS assembly language implies a global register access, while a reference to register R37 implies access to a local register.

2.3.2 Memory Model

The XMT framework supports a shared memory model; that is, all concurrent threads see a common, shared memory address space.

Memory Consistency Model. The XMT memory model supports a weak consistency model between the concurrent threads of a Spawn-Join pair. This stems from XMT's independence of order semantics. Memory reads and writes from concurrent threads are generally not ordered. When an inter-thread ordering is required, a prefix-sum instruction is used. The following example code illustrates this.

    sw   t1, A(t0)      # write local value to A[i], based on the thread ID
    psi  g1, t2, 1      # use PS to coordinate thread accesses, g1 is initialized to 0
    beq  t2, r0, DONE   # if the PS result is zero, we're done
    xori t3, t0, 1      # if i is even, t3=i+1; else t3=i-1
    lw   t4, A(t3)      # load A[t3] into t4

The above code can result in the following sequence of events:

    Thread 0                         Thread 1
    Write to A[0]                    Write to A[1]
    PS operation (gets 0)            PS operation (gets 1)
    PS result is 0, so go to DONE    PS result is 1, so continue
                                     Read A[0]
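The ordering argument in this example can be checked with a small simulation. The sketch below is an illustration only, not XMT code: it assumes that the `psi` instruction behaves like an atomic fetch-and-add (return the old base value, then add the increment), and the names `PrefixSumRegister`, `thread_body`, and `run_pair` are hypothetical. Two Python threads each write their own slot of `A`, perform a prefix-sum on a shared base initialized to 0, and read the partner's slot only if the prefix-sum returned a non-zero value.

```python
import threading


class PrefixSumRegister:
    """Models a global base register (assumed fetch-and-add semantics)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def ps(self, increment):
        # Atomically return the old base value and add the increment.
        with self._lock:
            old = self._value
            self._value += increment
            return old


def thread_body(i, A, g1, observed):
    A[i] = 100 + i               # sw: write local value to A[i]
    t2 = g1.ps(1)                # psi: gatekeeper prefix-sum on base g1
    if t2 == 0:                  # beq: the first thread to arrive is done
        return
    partner = i ^ 1              # xori: even i -> i+1, odd i -> i-1
    observed[i] = A[partner]     # lw: partner wrote A[partner] before its PS


def run_pair():
    A = [None, None]
    observed = {}
    g1 = PrefixSumRegister(0)
    workers = [threading.Thread(target=thread_body, args=(i, A, g1, observed))
               for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return A, observed
```

Whichever thread performs its prefix-sum second is the only one that reads, and the value it reads was written before the other thread's prefix-sum, so the read never observes an unwritten slot.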



















































Figure 4: An XMT-M Processor. Communication paths in this diagram require multiple cycles.

results. Clusters have a point-to-point connection to the global prefix-sum unit to which they send their prefix-sum requests for processing.

All prefix-sum requests with a common base that arrive at the central prefix-sum unit in the same time slice are processed simultaneously, and the results are sent out simultaneously. The whole process is pipelined, so that a number of different prefix-sum operations can be in flight at the same time. The base register contents and register identifier are fired out on a shared bus as soon as the first request for a particular base arrives. The I/F (interface) unit in each cluster listens on the shared bus for these broadcasts.

First, we notice that a prefix-sum request that contributes zero to the base is equivalent to a read of the base, with no specified ordering. Thus, these requests can be handled locally at the cluster by reading a local copy of the base register. Non-zero prefix-sum requests from the same cluster using the same base can be combined into a single request to the global prefix-sum unit. The global unit groups these cluster requests into batches of the same base, performs a prefix-sum across the batch, and broadcasts the results on the prefix-sum bus. Each cluster listens on the bus, and derives its range of values within that cycle's batch. The cluster also updates its local copy of the base register. Each cluster assigns unique values from its prefix-sum range to its local prefix-sum requests. (A prefix-sum hardware implementation that avoids any serialization is described in [28]. We also note that in the case of 1-bit prefix-sum requests, the number of wires necessary for the shared bus is $c \log_2(n/c) + \log_2(c!)$, where c is the number of clusters and n is the number of TCUs.)

3.2 Spawn/Join Implementation

The XMT-M processor can be in one of two modes at any given time: serial or parallel. In the serial mode, only the first TCU is active. Execution of a Spawn instruction causes a transition from the serial mode into the parallel mode. The spawning is performed by the Spawn Control Unit (SCU). The SCU does two main things: (i) it activates the TCUs whenever a Spawn is executed, and (ii) it discovers when all virtual threads have been executed so that the processor can resume serial mode. The SCU broadcasts the Spawn command and the number of threads (n) on a bus that connects to the TCUs. Each TCU, upon receiving the Spawn command, executes the flowchart given in Figure 5.
[Figure 5 flowchart: Start upon receiving Spawn(n) command; Set t_id = TCU_id; Is t_id > n? If not, Execute thread t_id, then Do PS to get new t_id and repeat the test; if so, Initiate Join.]

Figure 5: Flowchart depicting activities of a TCU upon receiving a Spawn command

If the number of virtual threads is less than the number of TCUs, t, then all of the threads are initiated at the same time. If the number of virtual threads is more than the number of TCUs, then the first t virtual threads are initiated in the t TCUs. The remaining threads are initiated as and when individual TCUs become free. Notice that it would have been straightforward to spawn the second set of threads (threads with label ≥ t) after all of the TCUs have completed the execution of the first set of threads assigned to them. However, if some TCUs terminate and wait, while others continue, a "gross violation" of the NBW-FSMs ideal occurs. To alleviate this violation, we have devised a hardware-based scheme that makes use of the independence of order semantics (IOS) between threads. As and when a TCU completes the execution of its thread, it sends a prefix-sum request to the SCU. The SCU performs the prefix-sum operation, and sends to the waiting TCUs a new t_id. Each waiting TCU makes sure that its t_id is less than n before proceeding to execute the thread; otherwise, the TCU initiates a join operation. This type of thread allocation continues until all virtual threads of a Spawn instruction have been initiated.

In the parallel mode, we also take advantage of the SPMD style instruction code in the following way. The code is broadcast on the bus so that all TCUs can simultaneously get the code to be executed.

3.3 Global Register Coordination

It is likely that for global register coordination, we can use the same broadcast data bus as the prefix-sum unit. Because that bus activity depends on the frequency of prefix-sum operations, we can use the extra capacity to broadcast global register values when they are written. Each local register file can keep the values of the global registers for use by the cluster functional units.

Whenever a thread writes to a global register, the new value is written to the local copy of the shared register and then is sent out on the shared bus when the bus becomes available. To maintain coherence, the processor does not restart serial mode after a join until all register writes have been broadcast. Also, if one thread writes to a shared register and another thread (or threads) needs to read that value, those accesses are prioritized by using a gatekeeper prefix-sum. In such a situation, the thread that writes to the register would issue its prefix-sum request only after its write has gone out on the bus and is therefore visible to all the clusters. In this way, a delayed coherence can be maintained across all the global register copies, and the clusters can safely use their local global register values without fear that the register values are incorrect. Note that such a relaxed consistency model is possible among the register files, because the XMT programming model allows it.

3.4 Memory System

The XMT framework supports a shared memory model; that is, all concurrent threads see a common, shared memory address space. We can think of two alternatives for implementing the top portion of the memory hierarchy for such a system: shared cache and distributed caches. The shared cache implementation has the advantage of not having to deal with issues such as cache coherency. However, its access time is likely to be higher, because of interconnect demands. The distributed cache implementation permits each TCU or cluster to have a local cache, thereby providing faster access to the top portion of the memory hierarchy. However, it has to deal with the problem of maintaining coherency between the multiple caches.
Further research is needed to determine which of the two options is best for the XMT framework for different technologies. For instance, if a high miss rate exists in the local caches, then each memory access is likely to be comparable in duration to an access of a shared cache. In that case, it makes sense to avoid the problem of cache coherence in the design, and implement only a single shared cache. The emphasis in this paper on a decentralized architecture and on existing technology led us in the direction of local caches.

The memory system we investigate in this paper for XMT-M is as follows. Each cluster has a small level-1 cache. Multiple level-1 caches are connected together by an interconnect. The next level of the memory hierarchy consists of a large shared cache (level-2 cache), which connects to main memory. A large number of pipelined memory requests can be pending at a time, as in the Tera processor [3]. The idea is to use an overabundance of memory requests at each level of the memory hierarchy to hide memory latency. The interconnect used to tie the caches can be a crossbar, a shared bus, a ring, etc.

3.4.1 Cache Coherence

When a shared memory model is implemented in a distributed manner, maintaining a consistent view of the memory for all the processing elements is vital. The use of distributed caches necessitates implementing protocols for maintaining cache coherence. Cache coherence protocols come under two broad categories: invalidate-based and update-based. In a write-back write-invalidate coherence scheme, a processor doing a write waits to get access to the shared bus. It then broadcasts the write address. The other caches snoop the bus and invalidate that block if present. The writing processor then has exclusive access to that cache block and keeps writing to the copy in its local cache. Another processor reading that same block will cause a read miss at its cache, and after getting access to the bus will send a read request to memory. Because the processor with exclusive access to the block is snooping the bus, it will gain access to the bus, send the updated version of the block, and abort the access to memory. This type of protocol works well when there is not much data sharing.
In the analogous case in an update-based protocol, a processor sends a write request to its local cache, and the request gets broadcast to the local caches of all processors. Upon receiving the update, the local caches update the relevant block if present. Any processor that needs to read a memory location will get the value from its local cache or from the next level of the memory hierarchy. This type of protocol generally results in high bandwidth requirements, because it uses write-through caches.
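The update-based scheme described above can be sketched in a few lines. This is a deliberately simplified model, not the XMT-M memory system: there is no block granularity or interconnect timing, and `Cache`, `UpdateBus`, and `drain` are hypothetical names. As a further assumption of the sketch, writes are queued and delivered only when `drain` is called, which makes it visible that other caches can hold a stale value until the broadcast has been performed.

```python
from collections import deque


class UpdateBus:
    """Hypothetical interconnect that broadcasts writes to every cache."""

    def __init__(self):
        self.caches = []
        self.memory = {}          # next level of the memory hierarchy
        self.pending = deque()    # writes not yet globally performed

    def drain(self):
        """Deliver all queued writes: write through to memory and update
        every other cache that currently holds the address."""
        while self.pending:
            writer, addr, value = self.pending.popleft()
            self.memory[addr] = value
            for cache in self.caches:
                if cache is not writer and addr in cache.lines:
                    cache.lines[addr] = value


class Cache:
    """One local cache, modelled as an address -> value dictionary."""

    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:    # miss: fetch from the next level
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value                       # update local copy
        self.bus.pending.append((self, addr, value))   # broadcast later
```

In this model a writer can keep executing after `write`, and calling `drain` corresponds to waiting until the write has reached all other caches.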
Case 1: (Serial) thread X writes to location A; Spawn; parallel thread Y reads from location A.
Case 2: Parallel thread X writes to location A; Join; (serial or parallel) thread Y reads from location A.
Case 3: Parallel thread X writes to location A; Prefix-sum (gatekeeper); parallel thread Y reads from location A.

Figure 6: Cache Coherence Hazards for XMT

For the XMT-M memory system, we chose an update-based protocol because the XMT memory consistency model (by virtue of its independence of order semantics (IOS)) permits a relaxed update policy. Therefore, after a TCU sends a write request to its local cache, it can generally continue executing its thread without waiting for the write to be globally performed. However, there are three cases allowed by the programming model where this write-and-continue policy must be modified. These three "coherence hazards" are summarized in Figure 6. The first case is that writes from the serial thread must complete before the system initiates a spawn. The second case is that writes from spawned threads must complete before the system restarts the serial thread. The third case occurs when a prefix-sum instruction and a branch based on the outcome of that prefix-sum instruction separate a read from a write. Thus, the writing TCU will stall executing its (1) spawn, (2) join, or (3) gatekeeper prefix-sum operation until it is sure that its write request has reached all other caches. The programming model guarantees that no thread will attempt to read the updated data before the next spawn, join, or gatekeeper prefix-sum operation executes. Thus, the caches are allowed to become inconsistent with each other for extended periods of time. This protocol may occasionally stall the writing thread; the stall time depends on how long it takes to broadcast a write to all the caches. With a ring-based interconnect, this stall time would be the time it takes for a write request to go around the ring plus the time it takes for the local ring segment to become available, but even in that case no busy-waits occur for the remaining threads.

4 Experimental Evaluation

The previous section presented a detailed description of a decentralized XMT microarchitecture. Next, we present a detailed simulation-based performance evaluation of XMT-M.

4.1 Experimental Framework

We have developed an XMT-M simulator for evaluating the performance potential of XMT-M and the XMT framework in general. This simulator uses the SimpleScalar ISA, with four new instructions added: a Spawn, a Join, and two Prefix-sums. The register set is also expanded to have both global and local registers. The simulator accepts XMT assembly code, and simulates its execution. All important features of the XMT-M system have been incorporated in the simulator.










Description: Traverse a linked list randomly dispersed through memory and find the sum of the list item data values. This application is not one which we know how to parallelize, so it is implemented with a serial algorithm.
Input sizes: 50 item list spaced in 200 words, 500 in 2K words, 5K in 20K words,

Description: Based on the STREAM benchmark [26], we sequentially read arrays, perform some short calculations on the values, and write the results to another array. Since each iteration of the loop is independent, parallelization of execution is obvious. In the superscalar domain, one approach for speeding up this code is loop unrolling; we do that for the SimpleScalar version.
Input sizes: 50 item array, 500 item

Description: Compacting an array, we take a sparse array and rewrite into a compact form. This application requires keeping a running count of the next available location
uses indirection through another array (irregular memory access).
Input sizes: 50 item array, 500 item array, 5K item array, 250K item array

Description: Find the maximum value of a list. In the serial case, we read through the list, keeping a running maximum. For the parallel case we choose a synchronous max-finding scheme. A balanced binary tree is formed where a node of the tree will have the result of a maximum operation on its two child nodes. The root of the tree will have the maximum of the list. The algorithm proceeds from leaves to root, synchronizing after every level in the tree. The threads are very short and there are log(n) spawn-joins.
Input sizes: 50 item array, 500

Description: Unraveling a linked list of known length which is packed within an array. This is a version of the problem called "list-ranking". This application is useful for managing linked-list free space in OSes [25]. In the serial algorithm, we traverse the list and rewrite it in the proper order. For the XMT version, we use two algorithms: (1) Wyllie's pointer jumping algorithm [18] for the 50 and 500 sized inputs and (2) the no-cut coin-tossing algorithm for the 5K and 250K sized inputs. The work [10] presented discussion of the various list-ranking algorithms on XMT.
Input sizes: 50 item array, 500 item array, 5K item array, 250K item array.

Description: A variant of radix-sort. It sorts integers from a range of values by applying bin-sort in iterations for a smaller range. For speed-up evaluations, one would wish to compare integersort with the fastest serial sorting algorithms, and not only serial radix-sort, as we did; however, the literature implies that for some memory architectures radix-sort is fastest [1], while for others other sorting routines are fastest [23].
Input sizes: 50 item array, 500 item array, 5K item array, 250K item array.

Figure 7: Benchmarks

those parallel programs to XMT code. We recognize the importance of eventually carrying out studies with entire applications.

A compiler-like translation from high-level to optimized assembly code is done manually, for lack of an XMT compiler. To have a fair basis for comparison, the serial versions of the applications are also generated by hand, and are optimized by using techniques such as loop unrolling.

We use small input data sizes to illustrate that the XMT-M processor can achieve better performance for even small input sizes. While comparing the performance against superscalar processors, the metric used is speedup (obtained by dividing the number of execution cycles taken by the superscalar processor by the number of cycles taken by XMT-M).
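The array compaction benchmark listed in Figure 7, which is also the problem of Figure 2, keeps a running count of the next free location in the output array. A minimal sketch of that idea follows, under stated assumptions: it is ordinary Python rather than XMT code, one Python thread stands in for each virtual thread, the prefix-sum is modelled as a lock-protected fetch-and-add, and `Base` and `compact` are hypothetical names.

```python
import threading


class Base:
    """Shared base variable; ps() atomically returns the old value and
    adds the increment (assumed prefix-sum semantics)."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def ps(self, increment):
        with self._lock:
            old = self.value
            self.value += increment
            return old


def compact(A):
    """Copy the non-zero elements of A into B in arbitrary order."""
    n = len(A)
    B = [0] * n
    base = Base(0)                 # global variable shared by all threads

    def virtual_thread(t_id):
        if A[t_id] != 0:           # check if array element is non-zero
            e = base.ps(1)         # prefix-sum yields a unique index e
            B[e] = A[t_id]         # copy element from A to B (index e)

    threads = [threading.Thread(target=virtual_thread, args=(i,))
               for i in range(n)]
    for t in threads:              # plays the role of Spawn(n)
        t.start()
    for t in threads:              # plays the role of the implied Join
        t.join()
    return B[:base.value]
```

Because each prefix-sum returns a distinct index, no two threads write the same slot of `B`, and no thread waits for another; only the order of the compacted elements depends on the thread schedule.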
4.2 XMT-M Performance

In the first set of experiments, we measure the number of cycles taken by different XMT-M configurations to execute the benchmarks, with varying size inputs. The number of cycles taken by the XMT-M configurations are compared against those taken by the default centralized wide-issue superscalar processor. Figure 8 presents the results obtained. The figure consists of 4 diagrams, corresponding to 4 different input sizes. In each diagram, the X-axis represents the benchmarks. For each benchmark, 3 histograms are plotted, one for each XMT-M configuration (a 2-cluster XMT-M, an 8-cluster XMT-M, and a 32-cluster XMT-M). The Y-axis denotes the speedup of the XMT-M configurations over that of the default superscalar configuration. Notice that comparing the IPCs (instructions per cycle) for the two processors will not be meaningful, as they execute different programs.
















[Figure 8 plots: four panels, for input sizes 50, 500, 5K, and 250K. X-axis: Benchmark / Configuration (linkedlist, listsort, integersort, max, stream, arrcomp_d, arrcomp_i).]

Figure 8: XMT-M Speedups relative to Serial Computing

Let us look at the results of Figure 8 closely. For an input size of 50 (cf. the first diagram in Figure 8), the 8-cluster XMT-M configuration performs better than the other two XMT-M configurations. The 2-cluster XMT-M does not have enough parallel resources to harness inter-thread parallelism; and not much inter-thread parallelism is available with an input size of 50 to compensate for the high latencies of the 32-cluster XMT-M. The 8-cluster performs substantially better than the centralized superscalar processor for three of the benchmarks, and performs slightly worse than the centralized superscalar processor for three of the benchmarks.

When the input size is increased to 500 (cf. the second diagram in Figure 8), the 32-cluster XMT-M (despite its increased cross-chip latency) begins outperforming the 8-cluster XMT-M for most of the benchmarks, because of the increased inter-thread parallelism available. Even the 2-cluster XMT-M configuration outperforms the centralized superscalar processor in all but one of the benchmarks.

When the input size is increased beyond 500, the XMT-M configurations continue to harness more parallelism, as might be expected. It is important to point out that these results have to be analyzed in the proper context. The XMT-M configurations are performing in-order execution and single-instruction issue in each TCU. Thus, the TCUs in an XMT-M processor are not performing functions such as branch prediction, dynamic scheduling, register renaming, memory address disambiguation, etc. In short, the XMT-M TCUs are not exploiting any intra-thread parallelism, except for the overlap obtained in a 6-stage pipeline. This is very important. First, it suggests that our speedup results are conservative. Second, in the future as clock cycles continue to decrease, it becomes more and more difficult to perform centralized tasks such as dynamic scheduling and branch prediction within a limited cycle time. We would also like to point out that the XMT framework does not preclude the use of conventional techniques to extract intra-thread parallelism.

4.3 Effect of Cross-chip Communication Latency on XMT-M Performance

Our next set of experiments focuses on studying the effect of global (cross-chip) communication latency on XMT-M performance. Cross-chip communication refers to prefix-sum operations, spawn/join operations, global register accesses, and L1 cache accesses. Three different latencies were modeled for one-way inter-cluster communication: 1, 4, and 16 clock cycles.
This corresponds to latencies of 2, 8, and 32 cycles, respectively, for functions that require two-way communication.

Figure 9 gives the results obtained in this study. Again, four diagrams are given, corresponding to four different input sizes. Each diagram records three histogram bars for each benchmark. Thus, the X-axis represents the 7 benchmarks and, for each benchmark, the three inter-cluster communication latencies. The Y-axis represents the normalized performance; normalization is done with respect to the performance of the unit-latency configuration. That is, the height of a bar is the execution time of the benchmark for a latency of 1 cycle divided by its execution time for the corresponding latency.

The results of Figure 9 indicate that for all input sizes considered, the performance of XMT-M does not change when the inter-cluster communication latency is increased from 1 to 4 cycles. Even when this latency is increased to 16 cycles, XMT-M's performance remains the same except for the two arrcomp benchmarks, for which there is a drop of about 5%. These results indicate that XMT-M is somewhat resilient to increased cross-chip interconnect delays.

5 Related Work

First of all, we should emphasize that we have not "invented parallel computing" with XMT. We have tried to build on available technologies to the extent possible.

The relaxation in the synchrony of PRAM algorithms is related to the works of [7] and [15] on asynchronous PRAMs. The high-level language we used for XMT builds on Fork95 and its previous versions developed at the U. Saarbrücken, Germany; see [20] and [21]. Basic insights concerning the use of a prefix-sum-like primitive go back to the Fetch-and-Add [16] or Fetch-and-Increment [14] primitives (cf. [2]). Insights concerning nested Spawns rely on the work of Guy Blelloch's group [4], [5] and others.
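The normalization used for the bars of Figure 9 (Section 4.3) can be sketched as follows. This is a minimal Python sketch and not part of the simulator used in this paper; the function names and the cycle counts are hypothetical and serve only to illustrate the arithmetic.

```python
def round_trip_latency(one_way_cycles):
    """Two-way communication costs twice the one-way inter-cluster latency."""
    return 2 * one_way_cycles


def normalized_performance(cycles_by_latency, baseline_latency=1):
    """Bar heights for one benchmark: T(baseline latency) / T(latency).

    cycles_by_latency maps a one-way latency (in cycles) to the execution
    time (in cycles) measured with that latency.
    """
    baseline = cycles_by_latency[baseline_latency]
    return {lat: baseline / cycles for lat, cycles in cycles_by_latency.items()}


# Hypothetical execution times (NOT measured data) for a single benchmark.
example_cycles = {1: 1000, 4: 1000, 16: 1050}
bars = normalized_performance(example_cycles)
two_way = [round_trip_latency(lat) for lat in (1, 4, 16)]
```

With these made-up numbers, the bars for latencies 1 and 4 have height 1.0, and the bar for latency 16 is slightly below 1.0, which is the shape a small slowdown at the highest latency would produce.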
[Figure 9: four bar-chart panels, for Input Size 50, 500, 5K, and 250K. X-axis in each panel: Benchmark / Cycles Latency, over linkedlist, listsort, integersort, max, stream, arrcomp_d, and arrcomp_i.]
Figure 9: Effect of Cross-Chip (Global) Communication Latency on XMT-M Performance

The U. Wisconsin Multiscalar project [13] and the U. Washington simultaneous multi-threading (SMT) project [27], with their use of multiple program counters, and the computer architecture literature on multi-threading (see, for instance, [17]) have also been very useful. However, the way XMT proposes to attack the completion time of a single task, namely its reliance on PRAM algorithms, is central to XMT and makes it drastically different from these approaches. Our experience has been that some knowledge of PRAM algorithms is a necessary condition for appreciating how big the difference is.

Simultaneous multi-threading [27] improves throughput by issuing instructions from several concurrently executing threads to multiple functional units each cycle. However, SMT does not appear to contribute towards designing a machine that looks to the programmer like a PRAM [30].

The similarity of XMT to the pioneering Tera Multi-Threaded architecture [3] is very limited. Tera focuses on supporting a plurality of threads by constantly switching among threads; it does not issue instructions from more than one thread in the same cycle, which, in turn, would limit the relevance of our multi-operand and Spawn instructions for that architecture. Tera's multiprocessor was engineered to hide long latencies to memories for big applications. Its design aims at 256 processors each running 128 threads. XMT, on the other hand, is designed to provide competitive performance for even small input
sizes, which makes it more practical in general, and for desktop applications in particular. To explain this, we observe that higher bandwidth and lower latencies, which are expected from on-chip designs in the billion-transistor era, will allow a parallel algorithm to become competitive with its serial counterpart for a much smaller input size than for MPP paradigms such as Tera. Why does this observation hold true? Parallelism provided by a parallel algorithm increases as the input size increases. When the algorithm is implemented on an MPP, part of this algorithm parallelism is used for hiding system deficiencies (such as latency); when latency is a minor problem, more algorithm parallelism can be applied directly to speed-ups. So, parallel algorithms can become competitive for much smaller inputs.

Another strong argument in favor of XMT is that explicit parallelism is anticipated to allow simpler hardware. Explicit ILP generally means "static" extraction of ILP, which allows for simpler hardware than dynamic (i.e., by hardware) extraction of ILP. Citing simpler hardware as the main motivation, industry has demonstrated its interest in explicit instruction-level parallelism (ILP) by investing heavily in it.

Lee and DeVries investigate single-chip vector microprocessors as a way to exploit more parallelism with less hardware and reduced instruction bandwidth [24]. They expose more instruction-level parallelism to the processor core by moving to a more explicitly parallel programming model. Similarly, the IRAM project uses vector processing to increase parallelism; the project also aims to fully exploit the bandwidth possibilities of integrating DRAM onto the microprocessor [22]. Complementing this research, out-of-order, multi-threaded, and decoupled vector architectures have been proposed by Espasa, Valero, and Smith [11], [12] as methods to improve the performance of vector processing.
The XMT architecture has much in common with the recent vector approaches, as it is solving many of the same problems in much the same way; the primary difference is in the use of an SPMD-style parallel algorithm programming model instead of a vector model.

6 Concluding Remarks

The "von Neumann architecture" has provided a principled engineering design point for computing systems for over half a century. It has been very robust and resilient, withstanding dramatic changes in all the relevant technologies. Parallel computing has long been considered an antithesis to the von Neumann architecture. However, the main success thus far in using parallelism for reducing completion time of a single general-purpose task has been accomplished by what we called earlier the ILP bridgehead, yet another von Neumann architecture! Yet technology changes due to the evolving sub-micron technology are making it increasingly difficult to extract parallelism by conventional ILP techniques.

In the past 20 years, the increased transistor budget of processor chips was applied to close the gap between microprocessors and high-end CPUs. As pointed out in [9], today this gap is fully closed and using the additional transistor budget for uniprocessors is well beyond the point of diminishing returns. It is becoming increasingly important, therefore, to tap into explicit parallel processing. The XMT approach tries to fill this gap; the "no-busy-wait finite-state-machine" principle, with its implied independence of order semantics (IOS), is designed to directly address the profound weakness of continued evolutionary development of the von Neumann approach.

The paper [29] introduced the broad XMT framework through a few "bridging models" (or internal interfaces). This paper contributes a first microarchitecture implementation, a significant step for the overall XMT effort.
The research area of XMT (and SPMD in general) as it connects with parallel algorithms offers an interesting design point: these algorithms map well to hardware support for multiple independent concurrent threads. This hardware support, in turn, maps well to the limitations of future sub-micron technologies (interconnect delays dominating the performance), which necessitate most of the
communication to be localized. The XMT-M implementation highlights an important strength of XMT: it lends itself to decentralized implementation with almost no degradation in performance. An XMT-M processor can comprise numerous simple, independent, identical thread execution units, and the nature of the programming paradigm is such that inter-thread communication is highly structured and regular. This paper shows that such a microarchitecture can withstand high cross-chip communication delays. It can also take advantage of a large number of functional units, as opposed to traditional superscalar designs, which typically cannot make use of more than one or two dozen functional units.

XMT-M integrates several well-understood and widely-used programming primitives that are usually implemented in software; the novelty of the microarchitecture is the integration of these primitives in a single-chip environment, which offers increased communication bandwidth and significantly decreased communication latency compared to more traditional parallel architectures. The integrated primitives are the spawn-join mechanism, which enables parallelism by initiating and terminating the concurrent execution of multiple threads of control, and the prefix-sum operation, which is used to coordinate the threads.

Finally, we note that XMT-M has achieved significant speedups by extracting only inter-thread parallelism (and no intra-thread parallelism), and that it can get additional benefit from extracting intra-thread parallelism with the use of standard ILP techniques.

References

[1] R. C. Agarwal, "A Super Scalar Sort Algorithm for RISC Processors," Proc. ACM SIGMOD, 1996.
[2] G. S. Almasi and A. Gottlieb, Highly Parallel Computing, Second Edition. Benjamin/Cummings, 1994.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, "The Tera Computer System," Proc. International Conference on Supercomputing, 1990.
[4] G. E. Blelloch, Vector Models for Data-Parallel Computing.
MIT Press, 1990.
[5] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha, "Implementation of a portable nested data-parallel language," Proc. 4th ACM PPOPP, pp. 102-111, 1993.
[6] D. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Tech. Report CS-1342, University of Wisconsin-Madison, June 1997.
[7] R. Cole and O. Zajicek, "The APRAM: incorporating asynchrony into the PRAM model," Proc. 1st ACM-SPAA, pp. 169-178, 1989.
[8] D. E. Culler and J. P. Singh, Parallel Computer Architecture. Morgan Kaufmann, 1999.
[9] W. J. Dally and S. Lacy, "VLSI Architecture: Past, Present, and Future," Proc. Adv. Research in VLSI Conf., 1999.
[10] S. Dascal and U. Vishkin, "Experiments with List Ranking on Explicit Multi-Threaded (XMT) Instruction Parallelism," Proc. 3rd Workshop on Algorithms Engineering (WAE-99), July 1999, London, U.K. Downloadable from http://www.umiacs.umd.edu/~vishkin/XMT/.
[11] R. Espasa and M. Valero, "Multithreaded Vector Architectures," Proc. Third International Symposium on High Performance Computer Architecture (HPCA-3), pp. 237-248, 1997.
[12] R. Espasa, M. Valero, and J. E. Smith, "Out-of-order Vector Architectures," Proc. 30th Annual International Symposium on Microarchitecture (MICRO-30), pp. 160-170, 1997.
[13] M. Franklin, "The Multiscalar Architecture," Ph.D. thesis. Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.
[14] E. Freudenthal and A. Gottlieb, "Process Coordination with Fetch-and-Increment," Proc. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), 1991.
[15] P. B. Gibbons, "A more practical PRAM algorithm," Proc. 1st ACM-SPAA, pp. 158-168, 1989.
[16] A. Gottlieb, B. Lubachevsky, and L. Rudolph, "Basic techniques for the efficient coordination of large numbers of cooperating sequential processors," ACM Transactions on Programming Languages and Systems 5,2, pp. 105-111, 1983.
[17] R. A. Iannucci, G. R. Gao, R. H. Halstead, and B. Smith (editors), Multithreaded Computer Architecture - A Summary of the State of the Art. Kluwer, Boston, MA, 1994.
[18] J. Ja'Ja', An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA, 1992.
[19] R. Joy and K. Kennedy, President's Information Technology Advisory Committee (PITAC) - Interim Report to the President. National Coordination Office for Computing, Information and Communication, 4201 Wilson Blvd, Suite 690, Arlington, VA 22230, August 10, 1998.
[20] C. W. Kessler and H. Seidl, "Integrating synchronous and asynchronous paradigms: the Fork95 parallel programming language," Technical report no. 95-05, Fachbereich 4 Informatik, Univ. Trier, D-54286 Trier, Germany, 1995.
[21] C. W. Kessler, "Quick reference guides: (i) Fork95, and (ii) SB-PRAM: Instruction set simulator system software," Universität Trier, FB IV - Informatik, D-54286 Trier, Germany, May 1996.
[22] C. Kozyrakis, et al., "Scalable Processors in the Billion-Transistor Era: IRAM," IEEE Computer, Vol. 30, pp. 75-78, September 1997.
[23] A. LaMarca and R. E. Ladner, "The Influence of Caches on the Performance of Sorting," Proc. 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 370-379, 1997.
[24] C. G. Lee and D. J. DeVries, "Initial Results on the Performance and Cost of Vector Microprocessors," Proc. 30th Annual International Symposium on Microarchitecture (MICRO-30), pp. 171-182, 1997.
[25] A. Silberschatz and P. B. Galvin, Operating System Concepts, Fifth Edition. Addison Wesley Longman, Inc., 1998, p. 384.
[26] "STREAM: Sustainable Memory Bandwidth in High Performance Computers," The University of Virginia, http://www.cs.virginia.edu/stream/.
[27] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. 23rd Annual International Symposium on Computer Architecture (ISCA), pp. 191-202, 1996.
[28] U.
Vishkin, "From Algorithm Parallelism to Instruction-Level Parallelism: An Encode-Decode Chain Using Prefix-sum," Proc. 9th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 260-271, 1997.
[29] U. Vishkin, S. Dascal, E. Berkovich, and J. Nuzman, "Explicit Multi-threaded (XMT) Bridging Models for Instruction Parallelism," Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 140-151, 1998. See also the XMT home page http://www.umiacs.umd.edu/~vishkin/XMT/
[30] Personal communication with H. Levy, August 1997.
[31] "The National Technology Roadmap for Semiconductors," Semiconductor Industry Association, 1997.
