Abstract-We consider the placement problem as part of the CAD flow for massively parallel processor arrays (MPPAs). In contrast to traditional placers, which operate on a workstation with one or several cores and are able to take advantage of parallelism only to a limited degree, we investigate running the placer on the target architecture itself. As the number of processor elements (PEs) in such a device scales, so too does the computational power available to the placer. This natural scaling helps avoid the long runtimes that afflict FPGA flows.
I. INTRODUCTION
As silicon geometries shrink, power and design complexity become first-class design constraints. As a result, massively parallel processor arrays (MPPAs) have become a research focus as single-core processor performance ceases to scale aggressively. In contrast to single-core designs deriving their performance from advanced control logic (e.g. branch prediction, out-of-order execution, transactional memory, vector units, etc.), MPPAs feature hundreds or thousands of relatively simple cores arranged in a single-chip array.
In addition to academic and affiliated research projects (e.g. PiCoGA [1] and PACT XPP [2]), numerous commercial ventures (e.g. Nethra/Ambric, Tilera, Picochip, IntellaSys) have begun producing such devices. The number of cores they integrate is expected to scale rapidly.
Effective synthesis of programs for MPPAs remains a major challenge. Different approaches range from explicitly parallel programming models (such as the approach used by Nethra/Ambric, in which numerous parallel programs are hand-coded in a Java-like language) to automatic synthesis and parallelization of code (e.g. [3], which synthesizes parallel programs from Verilog source code). In most of these models, once the user program has been synthesized into a set of parallel processes, these processes must be placed and routed in a sequence analogous to traditional CAD flows for FPGAs. To greatly oversimplify, the major difference between MPPA and FPGA flows is the larger size and greater capabilities of each element in the architecture, and the correspondingly lower number of elements.
In this paper, we consider fast placement algorithms for MPPAs. Since the problem is similar to FPGA placement, any placement algorithm suitable for FPGA flows may readily be adapted to MPPAs. Moreover, since the number of cores in such an architecture is several orders of magnitude lower than the number of placement elements (i.e. clusters) in an FPGA, such an approach would furnish a CAD flow with greatly reduced execution times. In order to shorten placement time further, we investigate a distributed placement algorithm intended to run on the MPPA itself. We do so for several reasons:
• We prefer to look to compilers (instead of FPGA flows) as a barometer for reasonable synthesis times,
• As MPPA sizes scale upwards, placement complexity may again become a significant problem,
• Fast placement will motivate research into more intelligent tools (e.g. combined placement and routing), and
• A distributed approach in which the target silicon performs its own placement is interesting in its own right.
In the following section, we review previous research into distributed placement algorithms. Then, we introduce a distributed simulated annealing algorithm suitable for running on an MPPA.
II. BACKGROUND
In this section, we review several approaches to distributed placement. We restrict our focus exclusively to approaches based on Simulated Annealing (SA), since it is an established placement technique with a significant history in the FPGA community. There are few published placement algorithms specifically targeting MPPA flows, and especially for our purposes, SA's strength as a general technique that may easily be extended to arbitrary placement constraints (e.g. to manage any architectural quirks of a particular MPPA) is desirable. A familiarity with simulated annealing is assumed; for an overview, please consult the references.
Several early works considered distributed simulated annealing via shared-memory and message-passing multiprocessors.
Fig. 2. Architecture of a single PE from [7]. Note the reliance on content-associative memory (CAMs) for position look-ups.
Fig. 3. Top-level architecture of MPPA in [3].
The choice of this particular architecture is arbitrary, in a sense: subject to per-node memory requirements (which are examined below), this investigation is equally relevant to other MPPA architectures.
In [4], placement cells are divided dynamically between processors. Each processor may only select and move cells it 'owns.' Runtime speedups of 3.3 (using 4 processors) and 6.4 (with 8 processors) were reported. In [5], annealing is parallelized differently during two distinct phases: in the high-temperature phase, where the probability of accepting an unfavourable move is substantial, each processor investigates a completely distinct placement. After these placements have been independently refined, the best is selected for the low-temperature phase, during which each processor is assigned a geometric partition of the chip. Each processor then anneals this partition, swapping nodes with the processors working on adjacent partitions. Speedups are up to 4.3 on a 5-processor system, and are projected to reach 7.1 on a 10-processor system. These early works and others (e.g. [6]) are characterized by the use of relatively few processors, and do not investigate scaling to hundreds or thousands of processors.
More recently, a distributed placement algorithm for FPGAs which runs on an FPGA accelerator has been proposed [7]. In that work, placement occurs on a systolic architecture in which each placement node (i.e. a cluster) is assigned its own processor, and is permitted to communicate only with its four nearest neighbours. The topology and detailed architecture of this approach are shown respectively in Figs. 1 and 2. The placement architecture was described in an HDL and synthesized for FPGAs; as each processor element (PE) in this architecture requires many more FPGA resources than it manages, this approach required multiple FPGAs to generate placements for a single one.
Fig. 1. Top-level architecture of distributed SA from [7]. Each square represents a processing element in the systolic array. Nearest-neighbour links are shown using solid arrows; the position update chain (which furnishes PEs with estimates of block locations) is dashed.
In the following section, we adopt a similar approach to placement on MPPAs. In doing so, we find that each core is suitably powerful to manage placement of a block its own size.
III. PROPOSED ALGORITHM
Figs. 3 and 4 show the high-level and detailed architecture of the MPPA described in [3]. This architecture consists of an array of identical processor elements (PEs) and a local routing fabric. Each PE is a simple, 32-bit RISC-like core executing a program stored in local memory. Programs for each PE, as well as a static routing schedule for inter-PE communications, are synthesized from Verilog RTL.
We propose a simulated annealing algorithm with the following characteristics:
• Each PE is assumed to be a traditional RISC-like core, with a moderate amount of local memory and without specialized structures (e.g. CAMs),
• The computational architecture (i.e. the resources used to compute placement) is structured identically to the placement problem. In other words, an array of size (x, y) is used to route a netlist involving at most x × y nodes,
• The SA implementation is conventional (i.e. it includes an exponential function to relate probability of swap acceptance, temperature, and difference in cost), and
• Communication between PEs is restricted to their immediate neighbourhood (which we will define below).
In [7], the number of logic resources used in an FPGA to compute placement of a single cluster was much larger than the cluster itself. In contrast, each PE in an MPPA is capable enough to handle placement of an element its own size. Thus, the placement problem scales exactly with the architecture. The high-level pseudocode is similar:
    generate_random_placement()
    for interval in 0 to TMAX do
        for each PE do
            for n=1 to 'updates' do
                Update position chain
            end for
            for n=1 to 'swaps' do
                Consider swaps with each PE in our neighbourhood
            end for
        end for
    end for
This algorithm operates deterministically, and the placement results are reproducible. The second 'for' loop (which loops over each PE) is done in parallel, i.e. the loop's contents define the program which runs independently on each PE.
In the following subsections, we provide some details on each of the steps. First, we describe the data structures used to track annealing and qualitatively describe the memory requirements of the algorithm.
A. Data Structures
We use the four structures shown in Fig. 5 to track PE contents:
pbm  The Placement-to-Block Map maps each PE's (x, y) location to the ID of the block it contains, or a token if the PE is unoccupied.
bnm  The Block-to-Net Map maps each block ID to the IDs of every net with which it connects.
nbm  The Net-to-Block Map maps each net ID to the IDs of every block with which it connects.
bpm  The Block-to-Placement Map maps each block ID to the (x, y) location of the PE it currently occupies.
These same structures are applicable to an ordinary annealing implementation. We use forward and reverse structures in ordinary RAM to avoid either costly list searches or special structures such as associative memories that would not be available in a generic MPPA. Of these structures, the net-to-block mapping (nbm) and the block-to-net mapping (bnm) encapsulate the block and net connections of the desired routing and are static throughout the annealing process. The other two (the block-to-placement map or bpm, and the placement-to-block map or pbm) link blocks with their (estimated) locations in the processor array, and are dynamic. Moreover, each PE's knowledge of blocks' locations is only approximate; its bpm and pbm structures are local and not synchronized with other PEs.
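As a rough illustration of how these four maps relate, the sketch below builds them for a tiny, made-up netlist using plain Python containers. The names `PlacementState`, `EMPTY`, `build_state`, and `swap_locations`, as well as the example netlist, are assumptions introduced for illustration only; the text above does not prescribe an encoding.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Loc = Tuple[int, int]
EMPTY = -1  # hypothetical token marking an unoccupied PE


@dataclass
class PlacementState:
    pbm: Dict[Loc, int]        # placement-to-block: (x, y) -> block ID or EMPTY
    bpm: Dict[int, Loc]        # block-to-placement: block ID -> (x, y)
    bnm: Dict[int, List[int]]  # block-to-net: block ID -> net IDs (static)
    nbm: Dict[int, List[int]]  # net-to-block: net ID -> block IDs (static)

    def swap_locations(self, a: Loc, b: Loc) -> None:
        """Exchange the contents of two PEs, keeping pbm and bpm consistent."""
        block_a, block_b = self.pbm[a], self.pbm[b]
        self.pbm[a], self.pbm[b] = block_b, block_a
        if block_a != EMPTY:
            self.bpm[block_a] = b
        if block_b != EMPTY:
            self.bpm[block_b] = a


def build_state(width: int, height: int, nbm: Dict[int, List[int]],
                initial: Dict[int, Loc]) -> PlacementState:
    """Derive the reverse maps (bnm, pbm) from the forward ones."""
    bnm: Dict[int, List[int]] = {blk: [] for blk in initial}
    for net, blocks in nbm.items():
        for blk in blocks:
            bnm[blk].append(net)
    pbm = {(x, y): EMPTY for x in range(width) for y in range(height)}
    for blk, loc in initial.items():
        pbm[loc] = blk
    return PlacementState(pbm=pbm, bpm=dict(initial), bnm=bnm, nbm=nbm)


# Three blocks and two nets on a 2 x 2 array; PE (1, 1) starts empty.
state = build_state(2, 2, nbm={0: [0, 1], 1: [1, 2]},
                    initial={0: (0, 0), 1: (1, 0), 2: (0, 1)})
state.swap_locations((0, 0), (1, 1))  # move block 0 into the empty PE
```

Only `pbm` and `bpm` change during the swap, which mirrors the static/dynamic split described above; dictionaries stand in for the flat RAM tables a PE would actually hold.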
Assuming we allocate 16 bits for net and block IDs, and 16 bits for placement (8 bits each for the horizontal and vertical position), each single entry in the pbm and bpm occupies 32 bits. Such an allocation scales to 65536-core processors in a 256 × 256 array. Each entry in the nbm and bnm requires a variable number of 16-bit entries, depending on connectivity. It is unreasonable to store a complete version of any of these structures in each PE (for example, a complete pbm requires 256 × 256 × 2 bytes ≈ 131 kB, which is a trivial amount for a PC but an order of magnitude larger than PE memories in a typical MPPA). Fortunately, a given PE only accesses the entries in its own connectivity structures (the bnm and nbm) that are relevant to the block it contains; other entries are never accessed and need not be stored. A strategy for avoiding full placement maps (the pbm and bpm structures) is described in [7]; this approach does impact the accuracy of each PE's knowledge of its neighbouring blocks, and requires transfers of moderately large data structures during block swaps. In our simulations, we assume complete copies of each structure are available to each PE, and thus model the effects of information staleness (but not the effects of limited per-PE memory).
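The storage estimate above can be reproduced with a few lines of arithmetic. The helper name `full_map_bytes` is hypothetical, and the sketch assumes that only the 16-bit payload is stored per location (the index being implicit in the table position), which appears to be how the 131 kB figure is obtained.

```python
def full_map_bytes(width: int, height: int, bytes_per_entry: int = 2) -> int:
    """Bytes needed for one complete map with one entry per PE location."""
    return width * height * bytes_per_entry


# 8 bits per coordinate gives a 256 x 256 array.
side = 2 ** 8
cores = side * side                      # number of PEs addressable
pbm_bytes = full_map_bytes(side, side)   # one 16-bit block ID per location
print(cores, pbm_bytes, round(pbm_bytes / 1000))
```

Under these assumptions the script reports 65536 cores and 131072 bytes, i.e. roughly 131 kB for a single complete map.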
B. Position Chain Updates
Any distributed annealing algorithm requires some method to ensure all processors have some "image" of the placement state, and that each processor's image is updated in some fashion. It is not necessary that these images be either synchronized or correct, provided they are eventually updated and the impact of stale information is limited.
Here (as in [7]), each PE participates in an "update chain", a ring that passes through each PE once and snakes its way through the topology. At each iteration of the inner loop, each PE passes information along the chain, updating its internal structures with information about block placement. When information about a particular block arrives at the PE containing it, this PE absorbs the stale data and pushes out new information. Using this mechanism, position updates occur with only nearest-neighbour communications.
In the pseudocode for updating each PE, 'local_uq' and 'next_uq' are the input queues for the current and next PE in the update chain. The cost of maintaining the update chain is constant, and largely depends on the queueing or local routing hardware available.
In [7], the update chain takes a zigzag geometric loop through each PE. Here, because PE geometries change with each placement scenario, we have adopted a linear scheme that traverses each row, left-to-right, before jumping to the first column in the following row. This scheme is not necessarily appropriate for fabrication, but allows us to evaluate scenarios in which such a circuit cannot be constructed (i.e. with an odd number of rows or columns). Because simulations do not suggest that stale placement information is a problem, we have not investigated other update-chain policies.
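The update-chain mechanism described in this subsection might be emulated along the following lines. This is a sketch under stated assumptions, not a reproduction of the per-PE update pseudocode: the queue names `local_uq` and `next_uq` follow the text, while the `PE` class, `chain_order`, `step`, the one-entry-per-PE seeding, and the serialized emulation of what would run in parallel are all illustrative choices.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Loc = Tuple[int, int]
Update = Tuple[int, Loc]  # (block ID, reported location)


def chain_order(width: int, height: int) -> List[Loc]:
    """Row-major ring: each row left-to-right, then the next row."""
    return [(x, y) for y in range(height) for x in range(width)]


class PE:
    def __init__(self, loc: Loc, block: int) -> None:
        self.loc = loc
        self.block = block                       # block currently held here
        self.bpm: Dict[int, Loc] = {block: loc}  # local, possibly stale view
        self.local_uq: Deque[Update] = deque()

    def update(self, next_uq: Deque[Update]) -> None:
        """Forward one update, refreshing it if it describes our own block."""
        block, loc = self.local_uq.popleft()
        if block == self.block:
            loc = self.loc          # absorb stale data, emit fresh data
        self.bpm[block] = loc
        next_uq.append((block, loc))


def step(ring: List[PE]) -> None:
    """One chain update on every PE (serialized emulation)."""
    for i, pe in enumerate(ring):
        pe.update(ring[(i + 1) % len(ring)].local_uq)


# 2 x 2 array, block i on the i-th PE of the chain; seed one update per PE.
ring = [PE(loc, blk) for blk, loc in enumerate(chain_order(2, 2))]
for pe in ring:
    pe.local_uq.append((pe.block, pe.loc))
for _ in range(len(ring)):
    step(ring)
```

With one entry circulating per PE, every PE in this toy ring has seen every block's position after as many steps as there are PEs, which illustrates why the staleness of any estimate is bounded by the chain length.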
C. Neighbour-Neighbour Swaps
After the position chain has been updated, each PE considers block swaps with its neighbours. This process occurs in a synchronized fashion in a manner much like square dancing: every PE begins by pairing itself with a neighbouring PE and exchanging swap cost information. Paired PEs agree on a random number (using local, simple pseudo-random number generators) and use this number to agree on whether the swap is accepted. If so, the PEs exchange block IDs and any relevant data structures used to track them. PEs then change partners and start again.
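One way to picture the pairing step is sketched below: both PEs derive the same pseudo-random number from shared inputs, so each can evaluate the acceptance test locally and reach the same verdict. The acceptance rule is the conventional exponential criterion listed among the algorithm's characteristics; the function names, the seeding scheme, the use of Python's `random.Random` in place of a simple hardware generator, and the summing of the two locally computed cost changes are assumptions made for illustration.

```python
import math
import random


def shared_random(seed_a: int, seed_b: int, round_no: int) -> float:
    """Both partners combine the same inputs, so both get the same number."""
    lo, hi = sorted((seed_a, seed_b))
    return random.Random(f"{lo}:{hi}:{round_no}").random()


def accept_swap(delta_cost: float, temperature: float, u: float) -> bool:
    """Conventional SA test: accept if u < exp(-delta_cost / temperature)."""
    if delta_cost <= 0:
        return True  # improving (or neutral) swaps always pass the test
    return u < math.exp(-delta_cost / temperature)


def pair_decision(my_id: int, partner_id: int, my_delta: float,
                  partner_delta: float, temperature: float,
                  round_no: int) -> bool:
    """Decision as computed independently on one PE of the pair."""
    # Assumption: the exchanged swap-cost information is simply summed.
    total_delta = my_delta + partner_delta
    u = shared_random(my_id, partner_id, round_no)
    return accept_swap(total_delta, temperature, u)


# Each PE runs the same computation with the roles reversed.
verdict_a = pair_decision(3, 7, my_delta=2.0, partner_delta=-0.5,
                          temperature=5.0, round_no=0)
verdict_b = pair_decision(7, 3, my_delta=-0.5, partner_delta=2.0,
                          temperature=5.0, round_no=0)
```

Because the two partners feed identical inputs into identical code, no extra message is needed to communicate the outcome; only the cost information and, on acceptance, the block data are exchanged.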
In [7], swaps are considered for the 4 neighbouring PEs on the north, east, south, and west sides. As we will see, the use of these 5-PE neighbourhoods reduces placement quality; accordingly, we investigate two larger neighbourhoods: a 9-PE neighbourhood adds the possibility of swapping each block with PEs to the northeast, southeast, southwest, and northwest; and a 13-PE neighbourhood adds the possibility of swapping each block with PEs two blocks to the north, east, south, and west.
Pseudocode for the swapping phases is as follows:
    for each neighbour do
        ΔC = calculate_swapped_cost() - our_old_cost
        if random() < e^(-ΔC/T) then
            swap()
        end if
    end for
For each swap, the change in global cost is the change in the half-perimeter bounding box for each net connected to the two blocks undergoing swapping. This requires each PE to track the approximate position of all blocks with which its block is connected. These positions need not be accurate, and each PE may use its own (possibly different) estimate without problems; as the temperature is reduced and fewer swaps are accepted, each PE's position estimates converge to the 'true' state. Although the pseudocode indicates floating-point calculation, a fixed-point implementation could readily be obtained (e.g. by taking the logarithm of both sides, and using a log-weighted pseudorandom number generator).
The computational cost of this stage depends on several factors, including the net-to-net connectivity (over which we iterate in calculate_swapped_cost) and the routing fabric. In our implementation, we assumed no limits on per-PE memory, which reduces the amount of neighbour-neighbour communication during a swap (only the local swap costs and block IDs need to be transferred). If placement or connectivity information at each node is incomplete, these structures must also be swapped, and the neighbour-to-neighbour transfer speed (as well as the amount of PE supervision required) becomes important.
IV. RESULTS
In this section, we present some results from comparing a traditional (sequential) implementation of simulated annealing with our approach.
Both annealing algorithms were evaluated using six benchmarks. Five of these (chem, dir, honda, mcm, and pr) are DSP- or dataflow-style benchmarks taken from [8]. The final benchmark (me) is a motion-estimation algorithm, exhaustively searching for the offset of a 16 × 16-pixel subimage within a 32 × 32-pixel image with the minimal sum of absolute differences (SAD). All benchmarks are described in behavioural Verilog. From there, our CAD flow synthesizes Verilog into a dataflow graph (DFG) using a RISC-like instruction set. Nodes in the DFG are clustered, forming a program for each PE in the MPPA. The placement step requires each instruction cluster (block) to be assigned to a physical PE.
Benchmarks are listed, along with the half-perimeter cost of the best placement, in Table I. Each placement has targeted a 32 × 32 architecture for a total of 1024 cores.
We evaluate the two placers by varying their parameters and investigating the corresponding change in placement quality. As the placement costs of the benchmarks vary wildly, we have normalized each result to the placement cost in Table I; 1.0 indicates placement with excellent quality, and higher numbers indicate worse results. The default parameters used in the experimental results below are shown in Table II.
Since the "swaps" parameter (the number of node swaps considered at each temperature step) has a different meaning for the distributed and traditional placers, we have scaled these parameters to give a consistent total number of swaps for the entire architecture. For the traditional placer, 2048 × 250 = 512,000 swaps are considered at each temperature step. The distributed placer considers only 250 "swap rounds" per core at each temperature step; since each swap round includes a possible swap with each block in its neighbourhood, the overall number of swaps per core for the distributed placer is 250 × 12 ÷ 2 = 1500 (where 12 is the number of neighbours, and the division by 2 reflects pairing of nodes: it requires two nodes to consider a single swap). This scaling permits results for both placers to be shown on the same graph.
We begin by investigating the relationship between placement quality and the number of swaps. With the 5-PE neighbourhoods shown in Fig. 6(a), the placement quality in Fig. 7(a) is frequently over 10% worse than the reference solution and over 5% worse than the traditional placer. It is possible that with only four nearest neighbours to choose from, blocks become "fenced off" or trapped when their neighbours have highly favourable placements. In other words, there is insufficient path diversity for a given block to migrate to a favourable location in the presence of obstacles. A traditional placer, in which blocks are selected randomly and may swap independently of their relative proximity, does not suffer from this syndrome. Circumventing blockages can be difficult, unlikely, or effectively impossible for blocks depending on how content their neighbours are with their current location.
With the 9-PE neighbourhood shown in Fig. 6(b), placement quality varies with the number of swaps as shown in Fig. 7(b). Path diversity (and thus placement quality) improves for most benchmarks. When a large number of swaps per temperature step are permitted, the distributed and traditional placers perform equivalently.
With the 13-PE neighbourhoods shown in Fig. 6(c), placement quality is within 5% of the optimal placement and competitive with the traditional placer. These results, shown in Fig. 7(c), suggest that adequate path diversity exists and that a favourably placed block no longer impedes movement of nearby blocks.
In terms of computational requirements, the distributed placer using 5-PE neighbourhoods considers a total of 250 × 4 ÷ 2 = 500 swaps per temperature step. The 9- and 13-PE neighbourhoods, respectively, require 1000 and 1500 swaps per temperature step, a doubling and tripling of swap effort. Compared to the traditional placer, which performs comparably, each PE in the distributed placer only considers a tiny fraction of the total number of swaps: respectively, 1/1024th, 1/512th, and approximately 1/341st as many (i.e. 500, 1000, and 1500 out of 512,000). This advantage of the distributed placer over the traditional one improves as the size of the MPPA increases.
To convert these figures into concrete units (i.e. placer runtime in seconds), more information about each placer's architecture (clock rate, interconnect bandwidth, local memory) is required. However, it is already clear that the computational requirements of distributed placement are several orders of magnitude lower than the traditional algorithm due to the reduced per-core swaps. (The fast clock and high IPC of modern workstation CPUs, relative to commercially available MPPAs, make up some of this gap.)
Fig. 8 shows the normalized placement costs as the α parameter (which controls the rate of temperature decrease) is varied from 0.95 to 0.999. Dotted lines represent the traditional placer, and solid lines represent the distributed algorithm. Both placers exhibit better-quality results with higher α, as is expected. Note that with very high α, experimental results actually dip beneath the 1.00 threshold, indicating that our estimated 'best-case' placements were not actually optimal.
Fig. 9 shows the normalized placement cost as the starting temperature (T_0) is increased from 5 to 50. With a sufficiently high T_0, it is difficult to pick a trend out of this graph (except that both placers are extremely sensitive to starting conditions).
Fig. 10 shows the implementations' sensitivities to the stopping temperature T_stop. The distributed placer is able to produce excellent placement results with a slightly higher T_stop than the traditional one, further increasing its computational advantage.
The cause is straightforward: when T_stop is sufficiently low, non-greedy swaps effectively cease to occur in both placers. However, in the traditional placer, swaps are considered randomly and there is no guarantee that every advantageous swap is actually evaluated. Adding several temperature steps with negligible probability of accepting non-greedy swaps permits more greedy swaps to be considered. These extra temperature steps are not required by the distributed placer.
Fig. 11 shows the normalized placement cost as the 'updates' parameter is varied from 1 to 19. This parameter determines the quality of each block's estimates of its neighbours' positions. Unless otherwise stated, the simulation results shown use 13-PE neighbourhoods and 20 updates per swap round, suggesting just over 1 position-chain update per swap. Fig. 11 suggests this is an adequate balance between information staleness and extra data-structure maintenance.
V. CONCLUSIONS
In this paper, we explored the potential for MPPAs to perform self-hosted placement, i.e. for such a device to be used as part of its own CAD flow. We showed how a distributed annealing algorithm can furnish placements with comparable quality to ordinary simulated annealing, with a number of iterations that is several orders of magnitude lower, and with an architecture that scales naturally with the number of elements being placed. Overall, the distributed algorithm was able to provide competitive results with a vastly reduced workload per PE. Performance was within 5% of the best-case placements obtained, and competitive with a traditional placer using a practical schedule, but requiring orders of magnitude less calculation per core. (At the operating point shown in Table II, the per-PE placement program required approximately 1/341st as many swaps per temperature step, i.e. 1500 versus 512,000, compared to the traditional placer.) Not only were the per-core swaps reduced, but the distributed placer's stopping temperature could be raised significantly without negatively impacting performance. In contrast to earlier work on distributed placement [4], [5], [6], the extremely high processor count, as well as the low latency and high interconnect bandwidth of MPPAs, provides encouraging results and excellent parallelism.
We expanded the notion of a PE's 'neighbourhood' (the nearby PEs with which it may swap blocks directly) beyond the four PEs immediately to its north, south, east, and west. We found that larger neighbourhoods were a key to avoiding "stuck blocks" whose placements were suboptimal, but whose movement was impeded by immediate neighbours with satisfactory placements.
We considered the computational cost of the algorithm in generic terms, by exploring the number of swap rounds required for a given level of performance. A natural extension of this work involves mapping it to a specific architecture, allowing a characterization of the placer's performance in concrete units. Such an investigation requires some additional work on memory-efficient packing and fast exchanges of each PE's structures. Several other avenues of investigation include better cost functions (e.g. to improve power consumption or routability).
VI. ACKNOWLEDGEMENTS
This research is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
