This paper presents a new distributed computation model adapted to manycore processors. In this model, the run is spread over the available cores by fork machine instructions produced by the compiler, for example at function calls and loop iterations. This approach contrasts with the current model of computation based on caches and predictors. Cache efficiency relies on data locality and predictor efficiency relies on the reproducibility of the control. Data locality and control reproducibility are less effective when the execution is distributed. The proposed computation model is based on a new core hardware, whose main features are described in this paper. This new core is the building block of a manycore design. The processor automatically parallelizes an execution. It keeps the computation deterministic by constructing a totally ordered trace of the machine instructions run. References are renamed, including memory references, which fixes the communication and synchronization needs. When a datum is referenced, its producer is found in the trace and the reader is synchronized with the writer. This paper shows how a consumer can be located in the same core as its producer, improving parallel locality and parallelization quality. Our deterministic and fine grain distribution of a run on a manycore processor is compared with parallelization based on OS primitives and APIs (e.g. pthread, OpenMP or MPI) and with automatic compiler parallelization of loops. The former implies (i) a high OS overhead, meaning that only coarse grain parallelization is cost-effective, and (ii) a non-deterministic behaviour, meaning that finding the appropriate synchronization to eliminate wrong results is a challenge. The latter is unable to fully parallelize general purpose programs due to structures like functions, complex loops and branches.
Introduction
To run a program in parallel on a multi-core processor today, one must rewrite its C code to add, by hand or through tools, OS parallelizing primitives such as pthread. Even with high level interfaces like OpenMP or MPI, parallelizing is not an easy job for two reasons: (i) if the resulting code is not synchronized enough, the computation is not deterministic and (ii) if it is too synchronized, it is not parallel enough. This is illustrated by figure 1, showing the C implementation of a vector sum reduction, and figure 2, showing its pthread implementation.
unsigned long sum(unsigned long array[], unsigned long n) {
  if (n==1) return array[0];
  if (n==2) return array[0] + array[1];
  return sum(array, n/2) + sum(&array[n/2], n-n/2);
}

Pthread coding is known to be tricky. A first (incorrect) version has no pthread_join synchronization, which illustrates (i). As a result, the run is not deterministic and the returned sum may be incorrect. A second version places the first pthread_join synchronization on line 11, which serializes the second recursive call after the first one and illustrates (ii). As a result, the run is not parallel enough. Only the third version, which places the two joins on lines 16 and 18, is satisfactory.
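As an illustration of the satisfactory placement of the joins, here is a minimal sketch (not the paper's figure 2, so the line numbers cited above do not apply to it). The names sum_task, sum_worker and sum_par are hypothetical. Both threads are created before either is joined, and both are joined before the partial sums are added.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical record carrying the arguments and the result of one half. */
typedef struct {
    const unsigned long *array;
    unsigned long n;
    unsigned long result;
} sum_task;

static void *sum_worker(void *arg);

/* Sum n >= 1 elements, running each half in its own thread. */
static unsigned long sum_par(const unsigned long array[], unsigned long n) {
    assert(n >= 1);
    if (n == 1) return array[0];
    if (n == 2) return array[0] + array[1];

    sum_task left  = { array, n / 2, 0 };
    sum_task right = { &array[n / 2], n - n / 2, 0 };
    pthread_t tl, tr;

    /* Start both halves before waiting for either one (enough parallelism). */
    int left_ok  = pthread_create(&tl, NULL, sum_worker, &left) == 0;
    int right_ok = pthread_create(&tr, NULL, sum_worker, &right) == 0;

    /* Wait for both results before adding them (enough synchronization).
       If a thread could not be created, compute that half in place. */
    if (left_ok) pthread_join(tl, NULL); else sum_worker(&left);
    if (right_ok) pthread_join(tr, NULL); else sum_worker(&right);

    return left.result + right.result;
}

/* Thread entry point: computes the partial sum described by the task. */
static void *sum_worker(void *arg) {
    sum_task *task = arg;
    task->result = sum_par(task->array, task->n);
    return NULL;
}
```

Moving the first pthread_join between the two pthread_create calls would give the serialized variant described in the text, and removing the joins would give the non-deterministic one.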
This example illustrates a first difficulty of hand-made parallelization with non-deterministic OS primitives: achieving exactly the needed synchronization.
Instead of hand-parallelizing the code, the developer may rely on the compiler to automatically vectorize or parallelize loops. However, even a simple example such as the program in figure 1 cannot be automatically parallelized by gcc. Irregular code structures are a second problem. Section 3 explains how loops are parallelized in our approach.
A third problem of parallelization is the memory organization of the data [1]. In the sum example, the array to be summed is declared, for example, as a global variable. Hence, it is centralized while the computation is distributed. As a result, each thread brings the array pieces it needs from DRAM, where the array resides. Caches can help, but neighbour cores are slowed down by memory contention and the cache miss rate is impacted by the array distribution. Moreover, in a program updating shared data, keeping caches coherent requires complex hardware which slows down the average memory access time. Caches and the memory hierarchy, as well as branch predictors, are hardware features that rely on the principle of locality, which in essence is founded on the centralization of data (caches) and fetched code (predictors). When the code and the data are distributed, it is the parallel locality which applies to data. The parallel locality principle is that a consumer should be as close as possible to its producer. In this paper we propose a measure of the producer to consumer distance, which is a way to quantify the parallelization quality.
Section 4 defines the parallel locality and the parallelization quality and shows a programming technique to increase them. Section 5 contains a matrix multiplication program with an improved parallelization quality.

In [5], a parallelizing core hardware is proposed to distribute an execution on a manycore processor. As the parallelization is dynamic, the three problems just mentioned are easier to tackle. The programming model presented in this paper relies on this core hardware. The next section introduces its ISA¹ extension with the fork instruction, its new way of fetching in parallel to build a totally ordered trace and its parallelization of the renaming process, extended to memory locations.

[Surviving fragment of a code figure: typedef struct { long *p; unsigned long i; } ST;]
A Deterministic and Parallel Run of C Code

Deterministic Parallel Execution
The sequential execution of the code on figure 1 is deterministic. The C code fixes a total order that the run follows. The core may run machine instructions out-of-order [14] [6], i.e. in a partial order derived from Read-After-Write (RAW) register dependences, but the total order semantics is preserved.
If a parallel execution is based on a total order, it is deterministic. A totally ordered trace can be built in parallel. For example, in the C sum function, the trace can be built top-down. Such an out-of-order construction of a totally ordered trace is possible because the control flow instructions are only partially ordered. For example, in the sum function, the control flow path in the second recursive call is independent of the control flow path in the first recursive call. This means that the traces of both calls can be built in parallel and connected in order afterwards.
To avoid complications in the trace building, the hardware in [5] computes the targets of control instructions rather than predicting them. Computing is slower than predicting, but computing tens of branches in parallel is more efficient than predicting tens of branches in sequence: parallelism is more cost-effective than a sequential predictor, even a perfect one.

¹ Instruction Set Architecture
The fork Machine Instruction
The first 18 lines on figure 3 are an x86 translation of the sum C code (AT&T syntax). The call and ret instructions are replaced by fork and endfork. The semantics of the fork instruction is to keep on fetching along the continuation path, i.e. go to the fork instruction label. Simultaneously, a new core starts fetching along the resume path, i.e. at the instruction following the fork in the text. The semantics of the endfork instruction is to stop fetching along the current path. Contrary to the call and ret pair, no return address is pushed or popped.
Moreover, the given code assumes the forked path (i.e. the resume code) receives a copy of the stack pointer (i.e. x86 register rsp), meaning that both paths use the same stack area. The hardware in [5] also copies rbp, rdi, rsi and rbx. These copies are better than push/pop because a push in a function prologue and a pop in its epilogue create RAW dependences between the epilogue and the prologue of the next function call, serializing them.
In the code on figure 3, registers rsi and rdi hold arguments n and array, respectively. The sum function code leaves the computed sum in register rax (lines 3, 5 and 16), from where it is read by the resume path (lines 11 and 16).
Memory Renaming
In the hardware presented in [5], the trace is run in the partial order of its dependences. False register dependences are removed by renaming [13]. Many true register dependences are also removed by copying. For example, the computation of n − n/2 in line 13 (figure 3) depends on registers set in lines 7 and 8. As rbx and rsi are copied to the resume path, line 13 can be run in parallel with the continuation path in line 1.
False memory dependences are also removed. For example, as the stack pointer moves back and forth (allocation in line 10 and deallocation in line 17), different code portions use the same location, creating Write-After-Read and Write-After-Write memory dependences which are removed by renaming.
Memory renaming schemes have been proposed [10] [15], mainly to accelerate loads. The renaming relies on a predictor to quickly decide if a load depends on a previous store. However, a prediction-based mechanism is not suited to eliminating false memory dependences. In [5], the hardware memory renaming is based on a search along the total order of the instruction trace.
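To make the idea of a search along the trace concrete, here is a minimal software sketch, not a description of the hardware in [5]. It assumes that registers and memory locations are both identified by plain integer ids and that each trace entry writes at most one location. The names trace_entry, find_producer and NO_PRODUCER are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PRODUCER ((size_t)-1)

/* One instruction of the totally ordered trace (simplified). */
typedef struct {
    int dest;  /* id of the location written, or -1 if none */
    int src;   /* id of the location read, or -1 if none */
} trace_entry;

/* Scan the trace backward from the consumer and return the index of the
   closest earlier entry that writes location loc, or NO_PRODUCER. */
static size_t find_producer(const trace_entry trace[], size_t consumer, int loc) {
    assert(loc >= 0);
    for (size_t i = consumer; i-- > 0; ) {
        if (trace[i].dest == loc) return i;
    }
    return NO_PRODUCER;
}
```

Because the scan always stops at the closest writer, an older write to the same location never reaches the reader, which is why Write-After-Read and Write-After-Write dependences disappear in this model.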
Renaming is parallelized. The destinations of the two recursive calls are renamed simultaneously. Destinations can be renamed in any order. Sources can be renamed out-of-order provided that (i) the trace is totally ordered and (ii) all the destinations between a source and its producer are renamed.
The totally ordered trace is built from pieces which are fetched in parallel and later connected. When fetched, each instruction is renamed. A renamed place is allocated to hold the destination and, for the sources, the closest producer is looked for by a backward search through the built trace. If a piece of the trace is missing, the search is suspended until the trace is extended. The fetch of the next instructions continues during the source renaming search.

Figure 4 shows the parallel fetch of the sum function. The fetch is distributed on 11 cores². It is done in 7 successive steps. At the first step, only core 1 is fetching. It fetches the start of the sum function code at line 1 (the instructions are the ones on figure 3). The rsi register holds the number n of elements to be summed, i.e. n = 10. Instructions on lines 1 and 2 are fetched and computed, requiring no source renaming as the value of register rsi is known. The control is transferred to line 7. On figure 4, the upper leftmost rectangle box contains the fetched line numbers (i.e. 1, 2 and from 7 to 9).

[Figure 3, surviving fragment of the x86 listing: init: cmpq $2, %rsi]
Parallel Construction of the Totally Ordered Trace
The instruction on line 9 (figure 3) is a fork. Core 6³ (figure 4) receives the resume address from core 1 (i.e. line 10). At step 2, core 1 fetches the continuation path (line 1) and core 6 fetches the resume path (line 10). At step 3, four cores are fetching in parallel. Each core is able to compute its own control path and continue fetching independently from other cores.

[Figure 5, surviving fragment of the for_recursive function:]
; return; }
if (n == 2) { (*body)(f); (*body)(f+1); return; }
if (n != 0) {
  for_recursive(f, f+n/2-1, body);
  for_recursive(f+n/2, l, body);
The path followed by a core is a section. A section starts after a fork instruction along the resume path and ends when an endfork instruction is reached. On figure 4, there are 11 sections, i.e. one per core. Sections may be long (e.g. a recursive descent like the section on core 1) or short (e.g. the transmission of the partial sum on core 4). On average, sections are around 10 instructions long, resulting in a fine grain parallelization.
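The count of 11 sections for n = 10 can be reproduced with a small model, given here as a sketch under one assumption: every recursive call of sum is compiled into exactly one fork, and each fork starts exactly one new section. The helper names count_forks and count_sections are hypothetical.

```c
#include <assert.h>

/* Number of fork instructions executed when summing n >= 1 elements. */
static unsigned long count_forks(unsigned long n) {
    assert(n >= 1);
    if (n <= 2) return 0;  /* the n == 1 and n == 2 cases make no call */
    /* one fork per recursive call, plus the forks executed by each half */
    return 2 + count_forks(n / 2) + count_forks(n - n / 2);
}

/* The initial section plus one new section per fork. */
static unsigned long count_sections(unsigned long n) {
    return 1 + count_forks(n);
}
```

With n = 10, the calls with more than two elements are sum(10), two sum(5) and two sum(3), which gives 10 forks and therefore 11 sections.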
The hardware in [5] builds the trace total order by linking the sections. Each section is linked to its successor and predecessor (dashed bidirectional lines on figure 4)⁴. Sections belonging to the same hierarchical level are also linked (plain lines, level predecessor). For example, cores 1, 6 and 11 fetch the highest level of the sum function, i.e. lines 1, 2, 7 to 9 (core 1), 10 to 15 (core 6) and 16 to 18 (core 11). The three involved sections are linked (plain lines). These direct links help in finding stack renamings, bypassing the sub-trees. For example, instruction 16 on core 11 finds the first half sum on top of the stack, saved by instruction 11 on core 6, without waiting for the construction of the trace sub-tree of cores 7 to 10.
Many instructions do not need any source renaming as their sources have known values. In the sum example, core 1 does not rename any source as a copy of register rsi (i.e. rsi = 10) is received from the caller of the sum function. Only sources referencing a production of another section need to be renamed. For example, core 6 renames rax on line 11. This renaming is to be provided by core 5, which is the closest predecessor of core 6 producing rax, on line 16. The predecessor link from core 6 to core 5 is established at step 4, when core 5 has reached its endfork instruction (line 18). Core 6 sends a request to read rax to core 5, which returns the rax value to core 6 when it is computed.

In the sum example, there are 25 renamed sources. Ten of them concern array elements (lines 3 and 5 on figure 3). Five are references to the stack (i.e. (rsp), line 16). The last ten are references to register rax (lines 11 and 16).

³ The core numbers are chosen arbitrarily.
⁴ Successor links serve to export retired data.
The renamings of the array elements references deserve a particular treatment to avoid a long distance search of their producer in the trace. This is explained in section 4. The renamings of rax references need a round-trip communication between the renaming section and its predecessor (one way sends a request to read rax and the return way sends rax value). The renamings of the stack references 0(rsp) need a round-trip communication between the renaming section and its predecessor at the same hierarchical level (level predecessor). For example, line 16 in core 11 renames 0(rsp) from producing line 11 in core 6.
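The breakdown of the 25 renamed sources can be cross-checked with a small counting model. This is a sketch under stated assumptions: each array element is read once by a call with one or two elements, and each call with more than two elements reads rax twice (lines 11 and 16) and the stack top once (line 16). The names renamed_sources and count_renamed are hypothetical.

```c
#include <assert.h>

/* Counters for the three kinds of renamed sources in the sum run. */
typedef struct {
    unsigned long array_reads;  /* array elements (lines 3 and 5) */
    unsigned long stack_reads;  /* 0(rsp) references (line 16) */
    unsigned long rax_reads;    /* rax references (lines 11 and 16) */
} renamed_sources;

/* Accumulate into c the renamed sources of sum applied to n >= 1 elements. */
static void count_renamed(unsigned long n, renamed_sources *c) {
    assert(n >= 1);
    if (n <= 2) {
        c->array_reads += n;  /* a leaf call reads its one or two elements */
        return;
    }
    c->stack_reads += 1;  /* reload of the first half sum */
    c->rax_reads += 2;    /* one read of rax per half sum */
    count_renamed(n / 2, c);
    count_renamed(n - n / 2, c);
}
```

For n = 10 the model yields 10 array reads, 5 stack reads and 10 rax reads, i.e. the 25 renamed sources quoted in the text.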
Full Renaming and Memory Management
As all destinations are renamed, the trace has a Dynamic Single Assignment form [16]. As described in [5], each core keeps its renamings in locally allocated storage resources. The cores have no data caches. The intermediate computations are kept in renaming storage until they are freed. Renaming storage is allocated up to the core capacity. When the core is full, its fetch is suspended⁵. Only the final computations are saved to physical memory, when retired. As a result, the processor memory is naturally coherent as there is a single writer.
Such parallelizing cores are simpler than current speculative cores as they need no branch predictor, no data cache and no vector computing unit. Hence, they are better suited to be the building block of future manycore processors.
Parallelizing Loops
Single for Loops
A single for loop is transformed into a divide-and-conquer tree of parallel iterations. For example, figure 5 is the transformation of the array initialization loop. Figure 6 shows the x86 code produced by the compiler for the for_recursive function translating the for loop. Figure 7 shows the distributed run of the loop (only the first 5 iterations are fully shown; the last 5, b5 to b9, are folded; the full tree has 9 steps distributed on 20 cores⁶). The body calls can communicate (i.e. body(j) can import a value from body(i) for all i < j). Each body section has its own control flow. A parallelizable loop has independent iterations. In this case, there is no communication between the body sections and they all have independent control flows.
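The divide-and-conquer shape of the transformed loop can be sketched in plain C as follows. This is a sequential sketch only: on the proposed hardware each recursive call would be a fork, which ordinary C cannot express. The array, SIZE and the body function init_body are hypothetical, and signed indices are an assumption made here to handle the empty range simply.

```c
#include <assert.h>

#define SIZE 10
static long array[SIZE];

/* Hypothetical loop body: iteration i initializes array[i]. */
static void init_body(long i) {
    assert(i >= 0 && i < SIZE);
    array[i] = i;
}

/* Run body(f), body(f+1), ..., body(l) as a binary tree of calls. */
static void for_recursive(long f, long l, void (*body)(long)) {
    if (l < f) return;  /* empty iteration range */
    long n = l - f + 1; /* number of iterations left */
    if (n == 1) { body(f); return; }
    if (n == 2) { body(f); body(f + 1); return; }
    for_recursive(f, f + n / 2 - 1, body);  /* first half of the range */
    for_recursive(f + n / 2, l, body);      /* second half of the range */
}
```

A loop such as for (i = 0; i < SIZE; i++) array[i] = i; then becomes the single call for_recursive(0, SIZE - 1, init_body).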
Nested Loops
Nested for loops are transformed into two nested divide-and-conquer trees of parallel iterations. For example, figures 8, 9 and 10 show the translation of two nested loops initializing a matrix. The x86 code is not shown but is easy to build. The run is fully parallelized with 100 sections organized as a binary tree. It is still easy for iterations of the inner loop to communicate. It is less easy for iterations of the outer loop, as the sections of intermediate inner loops may put a large distance between a consumer and its producer. In this case, the stack may help (the producer pushes and the consumer pops). Moreover, section 4 presents a general programming technique to optimize inter-section communications.

Figure 11 shows a while loop. Figure 12 shows the transformation of the while loop into a for loop. The for loop is then transformed into a recursive function with a continuation condition (function cont_cond). The run deploys a binary tree of l − f + 1 calls to the for_break_recursive function. The calls are run in parallel.
While Loops
In [2] , the authors give many solutions to automatically transform loops that are difficult to parallelize at compile time.
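The transformation of a while loop into a recursive function with a continuation condition can be pictured with the sketch below. It is not the paper's figure 12: the signature, the return convention and the example data are assumptions made here, and the code only shows the sequential semantics that the parallel run has to preserve.

```c
#include <assert.h>

/* Hypothetical example data: the loop stops at the first zero element. */
static long data[8] = {3, 1, 4, 1, 0, 9, 2, 6};
static long total = 0;

static int cont_cond(long i) { return data[i] != 0; }  /* continuation test */
static void add_body(long i) { total += data[i]; }     /* loop body */

/* Run the iterations of [f, l] in order until the continuation condition
   fails. Returns the index of the first iteration that did not run,
   or l + 1 if all of them ran. */
static long for_break_recursive(long f, long l,
                                int (*cond)(long), void (*body)(long)) {
    if (l < f) return f;  /* empty range */
    if (f == l) {
        if (!cond(f)) return f;  /* the loop exits before iteration f */
        body(f);
        return f + 1;
    }
    long mid = f + (l - f) / 2;
    long stop = for_break_recursive(f, mid, cond, body);
    if (stop <= mid) return stop;  /* exit found in the first half */
    return for_break_recursive(mid + 1, l, cond, body);
}
```

The call for_break_recursive(0, 7, cont_cond, add_body) behaves like i = 0; while (i < 8 && data[i] != 0) { total += data[i]; i++; }.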
A Programming Technique to Increase the Quality of Parallelization
The Quality of Parallelization
A well parallelized execution should be composed of many parallel sections (fine grain is better than coarse grain), i.e. the average size sa of the sections should be low. This allows a simultaneous fetch on all the cores, i.e. it sets the peak Instructions Per Cycle (IPC) value to the number of cores. The communications between the sections should be rare and short distance. In other words, the average number of communications per section ca should be low and the average distance da from the producer to the consumer should be short, i.e. the number of visited sections should be low. Finally, the number of instructions run ni should be low. Among these four factors, da is the most important one as communications are expensive.

[Surviving fragment of the nested loops translation (figures 8 to 10):]
[j]=i*10+j; */
// nested loops translated as
// for_outer_recursive(first_outer, last_outer,
//                     outer_before, outer_after,
//                     first_inner, last_inner, inner_body)
for_outer_recursive(0, 9, empty, empty, 0, 9, init);
} ... }

Figure 11: [caption not recoverable]

[Surviving fragment of figure 12:]
rray[f]), p)) return f;
if (n==1){ (*body)(&i); *x=array[i]; return i;
p, &x1, increment); ... }

Figure 12: The while loop is transformed into a for loop with a continuation condition
To increase the parallelization quality, we can decrease sa (i.e. either by increasing the number of sections or by decreasing ni). We can also decrease ca and da (increasing the number of sections should not increase the average communication distance). One way to keep da low is to recompute a value each time it is used. This should not increase ni too much (the recomputation should not be complex). Figure 13 shows a main function calling sum. Figure 3 is its x86 translation. The for loop is translated into a parallelized function init. The fetched trace of the init function has the same section decomposition as that of the sum function. Figure 4 can be adapted to the init function fetch. For example, the upper leftmost box corresponds to the fetch of lines 19, 20 and then from 26 to 28. Figure 14 shows the section decomposition of the main function run. A starting section (upper left rectangle) is continued by the upper leftmost rectangle of figure 4 (hence, lines 35 to 38 are followed by the init function tree on figure 4 and all belong to the first section).
This section forks a new section to start the execution of the sum function (hence, the sum is computed in parallel with the array initialization). The forked section starts at line 39 and continues as the sum function tree on figure 4. Hence, lines 39, 1, 2, from 7 to 9, 1, 2, from 7 to 9 and from 1 to 6 belong to the same section computing array[0]+array[1].

    unsigned long n, long (*get)(), unsigned long i){
  if (n==1) return (*get)(array, i);
  if (n==2) return (*get)(array, i) + (*get)(array, i+1);
  return sum(array, n/2, get, i) + sum(&array[n/2], n-n/2, get, i+n/2);
}
static inline long set_array_elem(long array[], unsigned long i){
  array[i]=i;
}
static inline long get_array_elem(long array[], unsigned long i){
  set_array_elem(array, i);
  return i;
}
void main(){
  long s=sum(array, SIZE, get_array_elem, 0);
  printf("s=%ld\n", s);
}
Figure 15: Modified main and sum functions with short producer to consumer distance
An Example: the sum Function
There are 105 instructions run for the sum computation and 93 for the array initialization, hence ni = 198 (neglecting the printf run). There are 23 sections, i.e. sa = 198/23 = 8.61 (less than 9 instructions per section on the average).
There are 90 register and memory sources in the init run⁷, all of them having a distance of 1 (i.e. requiring no communication).
There are 122 sources in the sum run, 15 of them having a distance of 2, 10 having a distance⁸ of 12 and the remaining 97 having a distance of 1. Hence the average distance over the 90 + 122 = 212 sources is da = 337/212 = 1.59. The average number of communications per section is ca = 25/23 = 1.09 (only 25 sources among 212 need an import from another section, i.e. have a distance greater than 1).
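The three averages can be recomputed from the counts given above. The sketch below is only a calculator for the definitions of sa, ca and da. The type and function names (distance_bin, quality, parallelization_quality) are hypothetical, and a source is counted as a communication when its distance is greater than 1, as in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sources found at a given producer to consumer distance. */
typedef struct {
    unsigned long distance;
    unsigned long count;
} distance_bin;

/* The three averages used to assess the parallelization quality. */
typedef struct {
    double sa;  /* average section size */
    double ca;  /* average number of communications per section */
    double da;  /* average producer to consumer distance */
} quality;

static quality parallelization_quality(unsigned long ni, unsigned long sections,
                                       const distance_bin bins[], size_t nbins) {
    unsigned long sources = 0, total_distance = 0, imports = 0;
    for (size_t b = 0; b < nbins; b++) {
        sources += bins[b].count;
        total_distance += bins[b].distance * bins[b].count;
        if (bins[b].distance > 1) imports += bins[b].count;  /* needs an import */
    }
    assert(sections > 0 && sources > 0);
    quality q = {
        (double)ni / (double)sections,
        (double)imports / (double)sections,
        (double)total_distance / (double)sources
    };
    return q;
}
```

For the run above, the bins are 187 sources at distance 1 (90 from init and 97 from sum), 15 at distance 2 and 10 at distance 12, with ni = 198 and 23 sections. The total distance is 187 + 30 + 120 = 337 over 212 sources.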
Modifying the sum Function to Enhance the Quality of the Parallelization
To get the array values, function sum sections send renaming requests which travel along the trace, visiting one core per section in the trip, i.e. requiring 12 core-to-core communications. If these distances can be shortened, i.e. if the sum function section consumer can be closer to its init function section producer, the execution time will be reduced. Reducing da enhances the parallelization quality. Figure 15 shows modified main and sum functions. A value is recomputed each time it is referenced rather than being transmitted. In manycore processors, local computation is cheaper than distant communication.
The array initialization is fused in the sum function. The sum function gets each array element. The get function calls a set function which sets the array element. Hence, each time an array element is read, it is also written. This ensures that a consumer and a producer belong to the same section.
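A minimal runnable sketch of this fusion, in the spirit of figure 15, is given below. Two assumptions are made here and are not taken from the paper: the base pointer of the array is passed unchanged to both recursive calls so that i is always a global index, and set_array_elem returns nothing. The get_fn type name is hypothetical.

```c
#include <assert.h>

#define SIZE 10
static long array[SIZE];

/* Hypothetical type of the get function passed to sum. */
typedef long (*get_fn)(long base[], unsigned long i);

/* Producer: writes element i of the array. */
static inline void set_array_elem(long base[], unsigned long i) {
    base[i] = (long)i;
}

/* Consumer: initializes element i just before using its value, so that
   the write and the read happen in the same place. */
static inline long get_array_elem(long base[], unsigned long i) {
    set_array_elem(base, i);
    return (long)i;
}

/* Sum the n >= 1 elements base[i] .. base[i+n-1], obtained through get. */
static long sum(long base[], unsigned long n, get_fn get, unsigned long i) {
    assert(n >= 1);
    if (n == 1) return get(base, i);
    if (n == 2) return get(base, i) + get(base, i + 1);
    return sum(base, n / 2, get, i) + sum(base, n - n / 2, get, i + n / 2);
}
```

Calling sum(array, SIZE, get_array_elem, 0) both fills the array with 0 to 9 and returns their sum, with no separate initialization pass.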
As a general rule, a function has new arguments: a pointer to a get function to read the variables used in the computation, a structure to encapsulate the arguments of the get function, a pointer to a put function to use the computed variables and another structure to encapsulate the arguments of the put function. Figure 16 shows the x86 translation of the program depicted on figure 15. Figure 17 shows the run trace of the modified sum program. The number of instructions run is ni = 122 (38% less work). There are 11 sections, i.e. sa = 122/11 = 11.09. The number of sections is reduced and the average size of sections is increased (i.e. the run is artificially less parallel because the reduction of the work has reduced the number of sections). There are 146 sources, among which 15 have a distance of 2 and all the others have a distance of 1. The average communication distance is da = 161/146 = 1.10. The number of communications per section is ca = 15/11 = 1.36.
The key factor is da, which is reduced by 31%. All sections get their sources locally or from the preceding section, the latter requiring a round-trip core-to-core communication.
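The two relative reductions quoted for the modified program follow directly from the figures given for the original and modified runs, as the short computation below shows (the superscripts "orig" and "mod" are notations introduced here).

```latex
\[
\frac{n_i^{\mathrm{orig}} - n_i^{\mathrm{mod}}}{n_i^{\mathrm{orig}}}
  = \frac{198 - 122}{198} \approx 0.38,
\qquad
\frac{d_a^{\mathrm{orig}} - d_a^{\mathrm{mod}}}{d_a^{\mathrm{orig}}}
  = \frac{1.59 - 1.10}{1.59} \approx 0.31
\]
```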
Instead of having two separate computations, one to initialize the array and one to compute the sum of its elements, we have a single one, fusing the initialization into the computation. The programming technique which favours parallelization mimics a dataflow style: variables are not initialized and later used but their initialization is delayed until they are used.
When a variable is to be used multiple times, it can either be stored or recomputed. It is the compiler's job to estimate whether recomputation is better (i.e. reduces da enough without increasing ni too much). To be accessed quickly, a scalar variable can be stored in the stack. To look for it, a renaming request travels along the level predecessor links, bypassing sub-trees.
Improving a Matrix Multiplication Program
Figure 18 presents a non-optimized C code to compute matrix multiplication. Optimized versions take advantage of cache locality to reduce the average access time to the input matrix elements. In a parallel environment, very distant elements are requested simultaneously (or with a short time gap) and, in this case, the cache does not help. Instead, it is a better policy to keep each producer close to its consumer.

[...], it is even worse because the sections initializing b must be visited before reaching those initializing a, i.e. at least p * n + (m − i − 1) * p + (p − k − 1) + 1 = (n + m − i) * p − k sections. In the example, an element is found after an average of 10.6 visited sections. As it is by far the dominant factor in the number of imported resources, da is around 10 (there are 2 * m * n * p reads to compute m * n * p products).
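The equality between the two forms of the section count quoted above can be checked by expanding the left-hand side, using the same symbols m, n, p, i and k as in the text:

```latex
\[
\begin{aligned}
p\,n + (m - i - 1)\,p + (p - k - 1) + 1
  &= p\,n + m\,p - i\,p - p + p - k - 1 + 1 \\
  &= p\,n + m\,p - i\,p - k \\
  &= (n + m - i)\,p - k
\end{aligned}
\]
```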
Figures 19 and 20 show the modified implementation, which reduces da, the distance from producer to consumer.
The first part of the code defines two types to encapsulate the needed values for the get and put functions. It also contains [...]

Figure 20: A C program to multiply matrices: matrices multiplication and main functions

[...] fine grain dynamic parallelization enabled by the hardware core, thanks to register and memory renaming allowing producer/consumer synchronizations.
Parallelization based on OS threads suffers from the rather high overhead of OS primitives.
The existing contributions [8] [9] [11] [12] on a hardware approach to automate parallelization suffer from the low basic Instruction Level Parallelism (ILP). The hardware based parallelization in [5] overcomes this limitation in 3 ways: (i) very distant ILP is caught because fetch is parallelized, (ii) all false dependences are removed through full renaming and (iii) many true dependences are removed by copying values. The remaining dependences in a run are true ones related to the sequentialities of the algorithm which the program implements. In such conditions, the authors in [4] have reached a high ILP (thousands), increasing with the data size, on benchmarks taken from the PBBS suite (parallel applications; available at URL http://www.cs.cmu.edu/~pbbs/index.html).
The number of transistors on a chip allows the integration of thousands of simple cores. It is urgent that any program, including the OS itself, be parallelized. Parallelization should be done quickly and reliably, with reproducible computations, which is ensured if the run is deterministic.
The manycore computation model introduced in this paper enables dynamic parallelization using fork machine instructions across function calls and loop iterations. Such fine grain parallelization is at odds with the well known locality properties, which obliges us to introduce the new parallel locality property and the parallelization quality. Finally, we have shown that some program transformations may improve the parallelization quality, which calls for revisiting program optimization.
