In this paper we address the issue of efficient doall workload distribution on an embedded 3D MPSoC. 3D stacking technology enables low latency and high bandwidth access to multiple, large memory banks in close spatial proximity. In our implementation one silicon layer contains multiple processors, whereas one or more DRAM layers on top host a NUMA memory subsystem. To obtain high locality and a balanced workload we consider a two-step approach. First, a compiler pass analyzes memory references in a loop and schedules each iteration to the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated an unbalanced workload, we allow idle processors to execute part of the remaining work from neighbors by implementing runtime support for work stealing.
INTRODUCTION
Over the last decade the role of on-chip memory hierarchies has become extremely important, both to hide the ever-increasing gap between processor and main memory speed and to improve the energy efficiency of memory accesses. At the first level of the hierarchy there are fast and low-power SRAM memories, tightly coupled to processors. Processor-to-memory communication leverages high-bandwidth and low-latency dedicated connections, which also ensure energy-efficient accesses. L1 memories, however, are very small in size and thus their content must be continuously updated to accommodate the most frequently referenced program parts. At the opposite end of the spectrum there is a large, off-chip main DRAM memory, capable of hosting the entirety of program and data items. The coupling between processors and DRAM is very loose. DRAM is accessed through on-chip I/O controllers which exploit sophisticated addressing mechanisms, thus making the corresponding accesses slow and energy-hungry. Furthermore, DRAM controllers are shared among processors, which limits scalability as the complexity of the system increases.
CASES '10, October 24-29, 2010, Scottsdale, Arizona, USA. Copyright 2010 ACM 978-1-60558-903-9/10/10.
Three-dimensional (3D) stacking technology provides a number of means to overcome the scalability limitations imposed on many-core integrated platform designs as 2D technology reaches the nanometer scale, both in general purpose and embedded computing [9] [8] [7] [12]. The traditional design constraint that the processor and memory subsystems must be placed side by side can be overcome with 3D stacking, where they can be placed on top of each other and linked through vertical interconnects based on Through-Silicon Via (TSV) technology, which are more than two orders of magnitude more energy-efficient and denser than the most advanced off-chip I/O channels. Focusing on the high-end embedded domain, this groundbreaking technology will enable the construction of multi- and many-core data-processing systems with low latency and high bandwidth access to multiple, large DRAM banks in close spatial proximity.
In this work we focus on a concrete embodiment of this model targeting embedded computing, namely a 3D integrated platform for multi-dimensional array processing (e.g. antenna arrays, radar images, video images). We assume an implementation with one silicon layer containing multiple processors organized in a two-dimensional mesh structure (communicating through a Network on Chip), and one or more DRAM layers on top containing the entire memory subsystem. Following a common trend in the high-end embedded Multi- and Many-Processor System on a Chip (MPSoC) domain, we consider an explicitly managed memory subsystem, based on tightly coupled memories and DMA, which adheres to the Partitioned Global Address Space (PGAS) paradigm. In this memory model each processor has quasi-ideal access to a vertical stack of memory banks in close vertical proximity. Memory transactions towards remote stacks travel through a horizontal on-chip interconnect (NoC), and are thus subject to a cost that increases with distance.
Unlike traditional 2D MPSoCs with small local scratchpads (SPM), whose content must be frequently updated to minimize costly accesses to off-chip DRAM, in this work we face the problem of properly partitioning program and data to maximize the accesses to local DRAM stacks. Our focus is on array-intensive applications, structured as a set of doall loops, whose iterations can be independently distributed among processors. A naive assignment of iterations to processors, namely one which is unaware of the architectural assumptions and/or of the task and data mapping, leads to poor locality of memory references and/or load imbalance at runtime. Rather than relying on frequent cache- or DMA-initiated data transfers to improve locality, we schedule loop iterations in a locality-aware manner. Shared array structures in a target program are divided into as many tiles as processors, and tiles are distributed among their vertical DRAM stacks. We developed a compiler analysis pass which is aware of this data layout. All memory accesses within a parallel loop are inspected to determine which particular tile (i.e. which memory) is most frequently referenced at each iteration. The iteration is assigned to the processor hosting the tile. Our compiler statically inserts in the program the definition of queues, local to each processor, containing the description of work with high locality. Loops are restructured in such a way that at each iteration the work is fetched from these queues. The analysis pass requires that the access pattern performed on arrays is a statically analyzable affine function of the loop iterator. If this is not the case the analysis fails. However, we still allow locality-aware parallelization in this situation through profiling. Our compiler can instrument the program so as to gather access pattern information during a profile run.
Profiling information enables the creation of high-locality work descriptors (queues) in case of an irregular application. As explained above, shared arrays are regularly distributed among memories. If the access pattern is not regular, or the loop iteration space does not overlap with the data space, it is possible that a subset of the tiles is accessed more frequently than the rest. This ultimately leads to assigning more iterations to the processor(s) holding these tiles. Stated another way, the queues may contain non-uniform amounts of work among processors, thus leading to unbalanced execution time. To address this issue we augment our technique with runtime support for work stealing. Idle processors are allowed to steal part of the remaining work from remote queues in a locality-aware manner, thus achieving balanced execution and locality of references.
We compare our techniques against traditional data distribution or dynamic scheduling policies, and demonstrate that an integrated approach such as ours achieves better results. Combining locality-aware loop scheduling and work stealing leads to ≈50% speedup against data distribution for balanced loops where locality of references dominates the execution time, and ≈69% speedup against dynamic scheduling.
The remainder of the paper is organized as follows. In Section 2 we discuss research related to ours on programming abstractions for NUMA machines and 3D MPSoCs. The target architectural template and its memory model are discussed in detail in Section 3. Our locality-aware parallelization and stealing techniques are described in Section 4, whereas experimental setup and results are presented in Section 5. Section 6 concludes the paper.
RELATED WORK
Recently, several 3D memory designs have been announced, confirming the benefits of 3D technology for high-efficiency next-generation memory systems [10] [11]. Kgil et al. [12] present a high performance server architecture where DRAM is stacked on a multicore processor chip. Overall power improvements of 2-3× with respect to a 2D multi-core architecture are reported. Similarly, [14] presents a 3D stacked memory architecture for CMPs. By changing the internal DRAM architecture the author claims a 75% speedup. Industry leaders IBM and Intel are active in technology and architecture exploration [7] [3]. Li et al. investigate in [13] the challenges for L2 design and management in 3D chip multiprocessors. Their term of comparison is 2D NUCA (Non-Uniform Cache Access) systems, which employ dynamic data migration to place more frequently accessed data in the cache banks closer to the processor. Experiments show that a 3D L2 memory design with no dynamic data migration performs better than a 2D architecture that employs data migration.
3D memory integration is also actively explored in the embedded computing domain. All major players in the mobile wireless platform markets are very actively looking into how to integrate memories on top of MPSoC platforms for next-generation hand-held terminals [9]. More generally, the system size reduction, coupled with orders-of-magnitude improvements in memory interface energy efficiency, is a key enabler for disruptive innovation in embedded computing [8], possibly even more than in performance-centric general-purpose computing. In [17] Ozturk et al. explore core and memory block placement in a 3D architecture with the goal of minimizing data access costs under temperature constraints. Using integer linear programming, the best 2D placement and the best 3D placement are compared. Experiments with single- and multi-core systems show that the 3D placement generates much better results (in terms of data access costs) under the same temperature bounds.
Concerning the parallelization techniques that we propose in this paper, useful background work regarding the implementation of stealing policies can be found in [18] [1] [4], whereas related research on data distribution is presented in [5] [2] [6] [16]. The main similarities between the techniques proposed there and ours are to be found in the language and programming model abstractions. Indeed, several patterns proposed in the past to efficiently program NUMA machines can be successfully adopted in the context of 3D MPSoCs. From the implementation point of view, however, the radical architectural differences between these machines require an in-depth reassessment of such techniques, because completely different hardware and software support is available for their construction.
A two-step approach to efficient loop parallelization which, like ours, leverages static compiler analysis to schedule iterations in a locality-aware manner and runtime support for load balancing is proposed by Xue et al. in [19]. Unlike ours, however, their approach is targeted at cache-based machines, and thus their analysis aims at grouping together iterations which show some data reuse. Our technique considers distributed shared memory MPSoCs with NUMA latencies, where data distribution is employed instead of caching. In this context, efficient locality-aware loop scheduling is rather aimed at allocating each iteration to the processor owning the data set referenced therein.
TARGET ARCHITECTURE AND MEMORY MODEL
The platform template targeted by this work is the 3D-stacked MPSoC depicted in Fig. 1. The bottom layer hosts the 2D multicore subsystem, whereas the topmost layer(s) consist of DRAM memory banks [14]. Processing elements (PE) on the multicore die feature a core tile, composed of a RISC-like CPU, a small amount of local L1 memory (SPM, caches) and a DMA engine. Each PE also hosts a set of local hardware semaphores (implemented as a bank of registers with test-and-set read semantics) and a fast DRAM controller with a TSV DRAM physical interface for vertical communication to upper layers. Transactions towards remote memory neighborhoods are routed out of the PE by a Network Interface (NI), which injects them into the on-layer network (NoC) for horizontal communication. All the described IPs are interconnected through a crossbar, which is also in charge of determining whether memory references issued locally are to be transported vertically or towards the outer world.
The memory subsystem leverages a Partitioned Global Address Space (PGAS) organization, and is thus accessible from the bottom layer by every tile through the described heterogeneous 3D interconnection. All of the on-chip memory modules are mapped in the address space of the processors, globally visible within a single shared memory space, as shown in Fig. 2. Despite this unique view of the memory space, each PE has a certain amount of tightly coupled physical memory, which we refer to as the processor's memory neighborhood, and which is organized as a two-level hierarchy. L1 memory within each PE features separate instruction and data caches, plus scratchpad memory (SPM). Moreover, each PE is logically associated with a vertical stack of local L2 DRAM memory. The latter is logically organized in two parts: a shared segment (which constitutes part of the global shared memory), plus a (conceptually) private segment, where by default program code and data private to the core are allocated.
L1 caches in this template are non-coherent, as hardware cache-coherence protocols are very expensive in terms of area, and scale poorly. To prevent inconsistencies, we only allow each processor's private data and code to be cached. The logically private segment of each memory neighborhood is the only one that can by default be cached. Shared segments can only be directly accessed through the processor or DMA. It is the programmer's responsibility to deal with coherency issues in case multiple copies of shared data are allowed. Similarly, if shared data is allowed to be cached, appropriate actions (e.g. flushes) must be taken in software.
VERTICAL STEALING
Our 3D MPSoC leverages a Partitioned Global Address Space (PGAS) organization of the memory subsystem, which has some affinities with traditional Cache Coherent Non Uniform Memory Access (CC-NUMA) multiprocessors (e.g. the SGI Origin). Such machines typically contain a large number of processing nodes, each with one or more processors and a portion of main memory, connected through a scalable interconnection network. Although global memory is uniformly accessible by all the processors, remote memory latencies are typically much larger than local memory latencies. To obtain high performance on CC-NUMA machines it is often necessary to distribute the data structures in the program so as to maximize the number of cache misses of each processor that are satisfied from local rather than remote memory. Data distribution (a.k.a. array partitioning) splits the main arrays in the program into a set of tiles, which can be independently mapped on different memory nodes. Language abstractions and compiler techniques to enable data distribution in a program have been proposed in the past [5] [2] for CC-NUMA multiprocessors. Two common array distribution strategies are block and cyclic. Block distribution splits arrays into as many tiles as nodes, thus assigning equally sized tiles to each processor. Cyclic distribution allows the programmer to specify a partitioning granularity (i.e. a tile size). In both cases tiles are dealt out to processors in a round robin fashion. We describe our implementation of such facilities in Sec. 4.1.
To achieve high data locality it is important that a thread
Let us consider a target architecture composed of 4 processors. In the example above the array A is block distributed in 4 tiles of 4 elements each among the available memory neighborhoods. The array A is indexed with the loop induction variable i. This regular access pattern is amenable to static loop parallelization, where consecutive iterations are folded in chunks of 4 and assigned to processors in a round robin fashion. In this simple example there is perfect affinity between each thread and the referenced dataset. When more complicated access patterns are executed, block distribution fails to deliver good locality. To solve this issue arrays should be re-distributed, in an attempt to match the array access pattern exhibited by the running thread. To re-distribute arrays we adopt DMA transfers, which update the content of each memory neighborhood. However, this solution suffers from two main issues. First, the array access pattern in a program may change frequently (e.g. across different loops). Trying to re-distribute arrays in memory accordingly may thus require a large number of DMA transfers. Second, 3D technology enables large amounts of memory to be tightly coupled to PEs. As a consequence, large array tiles can be entirely hosted on memory neighborhoods. Frequently moving such large data blocks is likely to compromise the benefits of improved locality.
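The block-distributed scenario discussed above can be illustrated with a minimal sketch. It is not the original listing: the names N_PROCS, N_ELEMS, tile_owner, static_owner and remote_accesses are hypothetical, and the static schedule is assumed to deal chunks of 4 consecutive iterations to processors in round robin order. The shift parameter models a slightly more complicated access pattern, A[i + shift], to show how affinity degrades.

```c
#include <assert.h>

#define N_PROCS 4                    /* processors, one memory neighborhood each */
#define N_ELEMS 16                   /* elements of the distributed array A */
#define TILE (N_ELEMS / N_PROCS)     /* block distribution: 4 elements per tile */

/* Memory neighborhood hosting A[idx]; tiles are dealt out round robin. */
static int tile_owner(int idx)
{
    assert(idx >= 0 && idx < N_ELEMS);
    return (idx / TILE) % N_PROCS;
}

/* Processor executing iteration i under a static schedule with chunks of TILE. */
static int static_owner(int i)
{
    assert(i >= 0 && i < N_ELEMS);
    return (i / TILE) % N_PROCS;
}

/* Number of iterations whose access A[i + shift] targets a remote memory. */
static int remote_accesses(int shift)
{
    int remote = 0;
    for (int i = 0; i + shift < N_ELEMS; i++)
        if (static_owner(i) != tile_owner(i + shift))
            remote++;
    return remote;
}
```

Under these assumptions an access A[i] never leaves the local memory neighborhood, whereas a shifted access such as A[i + 2] sends part of the iterations of each chunk to a neighboring memory.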
An alternative approach to moving data is to schedule loop iterations to processors in a locality-aware manner. More specifically, it is possible to leverage compiler analysis of array accesses in a loop to determine which physical memory is most frequently accessed at a given iteration. The iteration is then scheduled to the processor owning that memory. Our locality-aware parallelization technique performs such analysis and builds work queues containing high-locality tasks (i.e. iteration descriptors) for each processor. A detailed description of the technique is given in Sec. 4.2.
Locality-aware loop parallelization does not require moving array tiles. Given an initial (e.g. block) distribution, all loops are re-structured in such a way that each processor is assigned the iterations that primarily access the tiles hosted on its local memory. A clear drawback of this policy is that processors may be assigned a different number of iterations, thus possibly leading to load imbalance among parallel threads. Let us consider the following example.
    #define ROWS 16
    #define COLS 16
    int pix[ROWS][COLS];
    #pragma omp distributed(pix)

    /* Loop to parallelize */
    for (i = 7; i < ROWS; i++)
        for (j = 7; j < COLS; j++)
The matrix pix is block distributed among four available memory neighborhoods. Each memory hosts a tile of 64 elements. The corresponding cores must be assigned the iterations of a loop which operates on a subset of the matrix (the lower loop bounds are greater than zero). Fig. 3 highlights the part of the array which is accessed in the loop, and the corresponding layout in memory. If locality of accesses drives parallelization, a different number of iterations is assigned to each core. Fig. 4 shows that only one element from tile 0, which is hosted on memory neighborhood 0 (MEM 0), is accessed in the loop, thus processor 0 will be assigned a single loop iteration. On the contrary, all elements belonging to tile 3, hosted on memory neighborhood 3 (MEM 3), are accessed in the loop. Processor 3 will be assigned 64 iterations, thus leading to load imbalance. Remote accesses on our MPSoC are subject to a cost that increases with the distance (i.e. the number of hops traversed in the NoC). However, unlike in CC-NUMA machines, all the communication travels through tightly coupled layers, and thus fetching remote data on our MPSoC is much cheaper than an equivalent access on CC-NUMAs. Consequently, to solve the load imbalance issue we can afford to let idle processors execute iterations originally assigned to other cores. Even if the stolen work has poor locality, the increased cost for remote references may still be repaid by load balancing. We describe in Sec. 4.3 the implementation of our runtime support for work stealing.
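The imbalance in this example can be checked with a short sketch. It assumes that the 16x16 matrix is split into four 8x8 quadrant tiles numbered in row-major order (a layout consistent with 64 elements per tile and with the single element accessed in tile 0), and that each iteration only references pix[i][j]. Both are assumptions, since the loop body is not shown; pix_tile and count_iterations are hypothetical names.

```c
#include <assert.h>

#define ROWS 16
#define COLS 16
#define TILE_ROWS 8
#define TILE_COLS 8
#define N_TILES 4

/* Tile (and memory neighborhood) hosting pix[i][j] under the assumed layout. */
static int pix_tile(int i, int j)
{
    assert(i >= 0 && i < ROWS && j >= 0 && j < COLS);
    return (i / TILE_ROWS) * (COLS / TILE_COLS) + (j / TILE_COLS);
}

/* Count the iterations of the loop nest starting at (first, first) that
   locality-aware scheduling would assign to each processor. */
static void count_iterations(int first, int counts[N_TILES])
{
    for (int t = 0; t < N_TILES; t++)
        counts[t] = 0;
    for (int i = first; i < ROWS; i++)
        for (int j = first; j < COLS; j++)
            counts[pix_tile(i, j)]++;
}
```

With first set to 7, the sketch assigns 1, 8, 8 and 64 iterations to processors 0 to 3 respectively; the two intermediate values follow from the assumed quadrant layout.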
Array Partitioning and Distribution
We
The tilesize parameter is used to specify the granularity of partitioning, namely the size -expressed in terms of array elements -of the elementary tile. The allocation policy follows a cyclic distribution scheme. Block distribution can be triggered by properly tuning the partitioning granularity so as to generate a number of tiles equal to the number of processors. This is automatically done by default if no tilesize parameter is given.
The primary concern when distributing data on CC-NUMA architectures is that physical placement of data must be performed in units of an operating system page, thus constraining the granularity of partitioning. If array tiles are much smaller than a page, data items that we wish to place in the local memories of distinct processors may lie within the same page. This situation leads to false sharing, and requires expensive data transfers within the virtual address space of the process to map different tiles to distinct pages. Alternatively, data padding at the page level can be applied as a workaround, but this leads to significant memory wastage. The scenario is quite different for our MPSoC. No specialized MMU hardware or OS support for virtual memory management is available, thus data partitioning is implemented in a lightweight manner by means of software address translation. Accesses to arrays annotated as distributed in the program are instrumented by our compiler with the instructions necessary to locate the correct memory neighborhood at runtime, as shown in Table 1. In this table t is the size of a tile for the distributed array A, and tiles_A is a compiler-generated metadata array containing the base address of each tile of A. Indexing this array with a tile ID returns the base address of that tile. The ID of the tile being accessed is simply obtained by dividing the current offset (i.e. the array index) by the tile size t. Once the base address of the target tile has been retrieved, a modulus operation between the same operands returns the offset of the reference within the current tile.
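The translation described above (a division to find the tile, a modulus to find the offset within it) can be sketched as follows. This is an illustration under assumptions rather than the compiler's actual output: dist_array_t and dist_elem are hypothetical names, and the tiles field plays the role of the tiles_A metadata array.

```c
#include <assert.h>
#include <stddef.h>

/* Metadata for a distributed array (hypothetical layout). */
typedef struct {
    int **tiles;      /* tiles[id] = base address of tile id (tiles_A in the text) */
    size_t tilesize;  /* t: number of elements per tile */
    size_t n_tiles;   /* number of tiles */
} dist_array_t;

/* Translate a logical index into the address of the element inside its tile. */
static int *dist_elem(const dist_array_t *a, size_t idx)
{
    assert(a->tilesize > 0);
    size_t tile_id = idx / a->tilesize;  /* which tile holds the element */
    size_t offset  = idx % a->tilesize;  /* position inside that tile */
    assert(tile_id < a->n_tiles);
    return a->tiles[tile_id] + offset;
}
```

With such a helper, a read of A[i] in the source program would become *dist_elem(&A_meta, i), where A_meta is the (hypothetical) metadata descriptor of A.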
The availability of such an efficient and streamlined implementation of the necessary support for data distribution allows us to partition arrays at arbitrary granularities without 
Locality-Aware Loop Partitioning
Data distribution schemes such as block or cyclic attempt to capture the most common array access patterns in loop-intensive applications. However, the access pattern may change at different points in a program (e.g. at different loops or parallel regions). For this reason, the programming model or the compiler must either allow arrays to be re-distributed across different regions, or schedule loop iterations according to their affinity with the current layout of array tiles in memory. Continuously re-distributing arrays may lead to a high number of data transfers. On our platform, the amount of memory made available by 3D stacking allows each memory neighborhood to accommodate large tiles, whose frequent movement is likely to significantly impact performance. Affinity-based loop scheduling techniques therefore appear more suitable to address the described issues. An example of such a technique is the Owner-Computes rule from the High-Performance Fortran (HPF) compilation system which, after distributing the ownership of array elements to the processors, assigns the execution of each instruction to the processor owning the variable modified by this instruction (i.e. the Left-Hand Side expression of an assignment statement). This may still lead to a high amount of communication, since components of the Right-Hand Side expression may have to be communicated to the owning processor before the assignment is made.
Therefore we rather look at a locality-aware parallelization strategy which assigns an iteration to the processor whose memory neighborhood hosts the most frequently referenced array tile(s) within that iteration.
To this aim, we developed a static analysis pass in our compiler, which is described in Sec. 4.2.1. This analysis can be applied to counted loops whose array subscripts are affine functions of the loop iterator. In case the loop does not satisfy such requirements, we still allow locality-aware parallelization by leveraging profile information (Sec. 4.2.2).
Static Analysis
The static component of our compilation strategy operates on the following setup. The architectural information required is the number $N$ of processors, to which corresponds a set $M$ of associated memory neighborhoods,
$$m \in M = \{m_1, \dots, m_N\}.$$
Let $L$ be the set of loops in a program, and $D$ the set of distributed arrays. Each loop $l \in L$ has an associated iteration space $I_l$. Within the loop body executed at each iteration $i \in I_l$, a number $K \in \mathbb{N}$ of accesses to distributed arrays is performed. Every access can be characterized by a subscript function $s_{d,j}$, where $d \in D$ and $j \in \{1, \dots, K\}$. These subscripts must be an affine function of the loop iterator $i$,
$$s_{d,j}(i) = a \cdot i + b,$$
where $a, b \in \mathbb{N}$. This ensures that the compiler can determine the exact offset at which the target array is accessed. Recall from Sec. 4.1 that a distributed array declaration conveys to the compiler information about the partitioning granularity (i.e. the size of a tile). Based on this information, and on the offset described by the subscript function, every array access can be traced back to a specific tile and, finally, to the memory hosting that tile. If $t_{SIZE}$ is the size of a tile for the current distributed array, the ID $t_{ID}$ of the tile being accessed can be determined as follows:
$$t_{ID} = \left\lfloor \frac{s_{d,j}(i)}{t_{SIZE}} \right\rfloor$$
Since cyclic distribution is triggered by the compiler for distributed arrays, it is possible to determine the memory neighborhood $m_{ID}$ on which a given tile is mapped:
$$m_{ID} = t_{ID} \bmod N$$
In short, if we indicate with S the set of all the subscripts representing array accesses, we define a map function that associates each array access to a physical memory.
map : S → M
Statements within the loop body are walked, and every array access found is analyzed as discussed. The outcome of this analysis step is a multiset $M_i$, which describes the cardinality $n(m)$ of each memory $m$ accessed in the iteration $i$.
The memory with the highest cardinality is the one with the highest affinity to the current iteration, which is then assigned to the processor owning the memory.
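The analysis steps above can be sketched as a small routine that maps every affine access of an iteration to a memory and picks the most referenced one. This is a simplified illustration, not the GCC pass itself: access_t, map_access and iteration_affinity are hypothetical names, N_MEM is fixed to 4 for the example, tiles are assumed to be placed cyclically, and ties are resolved in favor of the lowest memory ID, which is an assumption of this sketch.

```c
#include <assert.h>

#define N_MEM 4  /* N: processors and memory neighborhoods (assumed value) */

/* One access with affine subscript s(i) = a*i + b on an array whose
   tiles hold tsize elements each. */
typedef struct {
    int a, b;
    int tsize;
} access_t;

/* map: subscript -> memory neighborhood, for cyclic tile placement. */
static int map_access(const access_t *s, int i)
{
    assert(s->tsize > 0);
    int offset = s->a * i + s->b;
    assert(offset >= 0);
    int t_id = offset / s->tsize;  /* tile being accessed */
    return t_id % N_MEM;           /* memory hosting that tile */
}

/* Memory with the highest access count n(m) in iteration i
   (ties resolved in favor of the lowest memory ID). */
static int iteration_affinity(const access_t *acc, int k, int i)
{
    int n[N_MEM] = {0};
    for (int j = 0; j < k; j++)
        n[map_access(&acc[j], i)]++;
    int best = 0;
    for (int m = 1; m < N_MEM; m++)
        if (n[m] > n[best])
            best = m;
    return best;
}
```

The value returned for an iteration is the ID of the processor whose work queue would receive that iteration.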
The described analysis has been implemented within the OpenMP expansion pass in the GCC 4.3 compiler, and is triggered by the use of the custom locality scheduling clause for the original OpenMP #pragma omp for loop parallelization directive. At the end of the analysis a queue containing the description of work with high locality is created for each processor and for every loop. The original loop code is transformed as shown in the code snippet below. As we will explain in Sec. 4.3, the queues are managed through head and tail pointers which reside in each processor's L1 SPM for fast inspection.
At the beginning of a loop the corresponding queue is hooked to local pointers through the omp_init_queues function. After each queue has been copied locally, the work is fetched from there with the omp_get_chunk function, which extracts part of the remaining work in the queue at each invocation. The size of the chunk of iterations extracted can be defined by the programmer. The range of queue elements to be processed upon this invocation is described as the lower and upper bounds (lb, ub) of a compiler-generated loop. The original loop iteration i is fetched from the queue through the omp_get_iteration function and passed to the loop body for execution.
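The structure of the transformed loop can be sketched as follows. The signatures of the runtime functions are not given in the text, so this sketch uses stand-in helpers (init_queue, get_chunk, get_iteration) that play the roles of omp_init_queues, omp_get_chunk and omp_get_iteration; the work_queue_t layout and the placeholder loop body are likewise assumptions.

```c
#include <assert.h>

/* Per-processor work queue of iteration descriptors (hypothetical layout). */
typedef struct {
    const int *iters;  /* original loop iterations assigned to this processor */
    int head;          /* next unprocessed queue element */
    int tail;          /* one past the last queue element */
} work_queue_t;

static void init_queue(work_queue_t *q, const int *iters, int n)
{
    assert(n >= 0);
    q->iters = iters;
    q->head = 0;
    q->tail = n;
}

/* Extract up to chunk elements: sets [*lb, *ub) and returns the count. */
static int get_chunk(work_queue_t *q, int chunk, int *lb, int *ub)
{
    assert(chunk > 0);
    int left = q->tail - q->head;
    int n = left < chunk ? left : chunk;
    *lb = q->head;
    *ub = q->head + n;
    q->head += n;
    return n;
}

static int get_iteration(const work_queue_t *q, int pos)
{
    return q->iters[pos];
}

/* Restructured loop: the body runs on iterations fetched from the queue
   instead of on a contiguous index range. */
static long run_loop(work_queue_t *q, int chunk)
{
    long acc = 0;
    int lb, ub;
    while (get_chunk(q, chunk, &lb, &ub) > 0)
        for (int e = lb; e < ub; e++) {
            int i = get_iteration(q, e);  /* original loop iterator */
            acc += i;                     /* placeholder for the loop body */
        }
    return acc;
}
```

The outer while loop corresponds to the repeated chunk extraction, and the inner for loop is the compiler-generated loop over the (lb, ub) range.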
Profile-Based Analysis
We now enlarge the scope of applicability of our approach to benchmarks containing non-statically analyzable array accesses. Our compiler can instrument the program so as to collect information about which memory is mostly referenced within each iteration during a profile run of the program. If the custom -fomp-profile flag is given, the compiler emits instructions that generate a trace of the array accesses. These are captured at runtime by a script, which collects them into per-iteration access descriptors, and are passed to a queue generator. The queues are described as a standard C array within a header file, which is later included for compilation during the second program run and linked to our enhanced OpenMP runtime library. The entire flow for the profile-based locality-aware parallelization is shown in Fig. 5. 
Runtime Support for Work Stealing
As introduced in the previous section, locality-aware loop parallelization is based on work queues that describe which iterations are assigned to each processor at a given loop. Every processor fetches its assigned iterations from these queues for execution in chunks. If an attempt to extract work from a local queue fails, meaning that either no iterations were assigned to the processor by the locality-aware parallelization pass or all pre-assigned iterations have been processed already, then a work stealing policy may be triggered so that an idle processor can transfer part of a remote queue to its local descriptor and continue working.
We allow the programmer to enable this kind of scheduling by associating the custom stealing scheduling clause with an OpenMP loop. Our compiler restructures the original loop as two nested loops, a work loop and a steal loop. Each processor owns two local queues, a Non-Stealable Queue (NSQ) containing the set of iterations being currently processed, and a Stealable Queue (SQ), visible to other stealers. From an implementation point of view this double-queue system leverages a single multi-indexed memory region (the work queue), where head and tail pointers to stealable/non-stealable elements are updated through lock-protected operations. The work queue resides in local main (DRAM) memory, whereas control pointers are allocated on SPMs for fast inspection, as shown in Fig. 6.
Each processor attempts to fetch work from the local queue through the omp_get_local_chunk function, which sets lower and upper bounds for the work loop. The function returns the size of the extracted work chunk. In case the chunk size is zero, no local work is left to do, and thus a steal operation is attempted. Since we never move array data, allowing a processor to steal work from other cores breaks the locality contained in the original work assignment. In case of memory-bound parallel loops, allowing an idle processor to steal work from far away processors is likely to significantly degrade performance due to the high number of costly remote accesses. However, stealing only from nearby processors may still be beneficial. Based on this rationale, we implemented a stealing policy in which the stealer can only fetch work from processors within a given distance. The programmer can annotate a maximum steal range (specified as the maximum allowed number of hops to look for stealable work) to the schedule(stealing) clause. Our stealing policy is implemented within the omp_steal library function.

          /* Determine distance between stealer and victim */
          int distance = LUT(i, pid);
          /* Current victim is farther than maximum allowed steal distance */
          if (distance > range || !distance)
            continue;
          int chunk = omp_get_remote_chunk(i, lb, ub);
          if (chunk)
            return chunk;
        }
      }
      return -1;
    }

Upon entrance into the function, each processor annotates in the shared variable local_done the information that it has no more work in its local queue. A stealer then continuously considers other processors as possible victims. First, the distance (in number of hops) between the stealer and the victim is looked up in a lookup table (LUT). If the distance is out of the allowed range the victim is discarded, otherwise it is a good candidate for the steal operation, which is triggered by a call to the omp_get_remote_chunk function. The loop continues until every processor has entered the omp_sleep function at least once, thus signaling that no processor has local work left to perform.
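A self-contained sketch of a distance-bounded steal, in the spirit of the fragment above, is given below. Several simplifications are assumptions of this sketch and not the paper's implementation: a single scan over the victims replaces the loop on local_done, locking is omitted, a thief takes half of the victim's remaining work from the tail end, and the lookup table describes a hypothetical 2x2 mesh. The names queue_t, queues, get_remote_chunk and steal are stand-ins.

```c
#include <assert.h>

#define NPROC 4

/* Remaining work of each processor as a range [head, tail) of queue
   positions. The real runtime updates these pointers under a lock. */
typedef struct { int head, tail; } queue_t;

static queue_t queues[NPROC];

/* Hop distances on a 2x2 mesh (hypothetical lookup table). */
static const int LUT[NPROC][NPROC] = {
    {0, 1, 1, 2},
    {1, 0, 2, 1},
    {1, 2, 0, 1},
    {2, 1, 1, 0},
};

/* Take half of the victim's remaining work (rounded up) from the tail end. */
static int get_remote_chunk(int victim, int *lb, int *ub)
{
    queue_t *q = &queues[victim];
    int left = q->tail - q->head;
    int n = (left + 1) / 2;
    *ub = q->tail;
    *lb = q->tail - n;
    q->tail -= n;
    return n;
}

/* Scan victims within range hops of pid; return the size of the stolen
   chunk, or -1 if no victim in range has work left. */
static int steal(int pid, int range, int *victim, int *lb, int *ub)
{
    assert(pid >= 0 && pid < NPROC);
    for (int i = 0; i < NPROC; i++) {
        int distance = LUT[i][pid];
        if (distance > range || !distance)
            continue;  /* too far away, or the stealer itself */
        int chunk = get_remote_chunk(i, lb, ub);
        if (chunk) {
            *victim = i;
            return chunk;
        }
    }
    return -1;
}
```

A small steal range keeps stolen iterations close to their data at the price of leaving some processors idle; a larger range trades locality for balance.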
EXPERIMENTAL RESULTS
We describe in this section the experimental setup used to evaluate our programming framework and the results obtained.
The OpenMP-based programming framework with the proposed extensions was implemented within the GCC 4.3 compiler (GOMP). The runtime environment (libgomp) adopts an MPSoC-specific implementation [15] which relies on neither OS support nor thread libraries. Each OpenMP thread is pinned to a given processor based on its ID. The library code is executed by every core. At system startup the processor with the highest ID is designated as the master processor, and it is responsible for orchestrating parallel execution by synchronizing slave processors and pointing them to parallel code and shared data.
We implemented an instance of the 3D platform template presented in Sec. 3 within a SystemC full-system simulator. We simulate a 3D chip composed of three layers. The bottom layer hosts 16 processor tiles, while L2 memory stacks (16 MB each) reside on the two topmost layers, devoted respectively to the shared and "private" segments. On-tile L1 memory features a 16 KB scratchpad memory (SPM) plus separate data (4 KB) and instruction (8 KB) caches. It is worth recalling here that caches only manage private data, which prevents any coherence issues. Fig. 7 shows how PEs are placed on the CMP die. Processor IDs increase following the pattern indicated by the arrow. Because of OpenMP's master-slave execution paradigm, the program starts as a single thread of execution. All data declared outside the scope of parallel constructs is by default allocated in the memory neighborhood of the master core. Therefore, slave cores will sometimes need to communicate through this memory stack. To minimize the effect of the NUMA latencies seen by different slaves, the master core is kept in a central position on the CMP die.
The memory access time depends on the transaction path. Accesses to the local SPM are subject to only 1 cycle of latency. For remote accesses the cost depends on the internal memory interface latency (≈ 2 cycles), the number of hops to the target memory controller, the contention level on the network, the neighborhood interface latency (≈ 2 cycles), and the neighborhood memory latency (1 cycle for SPM, ≈ 5 cycles for 3D-stacked DRAM).
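To make the cost components above concrete, the following sketch estimates the zero-load latency of a remote access. The text lists the components but does not state how they combine, so summing them and charging a fixed per-hop cost are assumptions, and contention is ignored entirely. The function name remote_latency, the hop_cost parameter and the constant names are hypothetical.

```c
#include <assert.h>

enum mem_kind { MEM_SPM, MEM_DRAM };

/* Approximate figures quoted in the text, in cycles. */
#define MEM_IF_CYCLES 2  /* internal memory interface */
#define NBH_IF_CYCLES 2  /* neighborhood interface */
#define SPM_CYCLES    1  /* neighborhood SPM */
#define DRAM_CYCLES   5  /* neighborhood 3D-stacked DRAM */

/* Zero-load estimate of a remote access, in cycles.
 * Assumption: the components simply add up and every hop costs
 * `hop_cost` cycles; network contention is not modeled. */
static int remote_latency(int hops, int hop_cost, enum mem_kind kind)
{
    assert(hops >= 0 && hop_cost >= 0);
    int mem = (kind == MEM_SPM) ? SPM_CYCLES : DRAM_CYCLES;
    return MEM_IF_CYCLES + hops * hop_cost + NBH_IF_CYCLES + mem;
}
```

Under these assumptions the network term quickly dominates: with a per-hop cost of 10 cycles, a DRAM access two hops away is estimated at 29 cycles, 20 of which are spent traversing the network.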
The network on chip on the CMP die is based on the ST STBus protocol. The zero-load NoC latencies for remote accesses depend on the number of traversed hops, and are modeled as shown in Fig. 8. L is a parameterizable value which represents the cost of traversing a single hop. For example, if L = 10, then in the absence of contention, accessing data in the memory neighborhoods of processors 4, 14 or 10 from processor 12 is subject to a latency of 20 cycles. If interconnect resources are shared with other concurrent transactions, the latency will be higher. To test the effectiveness of our techniques we consider a synthetic benchmark (synth) and 3 representative application kernels, namely IDCT, Luminance Dequantization and Matrix Multiplication.
IDCT and Luminance Dequantization
Results for the IDCT and Luminance Dequantization (LD) kernels are shown in Fig. 9. These two kernels are extracted from a JPEG decoder and operate on an image composed of 600 DCT blocks. They differ mainly in the access pattern, which is regular for LD and scattered for IDCT, and in the number of accesses performed within each iteration, which is much larger for IDCT (384 vs. 64). Arrays are partitioned with block distribution, but since the number of memory neighborhoods does not evenly divide their size, some processors own larger tiles than others. Similarly, the number of processors does not evenly divide the number of loop iterations (i.e., DCT blocks), so static scheduling cannot exactly capture the affinity between iterations and tiles. For this reason, the performance of static scheduling decreases as L increases. Likewise, even though a small amount of imbalance is present in this loop, dynamic scheduling is overwhelmed by the cost of the remote references generated by this locality-agnostic parallelization scheme. In contrast, our locality scheduling exactly establishes the affinity between an iteration and the corresponding array tile, as confirmed by the fact that its execution time does not change for varying L. For a realistic value of L = 10, in the IDCT kernel locality scheduling is 21% faster than static scheduling, and stealing is up to 40% faster than static and 50% faster than dynamic. In the LD kernel locality is 50% faster than static, and stealing is up to 56% faster than static and 69% faster than dynamic.
Matrix Multiplication
Results for the Matrix Multiplication kernel are shown in Fig. 11. This benchmark is amenable to static loop parallelization, which generates the iteration space partitioning shown in the left plot of Fig. 10. Block distribution places array tiles in memory so as to exactly match the threads' footprint (see the middle plot of Fig. 10). We also want to force a worst-case array distribution for static loop scheduling, such as the vertical blocking shown in the right plot of Fig. 10. The plot in Fig. 11 shows a very interesting result. As expected, block distribution combined with static scheduling delivers excellent performance because of the high number of local accesses and the low scheduling overhead. For L = 10 our locality scheduling performs equally well. Work stealing cannot do any better since the loop is highly balanced, but it does not degrade performance much, which indicates that our techniques introduce very low overhead. The dynamic schedule, which lacks any locality awareness, significantly degrades performance as L increases. With the unfavorable vertical block data distribution scheme, static scheduling worsens. Our locality scheduling is insensitive to the data distribution scheme applied, and thus delivers the best results. The stealing schedule does slightly worse since, as already pointed out, the loop is well balanced, so dynamic techniques are not beneficial and only add overhead.
Synth
The aim of this synthetic benchmark is to force the generation of work queues which describe a very unbalanced loop schedule (i.e., most iterations are assigned to a single processor). In this way we can study the effect of the range parameter on our stealing techniques. We explore how the performance of work stealing changes when we allow work to be stolen from N-hop distant processors, with N ∈ {1, 2, 4, 6}. In this benchmark an array of 16K elements is block distributed in 16 tiles of 1024 elements. A parallel loop with 1074 iterations accesses the first 1074 elements of the array, 1024 of which are contained in the first tile and the remaining 50 in the second tile. It is therefore clear that the locality-aware parallelization assigns 1024 iterations to the first processor, 50 to the second processor, and none to the other processors. Results for this experiment are shown in Fig. 12. Unsurprisingly, the locality scheduling performs poorly, since most processors are idle. When L = 15 the high cost of remote accesses renders static scheduling even slower. The dynamic schedule assigns iterations to ready processors without regard to how costly the resulting communication will be. For this reason its performance degrades as L increases. The stealing schedule provides the best results, since it starts from an original mapping with high locality, and then manages the imbalance by dynamically redistributing the workload. The best results are achieved when range=2. Recalling that all iterations are originally assigned to processors 0 and 1, and considering the position of these processors in the CMP layout (cf. Fig. 7), it is evident that for smaller values of range only two processors are allowed to steal from processor 0. On the other hand, when bigger values of range are allowed the cost of remote accesses outweighs the benefits of stealing.
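The imbalance produced by locality-aware parallelization on this benchmark can be reproduced with a few lines of C. The sketch assumes that iteration i accesses only element i of the array, so the owner of an iteration is simply the tile holding that element; the function name locality_assign and the constant names are hypothetical.

```c
#include <assert.h>

#define NTILES     16    /* one tile per processor */
#define TILE_ELEMS 1024  /* elements per tile (16K-element array) */

/* Count how many iterations a locality-aware schedule gives to each
 * processor. Assumption: iteration i accesses only element i, so it is
 * assigned to the processor owning tile i / TILE_ELEMS. */
static void locality_assign(int n_iters, int counts[NTILES])
{
    assert(n_iters >= 0 && n_iters <= NTILES * TILE_ELEMS);
    for (int p = 0; p < NTILES; p++)
        counts[p] = 0;
    for (int i = 0; i < n_iters; i++)
        counts[i / TILE_ELEMS]++;
}
```

Calling locality_assign with 1074 iterations yields 1024 iterations for processor 0, 50 for processor 1 and none for the remaining 14 processors, which is the starting point that work stealing then has to rebalance.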
CONCLUSION AND FUTURE WORK
In this paper we investigated the integration of a locality-based approach to loop parallelization with runtime support for work-stealing techniques as a convenient programming abstraction for 3D integrated embedded manycore platforms. Our compilation strategy is based on an initial analysis which associates each iteration with the processor that owns the referenced data. If this analysis cannot be carried out at compile time, we exploit profile information to achieve the same result. At runtime idle processors are allowed to steal part of the remaining work from remote queues in a locality-aware manner, thus achieving balanced execution and locality of references. Results on a set of data-intensive kernels underline the effectiveness of locality-aware parallelization. Work stealing appears to be less beneficial on these benchmarks, where, however, computation among parallel threads is inherently balanced. We thus believe that stealing would allow much more significant speedups for imbalanced applications, which we will consider in future work. Furthermore, it is important to better understand the trade-offs between the benefits and costs of stealing techniques. We plan to characterize in detail the key parameters that influence these trade-offs, so that a compiler can automatically determine when stealing is beneficial, and within which maximum distance (range).
Moreover, we are implementing techniques to determine dynamically (i.e., at runtime) the affinity between a given iteration and a target memory neighborhood. This solution will make it possible to apply our techniques even when a loop is not statically analyzable and profiling cannot be exploited.
ACKNOWLEDGMENTS
This work was supported by projects FP7 PRO3D (248776), SHARE (224170) and JTI SMECY (ARTEMIS-2009-1-100230), funded by the European Community.
