Abstract-Modern embedded systems include hundreds of cores. Because of the difficulty in providing a fast, coherent memory architecture, these systems usually rely on noncoherent, non-uniform memory architectures with private memories for each core. However, programming these systems poses significant challenges. The developer must extract large amounts of parallelism, while orchestrating communication among cores to optimize application performance. These issues become even more significant with irregular applications, which present data sets difficult to partition, unpredictable memory accesses, unbalanced control flow and fine grained communication. Hand-optimizing every single aspect is hard and time-consuming, and it often does not lead to the expected performance. There is a growing gap between such complex and highly-parallel architectures and the high level languages used to describe the specification, which were designed for simpler systems and do not consider these new issues.
I. I Modern Embedded Systems (ESs) are based on MultiProcessor Systems-on-Chip (MPSoCs). Architectures such as Tilera's TilePRO 64 or TileGX, Kalray's MPPA or Adapteva's Epiphany integrate hundreds or thousands of simple cores, interconnected together through fast Networkson-Chip (NoCs). Because of the large amount of cores, these architectures usually exploit distributed memory hierarchies, with fast, non-coherent scratchpads connected to each core. This allows increasing the number of cores, while limiting circuit complexity and power consumption. Nevertheless, it also makes these architectures much more complex to program [1] . The developer can (and should) exploit an unprecedented amount of parallelism, while guaranteeing load balancing to reach the target performance. Even coherent manycores, such as the Tilera processors, employ distributed cache designs, which provide significant performance increases by exploiting localization.
Many emerging classes of embedded applications, such as computer vision, machine learning and data mining problems, are irregular [3]. They exploit dynamic, linked data structures such as graphs, unbalanced trees or unstructured grids. Such applications are inherently parallel, since they potentially perform parallel computations for each element of the data structures. Thus, they may seem a good fit for modern massively parallel MPSoCs. However, the same data structures are subject to unpredictable, fine-grained accesses, and are very difficult to partition in a balanced way. They have almost no locality, and present high synchronization intensity. Conversely, modern MPSoCs rely on regular computation and locality exploitation to reach peak performance. Developing irregular applications on them poses complex challenges and requires significant programming effort. Developers usually exploit high-level languages for describing the specification in ES design. These C-like languages do not consider many of the issues introduced by manycore architectures. For example, they do not properly expose the inherent parallelism of the specification, which is critical to achieve the expected performance for architectures relying on hundreds of weak cores. Furthermore, they generally assume shared memory architectures, thus considering a common and coherent address space across cores. Alternatively, they assume accelerator-like architectures, thus completely offloading the computation to custom processors, copying data with bulk transfers only at the beginning and at the end of the computation. Even existing parallel language extensions for shared memory multiprocessor systems, such as OpenMP, are not sufficient. In general, they consider applications and architectures with limited amounts of parallelism. Furthermore, they do not usually expose locality, and thus do not cope well with non-uniform or distributed memory systems.
The complex behavior of irregular applications further complicates the problem, making current high-level languages ineffective and forcing developers to spend significant amounts of time in optimizations.
In this paper we introduce YAPPA (Yet Another Parallel Programming Approach), a compilation framework, based on the LLVM compiler, for the automatic parallelization of irregular applications on modern MPSoCs. Considering the features of irregular workloads, we briefly introduce an efficient parallel programming approach for these applications on non-uniform, non-coherent distributed memory systems, such as the latest generation manycore embedded processors. Our approach consists of a runtime library that enables a global address space across all the scratchpad memories of the system. It provides latency tolerance through lightweight software multithreading while supporting a simple fork/join parallel control model for dynamic parallelism management. The high performance community explored some of these solutions with both software [6] and hardware approaches [5]. We propose a set of compiler transformations for the automatic parallelization, which can reduce development and optimization effort, and a set of transformations for improving the performance of the resulting parallel code, focusing on irregular applications. We implemented these transformations in LLVM and evaluated a first prototype of the framework on a common irregular kernel (graph Breadth First Search).
The remainder of this paper is organized as follows. Section II describes some related work. The proposed compiler transformation for the automatic generation of parallel code is presented in Section III. In Section IV we evaluate the performance of the parallel code generated and the efficiency of the proposed optimizations. Finally, Section V concludes the paper.
II. R W Embedded systems designers typically use C-like languages to describe the specification, mainly because of their simplicity. The main strength of these languages is the hiding of the architectural details. However, they do not properly fit the increasingly complex and parallel manycore architectures, because they abstract too many important details for maximizing the performance of these systems. Several extensions for better expressing parallelism in standard languages have been proposed. The most well known example is OpenMP [2] , which extends C/C++ and Fortran programs with task-parallel language constructs, that developers insert in the code in the form of pragmas and API calls. OpenMP targets shared memory multiprocessors with uniform memory architectures, abstracting all communication operations. The model works well for architectures with few cores, but does not provide adequate scalability for modern MPSoCs, which relies on a multitude of cores, tightly interconnected to private, non coherent scratch-pad memories to exploit locality. Recently, modifications to OpenMP compilers that enhance the support of these architectures have been proposed [8] . Nevertheless, they still require knowledge of the underlying architecture, to trigger data partitioning.
At the other end of the spectrum, distributed memory systems are naturally programmed with message passing programming models. These models can effectively map the architecture of novel embedded many-core processors, which exploit NoCs for inter-core communication. The Message Passing Interface (MPI) is the de facto standard for message passing, especially for high performance systems. Embedded systems usually implement thin message passing layers that directly interface with the NoCs. There are various efforts that look at supporting the MPI stack on embedded systems [17], reducing its resource requirements. With message passing programming models, the programmer usually exploits Single Program, Multiple Data (SPMD) control models, where, at the beginning of the application, each core is associated with a process that operates on its own chunk of data. Communication usually happens only in precise application phases. Developing irregular applications with these programming models is not easy. Irregular applications employ datasets that are difficult to partition, thus shared memory abstractions are preferred. They also present data accesses to unpredictable locations, and dynamically spawn new concurrent activities as the data is explored [4].
The high performance computing community has introduced the Partitioned Global Address Space (PGAS) programming model as an approach to support a shared memory abstraction over distributed memory machines without neglecting data or thread locality. The PGAS programming model is implemented in languages such as Unified Parallel C (UPC) [6], Co-Array Fortran [9], the Global Arrays (GA) Toolkit [11], X10 [13] and Chapel [12], among several others. These programming models rely on communication run-time libraries which manage data exchange between distributed address spaces, such as ARMCI [10] and GASNet [7]. The UPC compiler automatically transforms UPC programs (C programs with parallel annotations) into C programs that employ the underlying communication layers. However, it still targets an SPMD control model, which makes it inadequate for irregular applications. X10 and Chapel, on the other hand, support asynchronous task execution and the possibility to reason about data locality. Although they provide several concepts replicable on embedded systems, these PGAS languages and libraries are too complex for MPSoCs. OpenSHMEM [14] is a specification standardizing an increasing number of implementations of SHMEM, a communication library that uses one-sided communication and a global address space. TSHMEM [15] is a lightweight SHMEM implementation for Tilera processors. However, TSHMEM also still implements an SPMD control model and misses some of the features that may enable more efficient execution of irregular applications. Our objective is to introduce a set of compiler transformations for a global address space library optimized for irregular applications that, similarly to TSHMEM, maps on MPSoCs.
Besides a shared memory abstraction, other features that allow a more efficient execution of irregular applications are multithreading and the fork/join control model. Multithreading allows tolerating, rather than reducing, the latencies for accessing data, both in local and remote locations. The fork/join control model enables dynamic creation of fine-grained threads, allowing better exploitation of the inherent parallelism of the applications. In the high performance environment, these features can be found on the Cray XMT [5], which is the successor of the Tera MTA and Cray MTA-2 multithreaded supercomputers. These custom machines for irregular applications also provide a custom compiler, which extracts fine-grained threads from nested loops. Parallelism discovery happens semi-automatically through pragma annotations. We aim at designing a compiler infrastructure for irregular applications with functionalities similar to those of the Cray XMT's compiler. However, we target embedded many-cores with private scratchpads, rather than large scale machines. Furthermore, a run-time, rather than hardware components, provides the features that enhance execution of irregular applications. Finally, we plan to limit developer intervention in parallelism discovery and code optimization as much as possible.

Table I: The LGMT Library API exposed to the programmer/compiler

| Primitive | Description |
|---|---|
| gmt_data_t gmt_alloc(uint64_t size, allocPolicy_t allocPolicy) | Allocate space in the virtualized global address space with the specified allocation policy (local, remote, partitioned) |
| void gmt_free(gmt_data_t gmtArray) | Free space in the virtualized global address space |
| void gmt_waitCommand() | Synchronization primitive: wait for completion of previous non-blocking operations |
| void gmt_put(gmt_data_t gmtArray, uint64_t offset, const void *data, uint64_t size) | Blocking-write a local array in the virtualized global address space starting from the specified offset |
| void gmt_put_NB(gmt_data_t gmtArray, uint64_t offset, const void *data, uint64_t size) | Non-blocking version of gmt_put |
| void gmt_putValue(gmt_data_t gmtArray, uint64_t offset, const void *data, uint64_t size) | Blocking-write a value in the virtualized global address space starting from the specified offset |
| void gmt_putValue_NB(gmt_data_t gmtArray, uint64_t offset, const void *data, uint64_t size) | Non-blocking version of gmt_putValue |
| void gmt_get(gmt_data_t gmtArray, uint64_t offset, void *data, uint64_t size) | Blocking-read a portion of an array in the virtualized global address space starting from the specified offset and copy it into a local array |
| void gmt_get_NB(gmt_data_t gmtArray, uint64_t offset, void *data, uint64_t size) | Non-blocking version of gmt_get |
| int64_t gmt_atomicAdd(gmt_data_t gmtArray, uint64_t offset, int64_t value, uint8_t size) | Perform atomic addition between the specified value and the specified array in the virtualized global address space, starting from the specified offset |
| int64_t gmt_atomicCAS(gmt_data_t gmtArray, uint64_t offset, int64_t oldValue, int64_t newValue, uint8_t size) | Perform atomic Compare-And-Swap. Use the specified array in the virtualized global address space (compare), starting from the specified offset. Then write the specified value in the virtualized global address space (swap) |
| void gmt_parFor(uint32_t nThr, uint32_t chSize, void (*func)(int, void *), void *args, uint32_t argsSize) | Execute the specified function, corresponding to the loop body, in parallel |
III. P A This section introduces our compiler-based methodology to automatically generate parallel code for irregular applications on scratchpad-based MPSoCs, starting from a highlevel specification in C/C++. We briefly introduce LGMT (Lightweight Global Memory and Threading library), the run-time library that enables a more efficient execution of irregular applications on distributed memory architectures, by addressing some of their main issues. We then present the YAPPA (Yet Another Parallel Programming Approach) compiler, mapped on top of LGMT. Finally, we show an example of parallelization.
A. LGMT Library
LGMT is a lightweight run-time library that enables fundamental features for irregular applications on distributed memory MPSoCs. It takes inspiration from solutions developed for high performance computing, optimizing them for embedded on-chip many-cores. First, similarly to PGAS programming models for HPC clusters, LGMT enables a global address space across the distributed memories of the system. This allows developing the application without partitioning the data set. Second, it implements lightweight software multithreading, which allows tolerating latencies for accessing data at remote locations. When a core executes a task that issues an operation to a remote memory location, it switches to another task while the memory operation completes, hiding the access latency with other computation. This approach obviously requires cores with direct memory access (DMA) engines that support remote DMA. Thanks to tightly interconnected scratchpads, task-switching is still effective in hiding latencies even for relatively fast NoCs. Third, LGMT implements a fork/join control model. With respect to SPMD control models, typical of message passing, or PGAS programming models, this model better copes with the large amounts of fine-grained and dynamic parallelism of irregular applications. Table I shows the primitives to interface with LGMT.
LGMT works by manipulating arrays stored in the global memory space. Among the features offered by LGMT, the memory allocation primitives allow expressing locality. A thread can allocate data partitioned (PARTITION) among all the memories of the system, on the memory of the core currently executing the thread itself (LOCAL), or remotely (REMOTE) on all the other memories, except the current one. Put and get operations allow manipulating data in the LGMT memory space, by accessing the requested number of elements at the specified offsets of the arrays. There are blocking and non-blocking operations, with the related wait operation. To maintain a low overhead without assuming complex associative DMA designs, the wait operation is not associated every time with a specific non-blocking operation, but rather waits until all previous operations have been completed.
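The semantics of the memory primitives described above can be illustrated with a small single-process mock. This is only a sketch under stated assumptions: the `mock_*` names, the fixed-size pending queue, and the use of a plain heap buffer as the "global" array are hypothetical and are not part of LGMT. The point it shows is that the wait operation is not tied to one specific non-blocking operation, but completes all the outstanding ones.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-process stand-in for an array in the global space. */
typedef struct { uint8_t *bytes; uint64_t size; } mock_data_t;

/* One queued non-blocking read: copy `size` bytes from the array to `dst`. */
typedef struct { const uint8_t *src; void *dst; uint64_t size; } mock_pending_t;

#define MOCK_MAX_PENDING 16
mock_pending_t mock_pending[MOCK_MAX_PENDING];
int mock_pending_count = 0;

mock_data_t mock_alloc(uint64_t size) {
    mock_data_t d = { calloc(size, 1), size };
    assert(d.bytes != NULL);
    return d;
}

void mock_free(mock_data_t d) { free(d.bytes); }

/* Blocking write of a local buffer at the given byte offset. */
void mock_put(mock_data_t d, uint64_t offset, const void *data, uint64_t size) {
    assert(offset + size <= d.size);
    memcpy(d.bytes + offset, data, size);
}

/* Non-blocking read: only records the request, no data moves yet. */
void mock_get_nb(mock_data_t d, uint64_t offset, void *data, uint64_t size) {
    assert(offset + size <= d.size);
    assert(mock_pending_count < MOCK_MAX_PENDING);
    mock_pending[mock_pending_count++] =
        (mock_pending_t){ d.bytes + offset, data, size };
}

/* Wait: completes every outstanding non-blocking operation at once. */
void mock_wait(void) {
    for (int i = 0; i < mock_pending_count; i++)
        memcpy(mock_pending[i].dst, mock_pending[i].src, mock_pending[i].size);
    mock_pending_count = 0;
}

/* Blocking read, expressed as a non-blocking read followed by a wait. */
void mock_get(mock_data_t d, uint64_t offset, void *data, uint64_t size) {
    mock_get_nb(d, offset, data, size);
    mock_wait();
}

/* Two non-blocking reads covered by a single wait; returns their sum. */
uint64_t mock_demo(void) {
    mock_data_t arr = mock_alloc(4 * sizeof(uint64_t));
    uint64_t values[4] = { 10, 20, 30, 40 };
    uint64_t a = 0, b = 0;
    mock_put(arr, 0, values, sizeof(values));
    mock_get_nb(arr, 1 * sizeof(uint64_t), &a, sizeof(a));
    mock_get_nb(arr, 3 * sizeof(uint64_t), &b, sizeof(b));
    mock_wait();
    mock_free(arr);
    return a + b;
}
```

In this mock, `mock_demo()` reads the second and fourth elements with two non-blocking requests and a single wait. Allocation policies and remote transfers are deliberately left out.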
LGMT also provides primitives for atomic operations, such as addition and compare-and-swap. Parallelism is identified through a parallel-for construct, which dynamically spawns one or more iterations of a parallel loop as independent tasks. There are allocation policies also for tasks. When LGMT spawns new tasks, it can distribute them uniformly across the cores (PARTITION), it can map them on the same core that encountered the parallel-for construct (LOCAL), or it can map them remotely (REMOTE) on all the other cores except the current one. Once assigned to a core, tasks are stored with their contexts in local queues allocated in the scratchpad memory, and do not migrate. Cores can access their own private scratchpads in a few cycles, which makes software multithreading rather efficient. Whenever a parent task spawns new children, it suspends its execution until the termination of all the children. This approach avoids expensive task termination checks. Furthermore, if a local task queue is full, and the run-time identifies a request to spawn new tasks, the execution continues in the current context. Task creation, memory allocation and memory operations are commands that are sent, routed and received through message passing operations, which are trivially mapped onto the native, lightweight communication layers of any NoC-based MPSoC. Tasks are described as functions which, similarly to pthreads, take a structure of parameters in input.
LGMT automatically provides an iteration identifier, which acts as task identifier, as an argument of the function. This identifier allows computing the offsets for accessing the data in the global address space. A developer can directly use LGMT to implement irregular applications by explicitly calling the API primitives. However, this means parallelizing the code and carefully manipulating the data in the global address space with the put and get operations.

YAPPA builds on top of this runtime system. It automatically transforms applications written in pseudo-sequential C (with only synchronization constructs, where required) with a shared memory abstraction into parallel code that exploits the runtime's primitives to manipulate data in the global address space. YAPPA performs the transformations in two steps: data management and loop parallelization. In the first step, it analyzes the code to identify potentially shared data to be allocated in the global address space. Then, it redefines the type of these data as gmt_data_t, inserting the appropriate memory allocation primitives. At the same time, it transforms all the accesses to these data into LGMT memory operations: gmt_get for reads, gmt_put for writes of arrays in LGMT data structures, and gmt_putValue for writes of scalar values. At this stage, YAPPA also performs alias analysis on the global memory accesses. It identifies independent memory accesses, such as accesses to global data structures which are only read or written inside the loop body, or to global data structures which are accessed on different elements at each iteration and do not have loop-carried dependences. In these cases, YAPPA tries to move as many independent memory operations as possible to the beginning of the loop, substituting blocking memory operation primitives with their equivalent non-blocking versions and inserting the related wait before the first use of a value. We call this transformation unblocking of memory accesses.
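To give an intuition of why unblocking helps within a single task, the following sketch compares the two forms under a deliberately simple cost model. Everything here is an assumption made for illustration: the constants `REMOTE_LATENCY` and `ISSUE_COST`, the `sim_*` names, and the idea that outstanding operations overlap perfectly. It ignores task switching, which LGMT uses to hide latency across tasks, and it does not measure the real library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical costs, in cycles: issuing a command and completing it. */
enum { ISSUE_COST = 1, REMOTE_LATENCY = 100 };

typedef struct {
    uint64_t now;             /* current cycle of the task */
    uint64_t last_completion; /* cycle at which the last pending op ends */
} sim_clock_t;

/* Blocking access: the task stalls for the whole latency. */
void sim_get_blocking(sim_clock_t *c) {
    c->now += ISSUE_COST + REMOTE_LATENCY;
}

/* Non-blocking access: the task only pays the issue cost now. */
void sim_get_nb(sim_clock_t *c) {
    c->now += ISSUE_COST;
    uint64_t done = c->now + REMOTE_LATENCY;
    if (done > c->last_completion)
        c->last_completion = done;
}

/* Wait: advance to the completion of all outstanding operations. */
void sim_wait(sim_clock_t *c) {
    if (c->last_completion > c->now)
        c->now = c->last_completion;
}

/* Original loop body: n independent blocking reads, one after the other. */
uint64_t cycles_blocking(unsigned n_gets) {
    sim_clock_t c = { 0, 0 };
    for (unsigned i = 0; i < n_gets; i++)
        sim_get_blocking(&c);
    return c.now;
}

/* Unblocked loop body: the reads are issued together at the beginning,
 * followed by one wait before the first use of a value. */
uint64_t cycles_unblocked(unsigned n_gets) {
    sim_clock_t c = { 0, 0 };
    for (unsigned i = 0; i < n_gets; i++)
        sim_get_nb(&c);
    sim_wait(&c);
    return c.now;
}
```

With three independent reads, the model charges 303 cycles to the blocking form and 103 cycles to the unblocked form, because the three latencies overlap instead of adding up.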
In the second step, YAPPA effectively performs the parallelization, by creating the tasks. YAPPA extracts the loop bodies, and generates the task functions, which have two incoming arguments. The first argument is the iteration identifier, which also works as a task identifier. The actual parameter passed to the function is the iteration index, when the chunk size (the number of iterations that each task executes) is equal to 1, or the first iteration index of the chunk otherwise. The second argument of the task function is a structure, which contains all the variables that are read inside the loop but defined outside. These also include references to LGMT global data structures. YAPPA performs dependence analysis to identify these variables. YAPPA inserts the allocation and the initialization of this structure into the code. All the variables that are defined outside the loop, but used only inside the loop, are localized. In other words, the definition is moved inside the loop body, which avoids passing as a parameter a variable that is dead at the loop exit. Because of the distributed memory architecture, passing parameters to tasks corresponds to memory copies and data transfers, thus localizing variables potentially provides lower communication overheads. Once YAPPA has created the task function, it computes the chunk size. YAPPA also accepts chunk sizes as a command line option. In the first prototype of the compiler, we exploit this option to allow hand tuning of this parameter according to the performance provided by the parallelized program. The application developer can quickly explore several chunk size alternatives in reasonable time through simple compilation scripts. We are currently implementing the support for automatically computing the chunk size according to the irregularity level of each loop. Irregularity analysis will be available in future versions of the YAPPA compiler.
The transformation continues with the insertion of the instructions for computing the number of tasks at run-time. The number of tasks spawned by the run-time corresponds to the number of iterations divided by the chunk size. If the number of iterations is not an exact multiple of the chunk size, we add another task to execute the remaining iterations. Finally, the transformation concludes with the addition of the call to the gmt_parFor primitive, whose inputs are the computed number of tasks, the chunk size, the pointer to the task function extracted from the loop body, the pointer to the structure of the parameters and its size in bytes.
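The task-count computation and the shape of an extracted task function can be sketched as follows. This is a minimal illustration, not YAPPA's generated code: `mock_par_for` is a hypothetical sequential stand-in for the parallel-for primitive, and passing the chunk size and the iteration count through the argument structure is an assumption made so that a task can clip its last, partial chunk.

```c
#include <assert.h>
#include <stdint.h>

/* Number of tasks: iterations / chunk, plus one task for the remainder. */
uint32_t task_count(uint64_t iterations, uint32_t chunk_size) {
    assert(chunk_size > 0);
    uint64_t n = iterations / chunk_size;
    if (n * chunk_size < iterations)
        n++;
    return (uint32_t)n;
}

/* Hypothetical argument structure: variables read inside the loop but
 * defined outside it. */
typedef struct {
    const uint64_t *input;
    uint64_t iterations;
    uint32_t chunk_size;
    uint64_t sum; /* a real parallel version would need an atomic add */
} sum_args_t;

/* Task function extracted from the loop body. The first argument is the
 * first iteration index of the chunk executed by this task. */
void sum_task(uint64_t first_iter, void *raw_args) {
    sum_args_t *args = (sum_args_t *)raw_args;
    uint64_t end = first_iter + args->chunk_size;
    if (end > args->iterations)
        end = args->iterations; /* last chunk may be shorter */
    for (uint64_t i = first_iter; i < end; i++)
        args->sum += args->input[i];
}

/* Sequential stand-in for the parallel-for: runs the tasks one by one. */
void mock_par_for(uint32_t n_tasks, uint32_t chunk_size,
                  void (*func)(uint64_t, void *), void *args) {
    for (uint32_t t = 0; t < n_tasks; t++)
        func((uint64_t)t * chunk_size, args);
}

/* The transformed loop: compute the task count, then spawn the tasks. */
uint64_t chunked_sum(const uint64_t *input, uint64_t iterations,
                     uint32_t chunk_size) {
    sum_args_t args = { input, iterations, chunk_size, 0 };
    uint32_t n_tasks = task_count(iterations, chunk_size);
    mock_par_for(n_tasks, chunk_size, sum_task, &args);
    return args.sum;
}

/* Ten iterations with chunk size 4: three tasks, the last covers two. */
uint64_t chunked_sum_demo(void) {
    const uint64_t v[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    return chunked_sum(v, 10, 4);
}
```

For ten iterations and a chunk size of 4, `task_count` returns 3 and the third task executes only the two remaining iterations.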
YAPPA parallelizes nested loops by topologically ordering loops according to their nesting level, and by recursively running the parallelization pass on the output of the previous execution. In general, current common parallelizing compilers do not support parallelization of nested loops, because they only look for limited amounts of parallelism, and the target systems do not require large amounts of fine-grained tasks. On the other hand, a custom solution such as the Cray XMT compiler, which targets a massively multithreaded system, concentrates its analysis and parallelization efforts on nested loops. YAPPA performs loop normalization to simplify the computation of the number of tasks. In the first prototype of the compiler, we allow selecting the loops to parallelize through a command line option. The approach is equivalent to annotating parallel loops in the program with pragmas, as is common in solutions such as OpenMP. Nevertheless, our goal is to extend YAPPA so that it can autonomously and automatically select the loops to parallelize. We also underline that not parallelizing certain nested loops may provide options for further communication optimizations. YAPPA, in fact, also supports another optimization, dubbed block-hoisting. Whenever a loop reads a shared array, YAPPA translates these accesses to get operations. However, if the loop only reads scalar values sequentially from the array, one iteration after the other, and the loop is not parallelized, this only generates a sequence of very fine-grained operations inside the same task. Even if LGMT can tolerate data access latencies by switching to other tasks, there is still a benefit in aggregating as much communication as possible. In such a case, YAPPA hoists the scalar read operations from the loop and aggregates them in a single get operation, writing to an entire local subarray. The subarray is allocated in the local memory of the task (i.e., with a standard malloc or with a gmt_alloc with the LOCAL policy).
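The effect of block-hoisting can be sketched by counting how many get operations each form of the loop issues. The code below is an illustration under stated assumptions: `mock_get` and its call counter are hypothetical, the "global" array is an ordinary local buffer, and the loop body (a simple sum) stands in for whatever the real loop computes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter of the read operations issued to the global space. */
uint64_t mock_get_calls = 0;

/* Stand-in for a blocking get: copies `size` bytes from byte `offset`. */
void mock_get(const uint64_t *global, uint64_t offset, void *dst,
              uint64_t size) {
    memcpy(dst, (const uint8_t *)global + offset, size);
    mock_get_calls++;
}

/* Before: one fine-grained get per element read by the loop. */
uint64_t sum_range_naive(const uint64_t *g_edges, uint64_t cur,
                         uint64_t next) {
    uint64_t sum = 0;
    for (uint64_t i = cur; i < next; i++) {
        uint64_t e;
        mock_get(g_edges, i * sizeof(uint64_t), &e, sizeof(e));
        sum += e;
    }
    return sum;
}

/* After: the reads are hoisted into a single get that fills a local
 * subarray, allocated dynamically because the bounds are run-time values. */
uint64_t sum_range_hoisted(const uint64_t *g_edges, uint64_t cur,
                           uint64_t next) {
    uint64_t count = next - cur, sum = 0;
    if (count == 0)
        return 0;
    uint64_t *local = malloc(count * sizeof(uint64_t));
    assert(local != NULL);
    mock_get(g_edges, cur * sizeof(uint64_t), local,
             count * sizeof(uint64_t));
    for (uint64_t i = 0; i < count; i++)
        sum += local[i]; /* plain local accesses, no get needed */
    free(local);
    return sum;
}

/* Returns 1 if both forms agree and issue 4 gets and 1 get respectively. */
int hoisting_demo(void) {
    const uint64_t edges[6] = { 5, 7, 9, 11, 13, 15 };
    mock_get_calls = 0;
    uint64_t a = sum_range_naive(edges, 1, 5);
    uint64_t naive_calls = mock_get_calls;
    mock_get_calls = 0;
    uint64_t b = sum_range_hoisted(edges, 1, 5);
    return a == b && a == 40 && naive_calls == 4 && mock_get_calls == 1;
}
```

Both forms compute the same result; the hoisted one replaces four single-element transfers with one transfer of four elements.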
In our approach, loops eligible for parallelization are canonical loops without invoke instructions in their body. YAPPA executes LLVM's lower-invoke pass to transform invoke instructions into call instructions. The reason is that invoke instructions are a particular type of call instructions that the LLVM intermediate representation uses to handle exceptions in C++ applications. If a called function throws an exception, then the related invoke returns 0. Otherwise, it returns 1. According to the return value, the control flow continues with the normal execution or branches to a landing pad. Because loop bodies may contain one or more invoke instructions, and our parallelization scheme transforms loop bodies into parallel and asynchronous tasks distributed across different processing elements, supporting them would require nontrivial mechanisms to identify the tasks that generate exceptions and to establish a policy to handle the exceptions. Parallelization of C++ applications also requires supporting serialization and deserialization of objects when they are passed as task parameters. In the current YAPPA implementation, if a variable of an object is shared, then the entire object is allocated in the global address space. Future optimizations will selectively allocate in global memory only the shared variables, leaving the object and all the other variables in the private memory.
C. Example
Figure 1 shows the pseudocode of a simple, sequential queue-based Breadth First Search (BFS) algorithm. The algorithm explores a graph described in the Compressed Sparse Row (CSR) representation. The algorithm works as follows.

Figure 1: Pseudocode of the queue-based BFS.
     ...
  2: for (vId = 0; vId < Q_N; vId++) do
     ...
  6:   for (uint64_t i = curIdx; i < nextIdx; i++) do
  7:     uint64_t neighbor = gEdges[i];
  8:     if (gmt_atomicCAS(gMarked, neighbor * sizeof(uint64_t), 0, 1, sizeof(uint64_t))) then
  9:       gMarked ...
     ...
 17:   gQnext = 0
 18: end while

Queue gQ contains the vertices to explore in the current iteration, gQnext the vertices that will be explored in the subsequent iteration. gQ_N and gQnext_N respectively count the number of elements in queues gQ and gQnext. Initially, gQ only contains the root node of the graph. The algorithm loads the edge list of each vertex in the exploration queue. It does so by accessing the array of indices (gIdxs) that, for each vertex, contains the offset at which its edges are located in the array of edges (gEdges). Each element of the edge array contains the target vertex of the edge (a neighbor of the source vertex), following the CSR representation. The algorithm then evaluates each neighbor by checking the corresponding entry in the array of the marked vertices (gMarked). If the neighbor has not been previously explored (its marked status is 0), then the neighbor is marked and added to the queue containing the new vertices to explore in the next iteration (gQnext). Checking the marked vertices array and adding to the queue are operations that, when executing in parallel, require synchronization. The programmer must express them with atomic constructs. We use the compare-and-swap for the checking and the atomic addition for the synchronization. It is not required to specify that those data are shared (and thus allocated in the global address space). When the algorithm has explored all the vertices in gQ, gQnext becomes the new gQ and gQnext is reset together with its element counter. The procedure continues until a new iteration of the algorithm finds gQ empty, meaning that all the reachable vertices of the graph have been visited. Each iteration of the algorithm visits a level of the graph.
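For reference, the algorithm described above can be written as plain sequential C over a CSR graph. This is a sketch, not the code used in the paper: the function and variable names are hypothetical, the atomic operations are replaced by ordinary reads and writes (which is only valid sequentially), and a per-vertex level array is added so that the result can be inspected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Level-synchronous, queue-based BFS over a CSR graph.
 * idxs has n_vertices + 1 entries; the edges of vertex v are
 * edges[idxs[v]] .. edges[idxs[v + 1] - 1].
 * Fills level[] (-1 for unreached vertices) and returns the number of
 * vertices reached from root. */
uint64_t bfs_csr(uint64_t n_vertices, const uint64_t *idxs,
                 const uint64_t *edges, uint64_t root, int64_t *level) {
    assert(root < n_vertices);
    uint8_t *marked = calloc(n_vertices, 1);
    uint64_t *q = malloc(n_vertices * sizeof(uint64_t));
    uint64_t *q_next = malloc(n_vertices * sizeof(uint64_t));
    assert(marked != NULL && q != NULL && q_next != NULL);

    for (uint64_t v = 0; v < n_vertices; v++)
        level[v] = -1;

    uint64_t q_n = 1, visited = 1;
    int64_t depth = 0;
    q[0] = root; /* initially the queue only contains the root */
    marked[root] = 1;
    level[root] = 0;

    while (q_n != 0) {
        uint64_t q_next_n = 0;
        depth++;
        for (uint64_t v_id = 0; v_id < q_n; v_id++) {
            uint64_t vertex = q[v_id];
            for (uint64_t i = idxs[vertex]; i < idxs[vertex + 1]; i++) {
                uint64_t neighbor = edges[i];
                /* In parallel, this check needs a compare-and-swap. */
                if (!marked[neighbor]) {
                    marked[neighbor] = 1;
                    level[neighbor] = depth;
                    /* In parallel, the slot comes from an atomic add. */
                    q_next[q_next_n++] = neighbor;
                    visited++;
                }
            }
        }
        /* The next queue becomes the current one; its counter restarts
         * from zero at the top of the loop. */
        uint64_t *tmp = q;
        q = q_next;
        q_next = tmp;
        q_n = q_next_n;
    }

    free(marked);
    free(q);
    free(q_next);
    return visited;
}

/* Small graph: 0->1, 0->2, 1->3, 2->3, and an isolated vertex 4.
 * Returns 1 if the BFS from vertex 0 gives the expected levels. */
int bfs_demo(void) {
    const uint64_t idxs[6] = { 0, 2, 3, 4, 4, 4 };
    const uint64_t edges[4] = { 1, 2, 3, 3 };
    int64_t level[5];
    uint64_t visited = bfs_csr(5, idxs, edges, 0, level);
    return visited == 4 && level[0] == 0 && level[1] == 1 &&
           level[2] == 1 && level[3] == 2 && level[4] == -1;
}
```

The two places marked in the comments are exactly the operations that the text identifies as requiring synchronization once the outer loop runs in parallel.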
The proposed implementation of the BFS algorithm provides two potentially parallel for loops (lines 2-6). The typical parallel version of this queue-based implementation focuses on extracting parallelism from the outermost loop, because with the most complex graphs, after a few iterations, the exploration queues already offer abundant parallelism. YAPPA is able to parallelize both loops, extracting fine-grained parallelism while also allowing precise control of the task size. Nevertheless, in our example, we follow the common approach of parallelizing only the outermost loop. This also allows presenting how YAPPA's communication optimizations work.
Figure 2: Arguments data structure.
    typedef struct Args_t {
      gmt_data_t gMarked;
      gmt_data_t gIdxs;
      gmt_data_t gEdges;
      gmt_data_t gQ;
      gmt_data_t gQnext;
      gmt_data_t gQnext_N;
    } args_t;

The loop body uses the arrays gMarked, gIdxs, gEdges, gQ, gQnext, and the scalar gQnext_N, which are defined outside the loop. Thus, they all become arguments for the task. YAPPA redefines these arrays as gmt_data_t, because all the tasks access them. Figure 2 shows the complete arguments data structure. Reads of the elements in queue gQ, and in the arrays gIdxs and gEdges, are all on shared data structures, thus YAPPA translates them to gmt_get operations. The addition of a new vertex to gQnext is also on a shared data structure, thus it is converted to a gmt_put operation. We can also see the block-hoisting optimization. The loop on the edge list is not parallelized. However, this loop reads, for each vertex, sequential elements in the list (from curIdx to nextIdx with a simple iterator). So, YAPPA can aggregate all the reads in a single get operation. Because curIdx and nextIdx are not statically known, YAPPA must also dynamically allocate the array. For this example, it is allocated with a standard malloc in the task memory space, so that conversion of memory operations to gets and puts is not required to access it. Figure 3 shows how YAPPA transforms the outermost loop. The task function takes in input the iteration identifier iterId and the pointer to the structure of the parameters Args.
Figure 3: Task function generated from the outermost loop.
    void F(uint64_t iterId, void *Args) {
      args_t *args = (args_t *)Args;
      uint64_t vertex, curIdx, nextIdx;
      gmt_get(args->gQ, iterId * sizeof(uint64_t), &vertex, sizeof(uint64_t));
      gmt_get(args->gIdxs, vertex * sizeof(uint64_t), &curIdx, sizeof(uint64_t));
      gmt_get(args->gIdxs, (vertex + 1) * sizeof(uint64_t), &nextIdx, sizeof(uint64_t));
      uint64_t *neighbors = malloc((nextIdx - curIdx) * sizeof(uint64_t));
      gmt_get(args->gEdges, curIdx * sizeof(uint64_t), neighbors, (nextIdx - curIdx) * sizeof(uint64_t));
      for (uint64_t i = curIdx; i < nextIdx; i++) do
        ...

Figure 4 shows the parallelized version of the main routine in the BFS algorithm. We show (only for the array gIdxs, the others are equivalent) how the previous allocations (done with standard mallocs) of the shared data structures are converted to gmt_alloc operations. We can also see how the arguments are passed to the task (lines 4-10), the computation of the number of tasks (lines 11-13) and the call to the gmt_parFor primitive (line 15).

Figure 4: Parallelized main routine of the BFS.
  1: gmt_data_t gIdxs = gmt_alloc((vertex + 1) * sizeof(uint64_t));
  2: ...;
  3: while (Q_N != 0) do
  4:   args_t args;
  5:   args.gMarked = gMarked;
  6:   args.gEdges = gEdges;
  7:   args.gIdxs = gIdxs;
  8:   args.gQ = gQ;
  9:   args.gQnext = gQnext;
 10:   args.gQnext_N = gQnext_N;
 11:   uint64_t nThr = Q_N / chunkSize;
 12:   if (nThr * chunkSize < Q_N) then
 13:     nThr++;
 14:   end if
 15:   gmt_parFor(nThr, chunkSize, F, &args, sizeof(args))
 16:   gQ = gQnext;
 17:   gmt_get(gQnext_N, 0, &Q_N, sizeof(uint64_t));
 18:   gmt_put(gQnext_N, 0, 0, sizeof(uint64_t));
 19: end while
IV. E R
The main benefit of YAPPA is a reduction in the development time of irregular applications on platforms running LGMT. However, we also want to validate the effectiveness of its transformations. To do so, we selectively apply the transformations to the full queue-based BFS algorithm. We measure the overall performance of each version of the code produced by YAPPA by executing it on a system that emulates a manycore design with private scratchpads running LGMT.

The system consists of a quad-processor AMD Opteron 6176SE (codename "Magny Cours") with 256 GB of DDR3 RAM. Each processor hosts 2 dies, and each die features 6 cores and a shared L3 cache of 6 MB, for a total of 48 cores. A core includes a private 512 KB L2 cache and private instruction and data L1 caches of 64 KB each. The processors run at 2.3 GHz. Some of the cores of this system run a modified instance of LGMT, which allocates a private memory area, a private task queue and a private communication command queue, all pinned in the cache to emulate fast scratchpads. The other cores emulate the interconnection network by performing the data movement operations from one area to the other. We use up to 32 cores for processing and 16 for communication. The processing cores host up to 1024 fine-grained tasks.

Table II shows the performance, in seconds, of the BFS kernel on the emulation platform when applying different steps of the YAPPA compilation framework. The first column (Serial) shows the serial performance of the code (the code is very similar to the example, with atomic instructions removed). The second column (LGMT) shows the performance obtained by only applying loop parallelization for LGMT. The third column (LGMT-BH) shows the performance of the kernel when, besides loop parallelization, YAPPA also applies data management optimizations. In particular, for the BFS, the most important optimization is the block-hoisting of the accesses to the edge list. We parallelized only the outermost loop level of the benchmark.

Table II: BFS performance when applying the YAPPA compilation framework with LGMT on our emulation platform. #V is the number of vertices in the whole graph, #E is the average number of edges for each vertex. We report time in seconds. Columns: #V-#E (e.g., 1,000,000-100), Serial, LGMT, LGMT-BH.

We use graphs of 1 million vertices and vary the average number of edges per vertex. The sizes of the graphs range from 0.5 GB to 37 GB. Parallelization yields reasonable speedups, given the characteristics of the application, of the emulated target platform, and the simplicity of the initial sequential code. As expected, the speedup increases as the complexity of the graph (number of edges per vertex) increases. Speedups range from 2.5 to 4.22, which is reasonable for this type of application. The block-hoisting optimization provides 6 to 7% higher performance. For comparison, the same queue-based implementation, when parallelized through simple OpenMP and run directly on the host platform (without emulating the target architecture with LGMT), only provides speedups of around 2.3 times over the sequential code.
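To make the effect of block-hoisting on the edge-list accesses more concrete, the following is a minimal sketch in C, not the code generated by YAPPA. It assumes a CSR graph layout and models a remote read with a hypothetical helper, `remote_get`, which copies data with `memcpy` and counts one transfer per call; it is not part of the LGMT API. The names `graph_t`, `bfs`, `demo_transfers` and the `MAX_DEGREE` bound are likewise illustrative assumptions. With `hoist` set, the whole edge list of a vertex is fetched in one block before the inner loop; otherwise each edge is fetched individually.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a remote read through a communication API:
   copies n ints and counts one transfer. Not an LGMT function. */
static long transfers = 0;

static void remote_get(int *dst, const int *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
    transfers++;
}

/* Assumed CSR layout: edges of v are col[row[v]] .. col[row[v + 1] - 1]. */
typedef struct {
    int nv;
    const int *row;
    const int *col;
} graph_t;

#define MAX_DEGREE 64 /* assumed capacity of the local (scratchpad) buffer */

/* Sequential queue-based BFS. level[] and queue[] hold nv entries each.
   Returns the number of visited vertices. */
int bfs(const graph_t *g, int root, int *level, int *queue, int hoist) {
    int head = 0, tail = 0;
    for (int v = 0; v < g->nv; v++)
        level[v] = -1;
    level[root] = 0;
    queue[tail++] = root;
    while (head < tail) {
        int v = queue[head++];
        int begin = g->row[v];
        int deg = g->row[v + 1] - begin;
        int buf[MAX_DEGREE];
        assert(deg <= MAX_DEGREE);
        if (hoist && deg > 0)
            remote_get(buf, &g->col[begin], (size_t)deg); /* one block */
        for (int i = 0; i < deg; i++) {
            if (!hoist)
                remote_get(&buf[i], &g->col[begin + i], 1); /* one per edge */
            int w = buf[i];
            if (level[w] < 0) {
                level[w] = level[v] + 1;
                queue[tail++] = w;
            }
        }
    }
    return tail;
}

/* Runs BFS on a four-vertex toy graph and returns the transfer count. */
long demo_transfers(int hoist) {
    static const int row[] = {0, 3, 5, 6, 7};
    static const int col[] = {1, 2, 3, 2, 3, 3, 0};
    graph_t g = {4, row, col};
    int level[4], queue[4];
    transfers = 0;
    int visited = bfs(&g, 0, level, queue, hoist);
    assert(visited == 4 && level[3] == 1);
    return transfers;
}
```

On this toy graph, `demo_transfers(0)` counts 7 transfers (one per edge) and `demo_transfers(1)` counts 4 (one per vertex with outgoing edges). The sketch only illustrates why fewer, larger transfers can reduce communication overhead; it does not model task parallelism or the timing behavior of the emulation platform.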
V. C  F W We presented YAPPA (Yet Another Parallel Programming Approach), a compilation framework based on LLVM for the automatic parallelization of irregular applications on modern MPSoCs. We discussed LGMT, a lightweight runtime library for the efficient execution of irregular applications on distributed memory, embedded manycore architectures. The runtime provides three features that allow better execution of irregular applications on these architectures: global address space, fork/join control model and lightweight software multithreading. We highlighted the features that a compiler requires to better deal with irregular applications. YAPPA builds on top of LGMT to produce a parallel, optimized irregular application, starting from a sequential specification with only synchronization constructs added to the code. YAPPA transforms memory operations to use the runtime's communication API, generates the tasks, builds the data structures and inserts the calls to the parallel constructs. YAPPA also performs several optimizations to better use communication resources. We demonstrated the suitability and functionality of the approach with a prototype of the compiler, generating the parallel code of a typical irregular kernel (graph Breadth First Search) starting from a sequential C specification. We showed scaling for the parallel version of the code and evaluated the performance of the communication optimizations performed by the compiler, demonstrating an increase in performance.

