I. INTRODUCTION
As we transition from a few cores to many cores, scaling the memory architecture is one of the most difficult challenges. Current multicore architectures feature a fully coherent cache architecture. Coherence ensures that a write by any core is visible to all the cores; each core may then read and obtain the updated values. This makes it easy to support the execution of applications written in the multi-threaded programming paradigm. However, implementing coherence often requires all-to-all communication between cores, and the overhead of implementing cache coherence increases dramatically with the number of cores [2, 3]. Non-Cache-Coherent (NCC) architectures circumvent this issue by omitting hardware cache coherence. Such architectures are power-efficient and scalable, but they are difficult to program [4]. NCC architectures are excellent for programs written in the Message Passing Interface paradigm, where the communication between tasks is explicitly present in the application. However, programs written in the popular multi-threaded programming paradigm may not execute correctly on these architectures, since values written by one thread on a core may not be propagated to another thread on a different core.
A compromise between current multicore designs (in which all memory is shared but which suffer from poor scalability) and pure NCC architectures (which are scalable but have no shared memory) is Hybrid Shared Memory (HSM) manycore architectures, in which there is some shared memory. In HSM architectures the private memory of the cores can be cached, but the shared memory cannot, since the caches are non-coherent. Multi-threaded programs can be executed on HSM architectures by mapping the shared data to the shared memory. To enable higher performance, HSM architectures may provide some limited on-chip shared memory to improve access to frequently accessed, or long-access-latency, shared data. The 48-core SCC processor from Intel is a prime example. It features non-coherent caches. Pages in the off-chip memory can be configured as shared-among-all-cores or private-to-a-core through page tables. The data in the private pages is cacheable, but the shared pages are not. To enable efficient execution, the Intel SCC processor provides 384 KB of on-chip shared memory (only 8 KB per core).
In their original form, multi-threaded applications can only be executed on a single core of the HSM processor. This ensures correct execution, since the same core is writing to the memory, so it is coherent by definition. However, this approach is clearly not scalable, since we can use only one core of the HSM processor. The objective of this paper is to enable efficient and scalable execution of multi-threaded applications on HSM processors. To do that, we i) identify all the shared data in a multi-threaded application and map it to the off-chip shared memory. We do this through a series of analytic passes operating on the source code, which create an increasingly accurate picture of the shared nature of program variables. For example, initially we assume that all global variables are shared, but in later stages, through points-to analysis, we may be able to determine that some global variables cannot be accessed in more than one thread, and we can classify them as non-shared. Our approach works for well-constructed multi-threaded programs, free from improper thread accesses and race conditions. This approach is scalable and performs well, since all the threads can access their private data through caches; only shared data is non-cacheable. Performance can be further improved by ii) mapping the most frequently accessed shared data to the on-chip shared memory to achieve efficient execution of multi-threaded applications on HSM architectures.
We have implemented our compiler analysis in the CETUS source-to-source compiler framework [1]. We evaluate the effectiveness of our techniques by measuring the runtime of several parallel benchmark kernels on the Intel SCC processor. Compared to executing multi-threaded applications on a single core, identifying the shared data and mapping it to the off-chip shared memory improves performance by about 32x. On top of this, by identifying frequently accessed data and mapping it to on-chip shared memory, we improve performance further by about 8x on average.
II. RELATED WORK
The work presented in this paper has two aspects. The first is to identify a conservative but tight superset of the shared data. Many hardware-level techniques for identifying shared data have been developed with the intent of better utilizing or improving caches. Bellosa and Steckermeier [5] utilize hardware performance counters to detect data sharing between threads, with the goal of co-locating data on the same processor. Liu and Berger [6] and Paul et al. [7] focus on cache improvement as well, detecting and preventing false sharing in cache lines or reducing the traffic overhead incurred through cache coherence protocols. In addition, there is work in this domain that attempts to detect shared data at runtime. For example, shared memory spaces are explored in von Praun and Gross [8] and Pozniansky and Schuster [9], where thread access is controlled in order to efficiently allocate shared data. Savage et al. [10] need to determine data sharing in order to prevent race conditions; unsafe operations in a program are prevented by employing a consistent locking discipline to manage resource contention. The advantage of runtime-based analyses is evident in repeated-run profiling techniques such as Xu et al. [11] and Yang et al. [12]: the former implement a detector with atomic regions that identifies data sharing when multiple threads interact with the regions, while in the latter, multiple runs of the program help detect shared data. We prefer a static analysis approach to avoid the execution overhead of runtime-based techniques. Kahlon et al. [13] use a static analysis technique to detect and prevent race conditions that result from improper access of shared variables. Gondi et al. [14] take a different path to preventing race conditions by minimizing the time shared data is kept in memory, purging it as soon as a last use is detected. However, none of these works is directly applicable to our approach, since we need a compile-time approach to identify shared data in a multi-threaded application.
The second component of our work deals with data partitioning and memory management. The HSM manycore architecture has both on-chip and off-chip shared memory, and both Panda et al. [15] and Kandemir et al. [16] have addressed data partitioning between on-chip and off-chip memory. However, neither considers parallel programs in their analysis. In particular, estimating the number of accesses to program variables differs between sequential and multi-threaded applications. Our work extends theirs by implementing a data partitioning scheme which considers parallel programs and approximates data read and write counts from all the threads. Cichowski et al. [17] use a manual process to port a single multi-threaded program to the SCC. To the best of our knowledge, our technique is novel in that it combines a static shared-data analysis of a multi-threaded program with transformations that automatically enable application execution on an HSM manycore architecture.
III. OUR APPROACH
C POSIX threads (Pthreads) [18] programs present unique challenges for HSM manycore systems due to how global variables and shared data are managed within threads versus how they are handled across processes. In a multi-threaded program, a global variable is implicitly shared among all threads, since the threads share the program text, data, and heap space of the parent process. In a multiprocess application, however, each thread from the original multi-threaded program is "mapped" to a full process, one per core. Variables which are global within a process are not implicitly shared with other processes. For proper execution these must be identified and converted to explicitly shared variables accessible through the HSM manycore software API. Functions and data managed by threads must also be transformed to process-based execution. Our analysis builds up an increasingly accurate picture of the state of each variable (including pointers) as it appears in the program. The sample program provided in Listing 1 should be used as a reference for this section.
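A minimal Pthread program of the kind Listing 1 describes might look as follows; the names sum, tLocal, tmp, global, and tf come from the surrounding discussion, while the remaining structure is illustrative:

/* Illustrative sketch of a program with the shape the text
 * describes for Listing 1; only the named identifiers are
 * taken from the paper, the rest is assumed. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4

int sum = 0;                  /* global, updated by every thread: shared       */
int global;                   /* global but unused in any thread: may be private */
int *tmp;                     /* global pointer; its target becomes shared     */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *tf(void *arg)           /* function launched as a thread */
{
    int tLocal = (int)(long)arg;   /* function scope: not shared */
    pthread_mutex_lock(&lock);
    sum += tLocal;
    *tmp = sum;                    /* pointed-to object is shared */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    tmp = malloc(sizeof(int));
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, tf, (void *)i);   /* launched in a loop */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("sum = %d\n", sum);
    free(tmp);
    return 0;
}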
A. Variable Scope and Access Frequency Analysis (Stage 1)
This first stage takes as input the multi-threaded program source code and performs a rudimentary analysis of local and global variables. Details such as the size of each variable, its type, and its read and write counts, as shown in Table I, are extracted. We implement a technique similar to that of Pabalkar et al. [19]. We assess variables based on their context: whether they appear within a procedure, within a loop or nested loops, and whether they are accessed within a thread. This procedure provides approximate relative read and write counts for each variable. Each step in this and the following stages represents an analytic pass through an abstract intermediate representation (IR) of the source code [1]. Passes are designed to look as narrowly or as broadly within the IR as their analysis requires.
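As a small illustration of the kind of estimate this stage produces (the weights shown are illustrative, not our exact cost model), consider a variable accessed inside nested loops:

/* Hypothetical input for Stage 1; comments show illustrative
 * relative access-count estimates derived from loop context. */
#define N 100
#define M 50
int a[N][M];

int kernel(void)
{
    int acc = 0;                     /* ~N*M reads and writes (in nested loop) */
    for (int i = 0; i < N; i++)      /* outer loop: weight N                   */
        for (int j = 0; j < M; j++)  /* inner loop: weight N*M                 */
            acc += a[i][j];          /* a: ~N*M reads                          */
    return acc;
}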
B. Inter-thread Analysis (Stage 2)
This stage identifies which variables exist within threads and which are shared. In Algorithm 1, given a variable name and a list of procedures, the IR is traversed in a depth-first manner to locate the variable and the procedure within which it appears. The IR is then searched for the thread which executes this procedure. Based on whether the thread is launched only once or several times (for example, within a loop), a decision is made whether the variable lives within a single thread or multiple threads, and this information is returned. Based on this result, the sharing status of each variable (Table II) is updated. Referring back to Listing 1, even though both the variables sum and tLocal exist within the function tf, which is launched by a thread, tLocal is defined in the scope of the function (not shared between threads) and has its sharing status set to false. Table I is updated to reflect the name of the function within which each variable was used and/or defined.

Algorithm 1 Thread status of a variable v (F: procedures launched as threads)
1: for all procedures in the IR do
2:   if v is found in the procedure then
3:     proc ← name of procedure which contains v
4:     if proc ∈ F then
5:       caller ← pthread_create launching proc
6:       if caller appears within a loop then
7:         return "In Multiple Threads"
8:       else
9:         seen ← number of times proc appears in pthread_create calls
10:        if seen > 1 then
11:          return "In Multiple Threads"
12:        else
13:          return "In Single Thread"
14:        end if
15:      end if
16:    end if
17:  end if
18: end for
19: return "Not in Thread"
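For example, the two launch patterns Algorithm 1 distinguishes might appear as follows (a minimal sketch; function names are illustrative):

/* Sketch: the two thread-launch patterns Algorithm 1 distinguishes. */
#include <pthread.h>
#define N 8

void *once_fn(void *arg) { return arg; }
void *many_fn(void *arg) { return arg; }

int main(void)
{
    pthread_t t0, t[N];
    pthread_create(&t0, NULL, once_fn, NULL);       /* seen once: "In Single Thread" */
    for (int i = 0; i < N; i++)                     /* launched in a loop:           */
        pthread_create(&t[i], NULL, many_fn, NULL); /* "In Multiple Threads"         */
    pthread_join(t0, NULL);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}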
C. Alias and Pointer Analysis (Stage 3)
Because potentially shared variables may be hidden behind pointer relationships, this stage performs a points-to pointer analysis leveraged from the Cetus translation framework [1]. The goal of the basic analysis is to identify the set of memory locations that a pointer variable may point to. Interprocedural pointer information is analyzed via a dataflow methodology in which pointer relationships are explicitly identified from pointer assignments, including function calls. At each line of the program the analyzer produces a relationship map as output. This data is merged with the pointer information collected from analyzing previous statements, building a comprehensive overview of the pointer relationships within the program. These pointer relationships are classified as definite or possible, with the latter often occurring after analyzing pointers within an if-else statement. If a particular pointer is shared, then the object it points to is also accessible in the context of this sharing. Algorithm 2 describes the high-level details of this process.

Algorithm 2 Propagating sharing status through pointer relationships (V: variable table)
1: for all symbols s ∈ V do
2:   if a relationship exists with s and the relationship is "definite" then
3:     ptr ← pointer symbol
4:     ptt ← pointed-to symbol
5:     shared ← ptr status from V
6:     if shared is True then
7:       shared ← ptt status from V
8:       shared ← True
9:       update ptt status in V
10:    end if
11:  end if
12: end for

It is possible that the pointed-to symbol is yet another pointer, or it may be a variable. The pointed-to object is retrieved and its sharing status is updated as a shared entity, such as that of the variable tmp given in the last column of Table II. The points-to analysis offers a powerful capability to extract relationships that are not otherwise evident; additionally, our analysis can be less conservative, since the set of variables that may alias a given variable is constrained. As Stage 3 ends, refer again to Table II: global variables which were defined but entirely unused, such as global, may be set as private.
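As a concrete illustration (a minimal sketch; names are illustrative), a definite points-to relationship can force even a stack variable to be treated as shared:

/* Sketch: a stack variable becomes shared when a thread receives
 * a pointer to it (a "definite" points-to relation). */
#include <pthread.h>

void *tf(void *arg)
{
    int *p = (int *)arg;   /* p definitely points to x in main */
    *p += 1;               /* so x must be treated as shared   */
    return NULL;
}

int main(void)
{
    int x = 0;             /* local, but reachable from the thread via &x */
    pthread_t t;
    pthread_create(&t, NULL, tf, &x);
    pthread_join(t, NULL);
    return x;
}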
D. Data Partitioning (Stage 4)
This stage uses information from previous stages to make decisions about where to place data within the HSM memory hierarchy. Just as with traditional caches, the size and frequency of access influence where data is stored and for how long. If all of the shared data fits within the on-chip shared memory, it is collectively allocated into the faster SRAM, even if some data is accessed much more frequently than other data. A tradeoff is made if not all the data fits in the on-chip shared memory. In line 14 of Algorithm 3, the variables are sorted by size, as in Panda et al. [15]. A slightly modified algorithm also accommodates sorting by frequency of use, as that metric is retained within the properties collected during the analysis of each variable. Shared scalars may be mapped to on-chip memory readily, with further granularity provided by frequency of access to those variables. Larger arrays may be allocated entirely in DRAM or split between DRAM and SRAM. The shared memory declaration is identical to a dynamically allocated variable in C; the difference is in the name of the actual function call. The newly constructed declaration is inserted into the 'main' procedure of the target program, effectively making the variable or pointer explicitly shared across the entire multiprocess application.

Algorithm 3 Data partitioning of shared variables (V: shared variables; P: program)
1–3: …
4: if all shared data in V fits within the on-chip shared memory then
5:   for all shared variables s ∈ V do
6:     Create on-chip malloc call, C
7:     Insert put and get calls in P to access on-chip memory
8:     if previous malloc call B for s exists in P then
9:       Remove B
10:    end if
11:    Insert C in main function of P
12:  end for
13: else
14:  Sort V by size, ascending
15:  R ← size of remaining on-chip memory
16:  for all shared variables s ∈ V do
17:    if s.mem_size ≤ R then
18:      Create on-chip malloc call, C
19:      Insert put and get calls in P to access on-chip memory
20: …
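On the SCC, for instance, the inserted allocation and access calls could use the RCCE API roughly as sketched below; the array A, the sizes, and the placement decisions are illustrative assumptions, not output of our tool:

/* Sketch of inserted declarations using the RCCE API on the SCC;
 * names, sizes, and the on-chip/off-chip split are assumptions. */
#include "RCCE.h"

#define CHUNK 4096

double  *A;        /* large shared array: off-chip shared DRAM */
t_vcharp buf;      /* frequently accessed block: on-chip MPB   */

void declare_shared(void)
{
    A   = (double *)RCCE_shmalloc(1024 * sizeof(double)); /* off-chip */
    buf = RCCE_malloc(CHUNK);                             /* on-chip  */
}

void exchange(char *src, char *dst, int partner)
{
    /* put/get calls inserted by Stage 4 to stage data through the MPB */
    RCCE_put(buf, (t_vcharp)src, CHUNK, partner); /* bulk copy into partner's MPB */
    RCCE_get((t_vcharp)dst, buf, CHUNK, partner); /* bulk copy out of the MPB     */
}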
E. Translation Framework (Stage 5)
The final stage implements a source-to-source translator which uses the analysis from Stages 1-4 to transform the IR and output C source code. The thread-to-process pass (Algorithm 4) finds functions launched via the pthread_create call. This routine accepts four parameters: the thread ID, a thread attribute (or NULL), the function executed by the thread, and an argument (or NULL) passed to the executing function (see Listing 1). Once a pthread_create call is found, its third and fourth arguments are extracted and saved. A new function call is generated using the function name derived from the third argument; it is given either the original argument specified as the fourth parameter of the pthread_create call or a core identifier, if the argument passed to the function would have been a thread ID and the target architecture supports a core ID. After inserting the new function call above the pthread_create call in the IR, the pthread_create call is removed from the IR. Last, the function name and the order of appearance of the pthread_create call are noted for subsequent use and stored within a hash table.
Consider that after the thread-to-process conversion, an application runs the same executable on multiple cores. In this case, if a particular thread runs on all cores, then the information in the hash table may be discarded. However, if a task is thread-specific and not delegated across all the other threads, it must be isolated such that it executes only on the given core(s). To isolate such a function within the hash table, it is wrapped in an if-condition whose conditional checks whether the program is running on the core with the proper core ID. The core ID is the value associated with the function name in the hash table. We ensure that thread IDs correspond 1:1 with core IDs.
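A before-and-after sketch of this transformation is given below; RCCE_ue() supplies the calling core's rank on the SCC, and init_task is a hypothetical thread-specific function recorded in the hash table:

/* Before: thread-based launch in the Pthread source.
 *   pthread_create(&t, NULL, tf, (void *)(long)id);
 * After: process-based sketch; every core runs the same binary. */
#include "RCCE.h"

void *tf(void *arg);        /* original thread function              */
void *init_task(void *arg); /* hypothetical thread-specific task     */

void launch(void)
{
    int me = RCCE_ue();     /* core ID stands in for the thread ID   */
    tf((void *)(long)me);   /* direct call replaces pthread_create   */

    if (me == 0)            /* guard from the hash table: run only   */
        init_task(NULL);    /* on the core recorded for this task    */
}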
IV. EXPERIMENTAL RESULTS

A. Experimental Setup
We perform our experiments on the Intel Single-chip Cloud Computer (SCC) [20]. This 48-core non-coherent-cache architecture features a unique on-die shared SRAM (384 KB) called the Message Passing Buffer (MPB). Through the MPB, the cores communicate a limited amount of data directly, bypassing both the L2 cache and DRAM (up to 64 GB). Each benchmark is run on the SCC with each core running Linux, at 800 MHz core frequency and 1600 MHz mesh frequency. We run several multi-threaded applications on the Intel SCC with and without our analysis and transformation. These applications include a program to Count Primes, a Pi Approximation, a program that sums increasingly large multiples of 3 and 5 (3-5-Sum), LU Decomposition, Dot Product, and a synthetic memory-operations benchmark, Stream, from McCalpin [22]. All applications were compiled for the SCC using the Intel C++ compiler (icc) version 8.1 (gcc 3.4.5) and RCCE 2.0.
Without our analysis and transformation, multi-threaded Pthread programs can only execute on a single core. To enable them to run on multiple cores, we convert them to RCCE programs. We have implemented our analysis and transformation in the source-to-source CETUS compiler framework [1]. Each component or 'pass' of our framework is a subclass of either the AnalysisPass or TransformPass class. These classes provide boilerplate code and perform consistency checking to ensure that the program IR remains in a self-consistent state. The Driver class brings all the passes together and executes them in series to analyze and make iterative changes to the IR. We use Java 1.6 and ANTLR 2.7.5.
B. Mapping shared data to off-chip shared memory improves performance by 32X
As an evaluation baseline we run each Pthread application on a single core of the SCC. We then generate an RCCE variant of each program which takes advantage of 32 cores of the SCC and utilizes off-chip shared memory, and measure its runtime. The Pthread benchmarks were built for 32 threads, and the RCCE applications utilize 32 cores. Pi Approximation, 3-5-Sum, Count Primes, and Stream achieve improvements of 32x, 29x, 16x, and 17x, respectively. Fig. 1 shows the relative performance increase for each application (using only off-chip shared memory). The RCCE versions of Dot Product and LU Decomposition keep large arrays in off-chip memory and have at least 8 cores in contention per memory controller. Although the performance benefit of 32 cores over 1 is hardly surprising, our work of converting multi-threaded programs to run as HSM applications is what makes this comparison possible.
C. Using on-chip shared memory further improves performance by 8X on average
A comparison of RCCE programs which use only off-chip memory against those that also utilize on-chip memory is given in Fig. 2. Programs which either exhibit a high degree of memory usage or balance memory use and core computation see the most performance improvement from the MPB. For example, Stream already benefits from the parallelism of 32 cores versus a single core with competing threads. In addition, when converted to utilize the MPB, not only are the memory accesses distributed across the cores, but the locality of core-to-MPB accesses is much better than that of core-to-DRAM. Finally, MPB transfers may be done as bulk copies of (often contiguous) memory, further improving performance for an all-memory synthetic benchmark. LU Decomposition is an interesting case, as the matrix within that program does not fit into the on-chip shared memory. For a very slight performance improvement, a small portion of the matrix (a few rows) may be allocated separately on the MPB.
D. Enabling Scalable Applications on HSM Architecture
Converting multi-threaded programs to take advantage of multiple cores of the HSM architecture enables scalability. While the benefit is application-dependent, programs with sufficiently large computations that transfer data between cores using the on-die MPB can achieve significant performance increases as the core count grows (see Fig. 3).
V. CONCLUSION
We present a novel analysis and translation framework used to convert applications to run on the Intel Single-chip Cloud Computer. Our approach automatically analyzes the multi-threaded source program, extracts the properties of all variables (shared and private), and efficiently maps the shared data to the available on-chip and off-chip shared memory. Our technique converts otherwise incompatible or inefficient programs by leveraging architecture-specific transformations, and our experimental results demonstrate the suitability and performance benefits of enabling multi-threaded applications for efficient execution on HSM manycore architectures. Our work has limitations: we restrict source programs to those which do not use the non-portable (_np) Pthread extensions, and our evaluation is limited to the maximum number of cores on our experimental platform (48). However, the framework itself is not dependent on, or limited by, a given number of cores and scales to platforms with different core counts.
