In this paper, we present an on-chip memory store called "Local Memory Store (LMStr)" which can be used with a regular cache hierarchy or on its own as a redesigned scratchpad memory (SPM). The LMStr is a special kind of SPM that is shared among the cores of a multicore processor. The store itself is managed by hardware, yet compiler support is instrumental in deciding which data items/types should live in it. Critical data is placed in the LMStr according to its type (i.e., local, global, static, or temporary). The programmer can optionally provide hints to the compiler to place certain data items in the LMStr. We evaluate our design using a matrix multiplication micro-application and multiple Mantevo mini-applications. Our results show that LMStr improves data movement by up to 21% compared to cache alone, with a mere 3% area overhead. LMStr also improves the cycles per memory access by up to 40% and is projected to consume up to 85% less dynamic energy than a traditional cache.
INTRODUCTION
Power consumption constraints are a major driver in processor design, on an equal if not higher footing than performance. Moving data within the memory hierarchy is reported to consume 28-40% of the total energy in high performance server processors [16]. A significant amount of unnecessary data movement occurs due to conventional caches, which are hardware-controlled. Caches are oblivious to the type of data (i.e., they treat a temporary variable the same as a global or a local one). Meanwhile, the programmer most of the time does not worry about the underlying cache organization. This abstraction hides the complexity and operations of the cache hierarchy and simplifies both the programmer's and the compiler's jobs. As a consequence, the programmer and compiler do not control the placement or movement of a specific variable unless they go out of their way with interventions that introduce dummy variables, as in the case of padding [4, 5]. Moreover, caches have a fixed block (i.e., cache line) size. Previous research shows that, for some of the Mantevo mini-applications, cache line utilization is less than 50% [36]. Figure 1 shows a histogram of the cache line size distribution for two Mantevo applications (CoMD and miniFE) if the cache line size is allowed to vary. A fixed block size is not a good choice for many applications.
This means many variables are evicted from the cache without being accessed even once. This causes cache pollution, resulting in unnecessary data movement and unnecessary traffic on the chip interconnects, i.e., wasted bandwidth and a significant amount of wasted energy [18]. Typically, caches occupy 50% of the chip area, use 70% of the total number of on-chip transistors, and consume 25-45% of the total chip power [42]. Data movement is further exacerbated by hardware-controlled, software-oblivious replacement policies, which may evict a cache line that is re-fetched soon after. In addition, multithreaded applications can share data that is duplicated across the system in private caches and requires coherence protocols to maintain the validity of the shared data in the memory hierarchy. Some cache coherence protocols suffer from poor scalability with increased bus traffic and execution latency [25].
To reduce data movement and power consumption, processor and memory architectures are evolving to include near-memory computing, deeper cache hierarchies, new lower-power memory technologies, and scratchpad memories. Very few recent works have examined the use of scratchpad memories (SPM) as an alternative to caches for on-chip storage [34, 35] .
A scratchpad memory (SPM) is a small on-chip static RAM that is mapped onto the address space of the processor at a predefined address range of a process [30]. It requires compiler support and possibly the programmer's intervention. An SPM stores data in variable-sized blocks, which ensures almost perfect utilization of the storage space. SPMs are considerably more energy and area efficient than caches, since no tag bits are needed to identify the stored data. Moreover, an SPM stores only a single copy of shared data, which eliminates coherence issues for SPMs in multicore processors. However, one of the shortcomings of SPMs is sharing storage space between multiple threads/cores in a multicore system. Although the scratchpad stores a single copy of data, the lack of an efficient hardware mechanism and proper compiler support introduces significant performance constraints in multicore systems [13]. Furthermore, multiple processes sharing an SPM use the scratchpad space inefficiently [29, 40]. Therefore, the use of scratchpad memory in the literature is limited to embedded systems, where applications with very limited parallelism run together and require a limited amount of SPM space.
Some GPUs and game console processors, such as the IBM Cell processor, also introduced SPMs, where the GPU cores have private (i.e., non-shared) SPMs. However, the processes that use GPUs share a limited amount of data between threads, which encourages the use of SPMs.
We propose an on-chip memory structure, which we call "Local Memory Store (LMStr)", that implements a shared on-chip SPM. It can work with a conventional cache hierarchy connected to DRAM for high-performance computing systems. For our purposes, we define an SPM as a high-bandwidth, hardware-controlled on-chip memory that is shared among multiple cores and intended to store data items in groups according to their types. Thus, our SPM is conceptually and architecturally different from conventional SPMs such as those in GPUs or embedded systems. Moreover, LMStr stores programmer-specified data, with compiler support guiding the mapping of the data in the LMStr. As a consequence, the LMStr memory architecture exploits data locality better than a cache and reduces data movement between on-chip (i.e., SPM) and off-chip memory (i.e., main memory) by supporting variable-sized data blocks in the LMStr. Furthermore, LMStr eliminates data duplication. Additionally, LMStr is capable of managing the data storage requirements of multiple processes via its hardware-controlled allocation technique. Figure 2 shows a simplified layout of the LMStr in a multicore processor, where four cores share an LMStr and a conventional cache hierarchy with private and shared caches. One of the key challenges in realizing an efficient LMStr for scientific and high performance computing is data placement in the shared scratchpad and implementing the capability to keep the data consistent among multiple threads. We focus on a placement methodology supported by a synergy between the hardware and the software (i.e., compiler and the program itself).
However, generating and accessing data blocks with multiple memory items is a complex design challenge since synchronizing the compiler and the hardware is a must. We propose a technique to generate data blocks with the same types of variables in program segments that should be handled by the compiler. Implementing a compiler is beyond the scope of this paper. Rather, we implement a block generator that combines memory references into variable sized memory blocks based on temporal and spatial locality (see Section 4.1). The contributions of this paper are summarized as follows:
(1) Propose a shared, compiler-guided, hardware-controlled on-chip scratchpad memory structure (LMStr) for multicore processors.
(2) Quantitatively compare the performance (including data movement analysis) of a traditional cache hierarchy only, an LMStr only, and a cache backed by an LMStr.
(3) Create a block generation tool that combines memory references into variable-sized blocks using temporal and spatial locality characteristics of the original reference stream.
The rest of the paper is organized as follows: LMStr architecture is presented in Section 2 and Section 3 describes the design methodology. We discuss experiment methodology and performance results in Sections 4 and 5 respectively. We summarize the related work in Section 6 while Section 7 concludes the paper.
LMSTR ARCHITECTURE
The LMStr is a hardware design that is compiler supported with possible hints from the programmer to the compiler. In the following subsections, we discuss the design of the LMStr. We also discuss the overheads that we expect to incur. Figure 3 by the dotted line to the right of the LMStr Data Storage.
LMStr Hardware Design

Private Mapping
LMStr Overhead(s)
There are three sources of overhead in the LMStr: the code that the compiler generates, the extra hardware that is added, and resource sharing in parallel execution.
To access data in the SPM, we send the reference with its LMRef index, which increases the binary code size. The compiler also generates additional instructions to get the updated data. Moreover, we implement another data structure, the LMRef descriptor, to track the elements in a block; it consumes space in main memory. Each process also stores a private copy (masterLMRef) that occupies some space as well.
The hardware overhead of the LMStr architecture consists of an LMRef table in each core, an LMDir table, and the LMStr storage space. All elements of the LMStr are proposed to be built with fast 6T SRAM, as used in caches, to achieve high-speed data access.
In the current design of the LMStr, not necessarily the only way to implement the LMStr, some of the hardware components (LMDir, LMEngn, and LMStr data storage) are shared among all cores which can introduce contention and ultimately serialization of execution (e.g., inter block data update or block allocation). However, we can multi-port these shared structures, and a distributed shared design (similar to tiled architectures) can decentralize these resources which minimizes the effects of shared resources.
LMSTR METHODOLOGY
LMStr is a compiler-supported, hardware-controlled memory that stores critical data suggested by the programmer or compiler in the SPM. The compiler inserts code at compile time to copy a data block (a group of variables of the same type, e.g., local or temporary) from main memory into the SPM. Blocks allocated in the SPM can be evicted to main memory to make room for new blocks. As in caching, main memory retains a copy of the data even when a copy is in the SPM. Because the compiler knows the location of each variable at each point in the program, no runtime check/comparison is needed to find the location of any given variable. Consequently, the overheads and unpredictable latencies of hardware caching are avoided.
The basic idea in LMStr is that the programmer is responsible for flagging the individual program variables or types of variables (static, local, global, temporary, and arrays) that should be stored in the SPM. The compiler generates data blocks by combining flagged variables of similar type and maps the blocks to LMRef entries. At run time, each block is accessed through its mapped LMRef entry, which indexes an entry in the shared LMDir table. The LMDir entry in turn indexes the shared data storage (LMStr) where the data is stored. However, generating blocks and specifying the LMRef entry for a block at compile time requires optimized compiler support, and storing data in the SPM requires hardware modifications.
We focus on running multithreaded programs on a multicore system that includes a shared LMStr. Therefore, we summarize the compiler support needed at a high level but explain the hardware in greater detail.
Compiler Support for LMStr
The compiler extensions to support the LMStr comprise three modifications. First, we extend the code generation phase of the compiler to generate blocks of targeted variable types, or of flagged variables of similar type, at the function/procedure or basic block level (defined as a region). Second, we extend the phase to map the blocks to the LMRef table using a graph coloring algorithm [20]. Third, we introduce a new data structure called the LMRef descriptor that tracks the mapping of variables to blocks.
Identifying Regions:
The compiler identifies the blocks of each type of variables in program procedures where the procedure/function is defined as an initial region. In our design, the compiler has knowledge of the number of entries in the LMRef table and the virtual amount of scratchpad space that the process may use. If the available resources in the system (LMRef entries and the LMStr storage space) cannot satisfy requests that span an entire function or initial region, we divide some of the initial regions into basic blocks to limit the size of the generated blocks for the SPM.
The algorithm to define the region of a procedure is as follows:
• Step 1: Generate a control flow graph (CFG) for the program code, where each procedure (and loop) in the program is defined as a node. Nodes are connected by edges; successor nodes are called parents and predecessor nodes are called children. Edges to procedure nodes represent calls; edges to loop nodes show that the loop is nested in its parent. Here, the number of blocks in a node is its required number of LMRef entries and the cumulative size of the blocks is its required SPM size. However, the required number of LMRef entries in a parent node is the cumulative number of LMRef entries of all its child nodes plus its own. Similarly, the required SPM size in a parent node is the cumulative SPM requirement of all its child nodes plus its own required space.
• Step 4: If no procedure requires more LMRef entries or SPM space than the available resources, go to step 6. Also, if there are procedures that require more resources than available but these procedures are already divided into basic blocks, go to step 6.
• Step 5: If the number of LMRef entries or the SPM size is greater than the available resources for any procedure, identify the farthest node from the first parent node and reduce the region of that node from procedure to basic block. When the region of a procedure is reduced to basic blocks and the procedure contains multiple basic blocks, the new required number of LMRef entries and SPM size are the maximum number of LMRef entries and the maximum SPM size required by any basic block of that procedure. Identify the flagged variables in the basic blocks and mark each basic block as a region. Determine the number of blocks (required LMRef entries) and estimate the cumulative size of blocks (required SPM size) for every basic block. The largest requirement of any basic block becomes the new required number of LMRef entries and SPM size for that procedure. Repeat steps 1-4.
• Step 6: Region determination is complete; allocate an LMRef entry for each block using the placement algorithm described in Section 3.1.2.
Consider the example in Figure 4(a), where oval nodes represent procedures and circular nodes represent loops. Assume Loop1 and Loop2 require 4 LMRef entries each, and 2 KB and 1 KB of SPM space, respectively; they are executed in main() and Proc-C, respectively. In the proposed algorithm, the required number of LMRef entries in a parent node is the cumulative number of LMRef entries of all its child nodes plus its own. Therefore, Proc-C requires the cumulative LMRef entries and SPM size of its own variable blocks and the blocks of Loop2. Similarly, main() requires the cumulative LMRef entries of its own blocks and those of all other procedures. Now assume that 12 LMRef entries and 10 KB of SPM are available. The region in Proc-C then needs to be shortened from the procedure boundary to the basic block boundary. Proc-C now has multiple regions; the maximum number of LMRef entries required in a region is reduced to 10 and the SPM size to 3 KB. We then recalculate the required LMRef entries and SPM size for the program. Assume main() still requires more LMRef entries and/or SPM space than available. Hence, the region for main() also needs to be shortened from procedure to basic block. The resulting graph is shown in Figure 4(c).
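The region-determination loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the compiler pass itself: the `Node` class, the `required` and `determine_regions` helpers, and all per-node numbers other than those of Loop1 and Loop2 are hypothetical, and "farthest node" is interpreted here as the deepest over-budget node that has not yet been divided into basic blocks.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    own: tuple      # (LMRef entries, SPM bytes) when the whole node is one region
    per_bb: tuple   # largest (entries, bytes) of any basic block, used once split
    children: list = field(default_factory=list)
    split: bool = False

def required(node):
    """Cumulative (LMRef entries, SPM bytes): the node's own need plus its children's."""
    entries, size = node.per_bb if node.split else node.own
    for child in node.children:
        c_entries, c_size = required(child)
        entries, size = entries + c_entries, size + c_size
    return entries, size

def determine_regions(root, avail_entries, avail_bytes):
    """Split over-budget nodes, deepest first, until none is left to split."""
    split_order = []
    while True:
        over = []                       # (depth, node) pairs that exceed the budget
        stack = [(root, 0)]
        while stack:
            node, depth = stack.pop()
            entries, size = required(node)
            if not node.split and (entries > avail_entries or size > avail_bytes):
                over.append((depth, node))
            stack.extend((c, depth + 1) for c in node.children)
        if not over:                    # steps 4 and 6: nothing left to divide
            return split_order
        _, farthest = max(over, key=lambda item: item[0])
        farthest.split = True           # step 5: procedure -> basic blocks
        split_order.append(farthest.name)

# Hypothetical numbers chosen to mirror the Figure 4 walk-through.
loop1 = Node("Loop1", (4, 2048), (4, 2048))
loop2 = Node("Loop2", (4, 1024), (4, 1024))
proc_c = Node("Proc-C", (10, 4096), (6, 2048), [loop2])
main = Node("main", (6, 4096), (2, 1024), [loop1, proc_c])
print(determine_regions(main, 12, 10 * 1024))   # ['Proc-C', 'main']
```

With these assumed inputs, Proc-C is divided first and its cumulative requirement drops to 10 LMRef entries and 3 KB; main() still exceeds the budget and is divided next.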
In LMStr, each data block is valid for a group of instructions that is defined as a region. For multithreaded applications, however, the definition of a region needs to be extended into two categories according to how a thread uses the region.
In the first category, the thread executes instructions of one or more consecutive regions that are not shared by any other thread. In the second category, multiple threads can share and execute the same instructions of one or more consecutive regions but with different data.
In multicore systems, multiple threads can run simultaneously on multiple cores. Since we do not know the number of executing threads at compile time, the region ordering cannot be maintained, and thus we have to introduce a runtime data allocation technique for multithreaded applications. In the next subsections, we discuss how to handle these two cases.
Regions Accessed by Only One Thread: In this category, each thread executes its own regions, so multiple threads can execute concurrently on multiple cores and initiate multiple block allocations to the same LMRef entry. This introduces two design hazards: the LMRef entry hazard and the data duplication hazard.
LMRef entry hazard: Accessing multiple data blocks through the same LMRef entry can produce erroneous accesses to data blocks. This issue can be resolved by assigning, at compile time, a serial number (blockSID) to each block mapped to the same LMRef entry, which is possible by scanning the code after the code generation phase.
Data duplication hazard: A multithreaded application running on a multicore system breaks the block transfer sequence, and prefetching data in blocks then introduces multiple instances of a data item in the scratchpad. We treat this consistency issue as a data duplication issue and solve it by inserting inter block data update instructions at compile time (Section 3.2.4).
Regions Shared by Multiple Threads:
The data in LMStr is stored in blocks that are used in a region. In a shared region, the data blocks containing static variables, global variables, and certain array variables are shared among multiple threads. Therefore, a single block can serve multiple threads. On the other hand, each thread has its private blocks for temporary, local, and certain array variables in a region. These data blocks from different threads are grouped together to form a multi-block array that is stored in the SPM.
Generating multi-blocks by combining private blocks from different threads and accessing them efficiently is our goal, although it is not an easy problem. To this end, we extend the thread creation library to generate a special ID called serialID, a unique serial number for each thread that shares regions with other threads. For example, consider a program with four regions where the first region is shared by four threads and the third region is shared and executed by eight threads. The thread creation library generates unique serialIDs 0-3 for the threads in the first region and 0-7 for the threads in the third region. The allocation, deallocation and data access instructions of all the private blocks contain the corresponding serialID. Typically, the serialID is used as the serial number of a private block in a multi-block.
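The serialID numbering in the example above can be illustrated with a few lines of Python. This is only a sketch: `assign_serial_ids` and the region and thread names are hypothetical, and the real mechanism lives inside the extended thread creation library.

```python
def assign_serial_ids(shared_regions):
    """Give each thread of a shared region a serialID starting from 0.

    shared_regions: dict mapping a region name to the list of threads sharing it.
    """
    return {
        region: {thread: sid for sid, thread in enumerate(threads)}
        for region, threads in shared_regions.items()
    }

# First region shared by four threads, third region by eight (names are made up).
ids = assign_serial_ids({
    "region1": ["t0", "t1", "t2", "t3"],
    "region3": ["w%d" % i for i in range(8)],
})
print(ids["region1"])   # serialIDs 0-3
print(ids["region3"])   # serialIDs 0-7
```

The numbering restarts for every shared region, so a serialID is only meaningful together with the multi-block it indexes.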
Blocks mapping to LMRef entries:
The basic idea of the LMStr model is to map an LMRef entry to each block at compile time. However, as the number of blocks is presumably much larger than the number of LMRef entries, we formulate the LMRef management problem as one that can be solved by an existing graph coloring algorithm for block allocation [10].
To implement a graph coloring algorithm that maps blocks to LMRef entries, we first need to identify the blocks and their live ranges in a region. The live range of a block is defined as the instructions between the first instruction that uses any variable in the block and the last instruction that uses any variable of the same block. We determine the live range of a block by extracting live range information of variables from a data flow graph (DFG). However, instead of investigating the live range of a variable over the complete program code, we investigate the live range within procedures/functions, loops and basic blocks (called regions). We therefore introduce a block interference graph (BIG), which is similar to a register interference graph (RIG) except that the BIG shows interference between blocks whereas the RIG shows interference between variables. In a block interference graph, each node represents a block, and an edge connects two nodes whose live ranges overlap.
Graph coloring is an assignment of colors to nodes such that nodes connected by an edge have different colors. A graph is k-colorable if it has such a coloring with k colors. In our block mapping problem, we have to map blocks to k LMRef entries. Although we determine blocks in regions (procedures/functions, loops and basic blocks), the assignment of LMRef entries (colors) is done over the entire program (globally assigned). Coloring the interference graph is a hard problem (NP-hard); therefore, we propose to use a heuristic that is commonly used in register allocation with RIGs. The algorithm for mapping t blocks to k LMRef entries is as follows:
• Step 1: Pick a node with fewer than k neighbors. If there is no node with fewer than k neighbors, pick a node and identify a neighbor node that will not be colored (will not be mapped to any LMRef entry). That node (block) bypasses the LMStr (LMRef spill) and is stored in memory. All uncolored (bypassed) nodes (blocks) are assigned distinct values that are greater than the number of LMRef entries. Repeat this step until the node has fewer than k neighbors.
• Step 2: Push the node on a stack and remove it from the BIG.
• Step 3: Repeat steps 1 and 2 until the graph has one node.
• Step 4: Then start assigning colors (indices of LMRef entries) to the nodes (blocks) on the stack, starting with the last node added. Pop the top node from the stack and color it (map it to an LMRef entry) with a color different from those already assigned to its colored neighbors.
• Step 5: Repeat step 4 until the stack is empty.
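The steps above follow the simplify-and-select pattern of register allocation, and a minimal Python sketch of that pattern is given below. It rests on assumptions that the text does not fix: the function name `map_blocks` is hypothetical, the spill victim is taken to be the node with the most remaining neighbors, the graph is simplified until it is empty rather than until one node remains (the last node always has zero neighbors, so the result is the same), and bypassed blocks receive indices k, k+1, and so on.

```python
def map_blocks(big, k):
    """Map each block of a block interference graph to an LMRef index.

    big: dict mapping a block to the set of blocks it interferes with
         (assumed symmetric, with every block present as a key).
    Returns a dict block -> index; an index >= k marks a block that bypasses the LMStr.
    """
    graph = {node: set(neigh) for node, neigh in big.items()}
    stack, spilled = [], []

    def remove(node):
        for other in graph.pop(node):
            graph[other].discard(node)

    # Steps 1-3: push nodes with fewer than k neighbors, spilling when none exists.
    while graph:
        node = next((n for n in graph if len(graph[n]) < k), None)
        if node is None:
            victim = max(graph, key=lambda n: len(graph[n]))   # assumed spill choice
            spilled.append(victim)
            remove(victim)
        else:
            stack.append(node)
            remove(node)

    # Steps 4-5: pop nodes and give each the lowest index unused by colored neighbors.
    mapping = {}
    while stack:
        node = stack.pop()
        used = {mapping[n] for n in big[node] if n in mapping}
        mapping[node] = next(c for c in range(k) if c not in used)

    # Bypassed blocks get distinct values beyond the LMRef table size.
    for i, node in enumerate(spilled):
        mapping[node] = k + i
    return mapping
```

For four mutually interfering blocks and k = 3, for instance, one block is bypassed and the other three receive indices 0, 1 and 2.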
In Figure 5, we show the live ranges of blocks and map them to four LMRef entries with a graph coloring algorithm. Figure 5(a) shows the live ranges of program variables for two procedures, A and B. Procedure A uses multiple types of variables: La1 and La2 are local variables, Sa1 and Sa2 are static variables, Ga is a global variable, Ta1, Ta2 and Ta3 are temporary variables, and Aa is an array variable. Similarly, procedure B uses Lb1 and Lb2 as local variables, Tb1, Tb2 and Tb3 as temporary variables, and Ab as an array variable. Figure 5(b) shows the blocks generated by combining variables of the same type; for example, block La is formed from all the local variables (La1, La2) in procedure A and block Lb from all the local variables (Lb1, Lb2) in procedure B. The live range of the block of local variables in procedure A (instruction 1250 to 2250) starts at the first access to a local variable in procedure A (La1 at instruction 1250) and ends at the last access to a local variable (La2 at instruction 2250). Figure 5(c) shows the block interference graph (BIG) for these blocks and their mapping (coloring) to four LMRef entries (colors) with the graph coloring algorithm.
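The derivation of a block's live range and of the BIG can be illustrated with a short Python sketch. Only the endpoints 1250 and 2250 for block La come from the example above; the helper names and every other instruction number are hypothetical, and live ranges are treated as closed intervals.

```python
def block_live_range(variable_ranges):
    """Live range of a block: first use of any variable to last use of any variable."""
    firsts, lasts = zip(*variable_ranges)
    return min(firsts), max(lasts)

def build_big(block_ranges):
    """Connect two blocks with an edge when their live ranges overlap."""
    big = {name: set() for name in block_ranges}
    names = list(block_ranges)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (a0, a1), (b0, b1) = block_ranges[a], block_ranges[b]
            if a0 <= b1 and b0 <= a1:
                big[a].add(b)
                big[b].add(a)
    return big

# La1 is first used at 1250 and La2 is last used at 2250; the inner endpoints are made up.
la = block_live_range([(1250, 1800), (1400, 2250)])
print(la)   # (1250, 2250)

# Hypothetical ranges for two more blocks of procedure A.
big = build_big({"La": la, "Sa": (1000, 1500), "Ta": (2300, 2600)})
print(big)  # La and Sa interfere; Ta interferes with neither
```

The resulting dictionary has the adjacency-set shape that a coloring heuristic would typically consume.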
Hardware Management Methodology
In Section 3.1, we described the methodology to map the data blocks to LMRef entries at compile time. This subsection describes the hardware procedure to store, evict and access the mapped data blocks at runtime.
Typically, a program is compiled with knowledge of the number of LMRef entries in each core. At run time, the mapping is stored in the LMRef table and data blocks are moved between the SPM and main memory. Because the compiler generates blocks according to variable types and usage in a region, blocks are of variable size. The LMDir is the table that links the LMRef tables to the SPM: each entry in the LMRef table of each core points to a different entry of the LMDir.
The hardware organization is managed by a controller, the LMEngn. The LMEngn finds a free spot for a new block and allocates it, frees storage on a deallocation request, identifies and sends the requested data from a block to the CPU, manages requests that bypass the LMStr (as explained earlier), and de-fragments the storage space.
Block Allocation:
A typical block allocation request in a multithreaded application consists of the LMRef entry, the blockSID and the size of the block. First, the masterLMRef of the process is checked against the requested LMRef entry. If the entry in the masterLMRef is unallocated, the LMEngn begins allocating the block: it finds an unallocated entry in the LMDir table and a free place in data storage whose size is equal to or greater than that of the requested block. The index of the free place (LMBlndx), the blockSID and the size of the block are stored in the LMDir entry, and the LMDir entry is flagged as data. The entry in the LMRef table of the core is updated with the LMDir index and the size of the block, and is flagged as allocated. Similarly, the entry in the masterLMRef is updated with the LMDir index and a block count of 1.
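The basic allocation path described above can be modeled in software as in the sketch below. It is a toy model under explicit assumptions rather than the hardware design: the class `LMEngnModel` and its field layout are hypothetical, free space is found with a first-fit search (the text does not name a placement policy), and a request to an LMRef entry that is already allocated is simply rejected instead of following the concurrent-mapping procedure.

```python
class LMEngnModel:
    """Toy software model of single-block allocation in the LMStr."""

    def __init__(self, storage_bytes, lmdir_entries):
        self.free = [(0, storage_bytes)]       # (start, size) holes in data storage
        self.lmdir = [None] * lmdir_entries    # shared LMDir table
        self.master_lmref = {}                 # masterLMRef of one process
        self.lmref = {}                        # core id -> that core's LMRef table

    def allocate(self, core, lmref_entry, block_sid, size):
        """Return the LMBlndx of the new block, or None if it cannot be placed."""
        if lmref_entry in self.master_lmref:
            return None                        # concurrent mappings are not modeled
        dir_idx = next((i for i, e in enumerate(self.lmdir) if e is None), None)
        hole = next((h for h in self.free if h[1] >= size), None)   # first fit (assumed)
        if dir_idx is None or hole is None:
            return None
        start, hole_size = hole
        self.free.remove(hole)
        if hole_size > size:                   # keep the unused tail as a smaller hole
            self.free.append((start + size, hole_size - size))
            self.free.sort()
        self.lmdir[dir_idx] = {"LMBlndx": start, "blockSID": block_sid,
                               "size": size, "kind": "data"}
        self.lmref.setdefault(core, {})[lmref_entry] = {
            "LMDir": dir_idx, "size": size, "allocated": True}
        self.master_lmref[lmref_entry] = {"LMDir": dir_idx, "num_blocks": 1}
        return start

engine = LMEngnModel(storage_bytes=1024, lmdir_entries=4)
print(engine.allocate(core=0, lmref_entry=2, block_sid=0, size=256))   # 0
print(engine.allocate(core=0, lmref_entry=3, block_sid=0, size=128))   # 256
```

Each successful call updates the three structures named in the text: the LMDir entry, the LMRef entry of the requesting core, and the masterLMRef entry with a block count of 1.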
A multithreaded application running on multiple cores introduces dynamic scheduling of block allocation, so multiple blocks can be allocated to the same LMRef entry concurrently (LMRef entry hazard). In this case, the LMEngn updates the entry in the masterLMRef by incrementing the number of blocks. A new LMDir entry is used that indexes an LMBlndx which stores the blockSID, size of the block, LMBlndx, and allocation flag for each of the blocks mapped concurrently to the same LMRef entry. This LMDir entry stores the LMBlndx, the size of the information stored at that index, the type of the information (non-data), and the allocation flag. The masterLMRef is updated with the index of this LMDir entry. The LMEngn then performs the new block allocation procedure explained in the previous paragraph.
If the block is a member of a multi-block, the allocation request includes the serialID that specifies the position of the block in the multi-block. If the block allocation request is the first request for the multi-block, two unallocated LMDir entries are required. The first LMDir entry is used to track the position of all blocks of the multi-block that are stored in data storage, and the second LMDir entry is used to track the requested block in the multi-block. The masterLMRef is updated with the index of the first LMDir entry, and the entry in the LMRef table of the core is updated with the index of the second LMDir entry. The first LMDir entry indexes an LMBlndx that stores the serialID, size of the block, LMBlndx of the block and block allocation flag for each block as data. This LMDir entry stores information similar to that used when multiple blocks are allocated to the same LMRef entry. The second LMDir entry stores information about the requested block. The index of the second LMDir entry is stored only in the LMRef table of the requesting core, as the thread needs to access only its own data. If the block allocation request is not the first allocation request in a multi-block, the entry in the masterLMRef and the first LMDir entry are already allocated. The LMEngn then identifies the serialID of the block from the block allocation request and allocates it as a new block.
If a region is shared by multiple threads, some of the blocks are also shared among the threads. The compiler generates block allocation requests that mark those blocks as shared between multiple threads. All the sharer threads therefore initiate the allocation request, but only the first thread actually stores the block and the other requests are ignored. The allocation request for this type of block consists of the blockSID, the LMRef entry, and the size of the block. The LMRef entry of the masterLMRef is checked for the allocation flag. If the LMRef entry is not allocated yet, this is the first allocation request for the shared block, and the LMEngn initiates a typical new block allocation procedure. However, if another thread initiates an allocation request for the same block, it finds the LMRef entry occupied (allocated) with a block and finds an LMDir entry with the same blockSID that is flagged as shared data. The LMEngn then ignores the allocation request, and the number of blocks in the masterLMRef entry is incremented by one.
Data Access:
The data access request in LMStr memory comprises the LMRef entry, the offset and the size of the data. The LMRef entry indexes an LMDir entry that points to an LMBlndx entry. The LMBlndx is the starting position of the data block that holds the requested data item. The LMEngn adds the offset to the LMBlndx to get the actual position of the requested data and sends the data. For a single-threaded application with static scheduling, a data access request of this form is adequate to get the actual data. For a multithreaded application running on multiple cores, the memory request must be extended because of the out-of-order execution caused by dynamic scheduling of threads controlled by the operating system. Moreover, in a multithreaded application, a thread returning from a context switch also needs to find its data efficiently. To overcome these issues, our design extends the memory access request with the blockSID and, in some cases, the serialID.
The situation becomes complicated when a thread moves to the idle state and later returns to the active state (context switch), leaving its block in data storage. At the time of context switching, the LMEngn removes the information in the LMRef table of the core and the corresponding entries in the LMDir. When the thread is re-activated, the first instruction to access a block finds the LMRef entry empty. The LMEngn therefore accesses the entry of the masterLMRef of the process (the same as the requested LMRef entry) and finds the LMDir index. This LMDir entry indexes an LMBlndx where the information about the blocks is stored. As multiple blocks are mapped to the same LMRef entry, the blockSID is used to find the block of the thread. An LMDir entry is allocated to store the blockSID and LMBlndx of the requested block. The LMRef entry is updated with this LMDir entry, and the data can then be accessed through the LMRef entry. This data access procedure is shown in Figure 6, where index1 in the LMDir stores the index (LMBlndx1) of the LMStr data storage where the block information is stored. The starting position of the block in data storage is LMBlndx, and the requested data is stored in LMBlndx2. The serialID is used to find the exact LMBlndx of the requested data block in a multi-block, following a procedure similar to that for multiple blocks mapped to the same LMRef entry.
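The two lookup paths described above, the direct one through the core's LMRef table and the fallback through the masterLMRef after a context switch, can be sketched as follows. The function `access` and the dictionary layouts are hypothetical stand-ins for the hardware tables, and the block information record kept in data storage is modeled as a plain mapping from blockSID to LMBlndx.

```python
def access(core_lmref, master_lmref, lmdir, block_info, lmref_entry, block_sid, offset):
    """Return the position in LMStr data storage of the requested data item.

    core_lmref:   dict LMRef entry -> LMDir index (the core's LMRef table)
    master_lmref: dict LMRef entry -> LMDir index of the non-data entry
    lmdir:        list of dicts, each holding an "LMBlndx"
    block_info:   dict LMBlndx -> {blockSID: LMBlndx of that block}
    """
    dir_idx = core_lmref.get(lmref_entry)
    if dir_idx is None:
        # Fallback after a context switch: go through the masterLMRef and
        # use the blockSID to find this thread's block.
        info_at = lmdir[master_lmref[lmref_entry]]["LMBlndx"]
        base = block_info[info_at][block_sid]
        lmdir.append({"LMBlndx": base, "blockSID": block_sid})   # new LMDir entry
        dir_idx = len(lmdir) - 1
        core_lmref[lmref_entry] = dir_idx                        # refill the LMRef entry
    # Direct path: LMRef -> LMDir -> LMBlndx, then add the offset.
    return lmdir[dir_idx]["LMBlndx"] + offset

# Hypothetical state: two blocks (blockSID 0 and 1) share LMRef entry 5.
lmdir = [{"LMBlndx": 100}]
block_info = {100: {0: 2048, 1: 4096}}
master, core = {5: 0}, {}
print(access(core, master, lmdir, block_info, 5, 1, 8))   # 4104, via the masterLMRef
print(access(core, master, lmdir, block_info, 5, 1, 8))   # 4104, via the refilled LMRef entry
```

The second call takes the direct path because the first call re-populated the core's LMRef entry.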
Block Deallocation:
The block deallocation request for a single-threaded application consists of the LMRef entry only. The LMRef entry contains the size of the block and the index of an LMDir entry that points to an LMBlndx. The data block at LMBlndx is either discarded or written back to main memory, depending on the type of the block.
In a multithreaded application, however, the deallocation request has additional parameters: the blockSID and, in some cases, the serialID. When a block deallocation request is issued, the masterLMRef and the LMDir entry it points to are checked for the position of the block. The LMEngn removes the block and updates the LMDir and the masterLMRef. However, if the block was recently used, the LMRef entry can be used directly to deallocate the LMDir entry and the LMBlndx of the data block.
Inter Block Data Update:
A multithreaded application running on a multicore system executes multiple threads in parallel, which leads to out-of-order block allocation. Because blocks consist of multiple data items, out-of-order block allocation introduces multiple copies of the same data item in the LMStr data storage. Some of these copies are stale and must be updated before use. Therefore, an instruction consisting of an LMRef entry, offset, data size, and blockSID (sometimes also a serialID) is issued for each data item whose old copies in other blocks must be updated. Typically, only static, global, and array variables require inter-block data update.
Generally, after a data item is last used in a block, the inter-block data update request checks for the existence of the block containing the data item by accessing the masterLMRef. If the block is found, the data is updated; if not, the LMEngn ignores the request. The number of inter-block data update requests can be reduced by tracking whether each data item in a block is dirty. However, this requires an extra bit for each data item in data storage, which wastes scratchpad space.
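A minimal sketch of how an inter-block update request might be resolved is given below. The `blocks` dictionary stands in for the lookup through the masterLMRef, and the function name, the byte-level representation, and the bounds check are assumptions for illustration.

```python
def inter_block_update(blocks, block_sid, offset, data):
    """Refresh a stale copy of a data item held in another block.

    Returns True if the block was found and updated, and False if the
    request was ignored because the block no longer exists.
    """
    block = blocks.get(block_sid)            # stands in for the masterLMRef lookup
    if block is None:
        return False                         # block not found: request is ignored
    if offset + len(data) > len(block):
        raise IndexError("update falls outside the block")  # assumed check
    block[offset:offset + len(data)] = data  # overwrite the old copy
    return True

blocks = {1: bytearray(8), 2: bytearray(8)}
new_value = (42).to_bytes(4, "little")
print(inter_block_update(blocks, 2, 4, new_value))  # block 2 holds a stale copy
print(inter_block_update(blocks, 3, 0, new_value))  # block 3 is not in data storage
```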
Data storage De-fragmentation
Because block sizes vary widely, the holes that block de-allocation creates in the storage space cannot always be completely filled with new blocks, so de-fragmentation is required. The situation is exacerbated when multiple multi-threaded processes share the same storage. The holes must therefore be filled to make room for new blocks. The LMEngn performs de-fragmentation while the storage space is idle. During de-fragmentation, the data blocks are moved to fill the holes, and the corresponding shared table entry (LMDir entry) is updated with the new position of each block.
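De-fragmentation of this kind amounts to compaction with pointer fix-up. The sketch below assumes a simple sliding compaction toward address zero; the text does not specify the policy the LMEngn uses, and the dictionary layout of the LMDir entries and the `defragment` name are hypothetical.

```python
def defragment(storage, lmdir):
    """Slide live blocks toward address 0 and fix up their LMDir entries.

    storage: bytearray modelling the LMStr data storage.
    lmdir:   list of dicts with 'lmblndx' (block start) and 'size' in bytes.
    Returns the first free position after compaction.
    """
    write = 0
    for entry in sorted(lmdir, key=lambda e: e["lmblndx"]):
        start, size = entry["lmblndx"], entry["size"]
        if start != write:
            # The right-hand slice is copied first, so overlapping moves are safe.
            storage[write:write + size] = storage[start:start + size]
            entry["lmblndx"] = write     # LMDir entry gets the new block position
        write += size
    return write

# Hypothetical storage with holes ('.') left by de-allocated blocks.
storage = bytearray(b"AA..BBB.C")
lmdir = [{"lmblndx": 8, "size": 1}, {"lmblndx": 0, "size": 2}, {"lmblndx": 4, "size": 3}]
free_from = defragment(storage, lmdir)
print(bytes(storage[:free_from]), free_from, lmdir)
```

Because only LMDir entries hold block positions, the LMRef tables of the cores do not need to change when blocks move.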
EXPERIMENTAL SETUP
To implement the LMStr, we need compiler support as well as modifications to the memory hardware. The compiler extension is required to generate blocks consisting of similar types of variables in a region. Moreover, the compiler is responsible for generating instructions that move blocks to and from the scratchpad, generating data access requests, and maintaining data consistency between multiple copies. Since the compiler is not the focus of this work and is one of our future work items, we bypassed the compiler modification by creating a similar environment with a block generator tool that generates the attributes a compiler would generate. The LMStr hardware is implemented in the "Ariel" processor framework that is part of the Structural Simulation Toolkit (SST) [32]. In SST, the application executes on Ariel cores, which are included in the simulation toolkit. The memory references of the applications running on the cores are captured with an attached PIN tool [24] and used to generate blocks. We use the blocks to estimate the performance of the local memory store (LMStr) for multicores. Figure 7 shows the structure of the LMStr model in SST. In the following subsections, we briefly discuss block generation, the block decider, the LMStr configuration in the SST simulator, application characteristics, and performance metrics.
LMStr Block Generation
No existing compiler generates data blocks consisting of similar types of variables and, to our knowledge, no simulator has caches with variable-sized blocks. However, memory access patterns and the memory mapping in the virtual address space provide insights from which we can generate variable-sized data blocks.
Current compilers optimize the data access pattern to exploit the spatial and temporal locality of data. Also, under the current mapping convention, data variables of the same type in a procedure or function are grouped together and mapped into segments in the virtual address space. Therefore, by carefully observing the sequence of memory accesses, we can identify localized data and generate data blocks from it that resemble compiler-generated blocks.
In our implementation, we generate the memory references via a PIN tool [24] and pass the memory requests through a window. We then combine consecutive memory addresses to generate blocks according to their types. The block generator we built combines memory references according to their spatial locality.
The number of entries in the window has a great impact on generating accurate block sizes. Too few entries cannot capture the typical locality of the data accesses and can mislead by generating smaller blocks that underestimate spatial locality. Conversely, too many entries keep blocks in the window for extended periods, which may generate larger blocks that overestimate spatial locality.
In our experiments, we found that the best number of entries in the window is 128. Moreover, we assumed a minimum block size of four bytes and a maximum block size of 1 KByte. After a block is generated, a decider function determines whether it is stored as an LMStr block or a non-LMStr block. This LMStr blk decider chooses to store a block in the LMStr if it is accessed multiple times (Figure 7).
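The window-based block generation and the decider can be approximated with the sketch below. It is not the authors' tool: the merge rule (a reference joins a block it overlaps or touches), the first-in-first-out eviction from the window, the rounding of small references up to the minimum block size, and all function names are assumptions. Only the window size, the block size limits, and the "accessed multiple times" criterion come from the text.

```python
WINDOW_ENTRIES = 128   # window size found best in the experiments
MIN_BLOCK = 4          # minimum block size in bytes
MAX_BLOCK = 1024       # maximum block size in bytes (1 KByte)

def generate_blocks(refs, window_entries=WINDOW_ENTRIES):
    """Merge (address, size) references into (start, end, accesses) blocks."""
    window = []    # open blocks as [start, end, accesses], oldest first
    blocks = []    # blocks that have left the window
    for addr, size in refs:
        lo, hi = addr, addr + max(size, MIN_BLOCK)
        for blk in window:
            new_lo, new_hi = min(blk[0], lo), max(blk[1], hi)
            adjacent = lo <= blk[1] and hi >= blk[0]   # overlaps or touches the block
            if adjacent and new_hi - new_lo <= MAX_BLOCK:
                blk[0], blk[1], blk[2] = new_lo, new_hi, blk[2] + 1
                break
        else:
            window.append([lo, hi, 1])                 # start a new block
            if len(window) > window_entries:
                blocks.append(tuple(window.pop(0)))    # evict the oldest block
    blocks.extend(tuple(blk) for blk in window)        # flush at the end of the trace
    return blocks

def is_lmstr_block(block):
    """Decider: keep a block in the LMStr only if it was accessed more than once."""
    return block[2] > 1

# Hypothetical trace of (address, size) memory references.
trace = [(0x1000, 4), (0x1004, 4), (0x2000, 8), (0x1008, 4)]
for blk in generate_blocks(trace):
    print(hex(blk[0]), hex(blk[1]), blk[2], is_lmstr_block(blk))
```

Running the same trace with a very small `window_entries` value splits references that would otherwise merge, which is the underestimation effect described above.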
After deciding whether a block will reside in the LMStr, the access profile of the block must be analyzed, which is done by the LMStr blk analyzer. Profiling the blocks is necessary to determine the required size of the LMStr and to estimate the number of entries required in the LMRef table for each process. In practice, the LMStr architecture prefetches compiler-generated blocks into the LMStr at their first access. In our simulations, by contrast, we generate the blocks by combining consecutive memory references at program execution time. Therefore, we make no assumption about the size of a block when it is first accessed; we learn the actual size of the block when it is evicted from memory.
Benchmarks
To understand LMStr performance, we inspect the benchmarks' code and identify the types of their variables. We then map some of the variables to the scratchpad and the rest to the cache. We use matrix multiplication and four benchmarks from the Mantevo benchmark suite, drawn from different application domains. For matrix multiplication, we manually identify the static, global, local, and array variables in the application's code and map their virtual addresses to the LMStr. The temporary variables are identified indirectly, based on the definition of the variable: we assume that the remaining variables that are not accessed by multiple cores are possibly temporary variables. We used a parallel matrix multiplication code written in C/C++ and OpenMP. Similarly, we selected only those Mantevo applications that are written in C/C++ and have a parallel OpenMP version, as SST supports only OpenMP for parallel execution. Table 1 summarizes the characteristics of the benchmarks used from the Mantevo suite, developed at Sandia National Laboratories [14].
LMStr and Cache Configuration
To evaluate the performance of the LMStr, we simulate a multicore while varying the number of cores and compare the performance of the LMStr against a system whose cache hierarchy is similar to that of an Intel Sandy Bridge processor. The processor configuration used in the simulation is summarized in Table 2. We obtained the cache access times from published information about Intel Sandy Bridge processors [19]. The required LMStr size is application dependent. We identified the maximum number of LMRef entries, LMDir entries, and the LMStr size required for each configuration and benchmark. Tables 3 and 4 summarize the number of entries needed in the LMRef table and the size of the LMStr for each application. For simplicity, we estimated the performance of an LMStr with a 64-entry LMRef table, a 256-entry LMDir table, and 128 KBytes of data storage. We chose this configuration because it satisfies 90% of our applications' requirements.
Moreover, we estimated the area, the energy, and the number of cycles needed to access the different parts of the LMStr. The access time, energy, and area of the LMRef table, LMDir table, and LMStr data storage were calculated using CACTI 5.3 (rev 174) for a 32nm process technology [26]. We estimated the access time of an LMStr data access as the sum of the access times of the individual LMStr components (0.886ns) and its energy as the sum of the energies of the individual components (0.029nJ). We ignored static power when calculating energy, as it was very small. We assumed that reading and writing take almost the same amount of energy in both the cache and the LMStr. Further, the bus between the cores and the cache or LMStr is 16 Bytes wide, and the bus to and from main memory is 32 Bytes wide.
One limitation of our work is that we ignored contention while estimating performance. Furthermore, because the memory reference trace is dynamic, it also contains the not-taken side of branches, which has a partially negative effect on performance. This differs from what a compiler would obtain from static analysis of the application (i.e., dynamic branch behavior affects utilization of the LMStr). The effect is lower utilization and pollution of the LMStr.
LMSTR PERFORMANCE
In this section, we explore the performance of different applications running on a multicore processor with an LMStr and/or caches. For the Mantevo benchmarks, we map the whole application to the LMStr and compare the performance with a cache-only configuration.
LMStr Performance for Matrix Multiplication
For matrix multiplication, we group the variables according to their types and store one specific type of variable in the scratchpad at a time. When local variables are stored in the scratchpad and the rest of the variables in the cache, the total data movement to/from main memory is reduced by 6-9% compared to storing all the variables in the cache. When local and temporary variables are mapped to the scratchpad memory, data movement decreases as the core count increases. In contrast, data movement increases by more than 200% when the arrays are mapped to the scratchpad, as our block generator was unable to exploit the locality of the arrays' access pattern in matrix multiplication. As a consequence, mapping the entire application to the scratchpad increases the data movement to/from main memory. Figure 9 shows the miss rate of the cache while storing different types of variables in the cache for matrix multiplication. The miss rate drops to around 0.01% when the array is mapped to the scratchpad, because the array access pattern in matrix multiplication otherwise pollutes the cache and increases the miss rate. The miss rate stays consistent with that of the cache alone when local and static variables are stored in the scratchpad. The miss rate of the cache decreases slightly as the core count increases when local or temporary variables are mapped to the scratchpad.
LMStr Performance for Mantevo mini-applications
The data movement between the LMStr and main memory increases for almost every Mantevo miniapp when all the data is stored in the scratchpad, compared to cache alone, as shown in Figure 10. The L3 cache is 3 MBytes, which is 24 times the size of the LMStr. However, as the LMStr is proposed to store only localized data, the data movement does not increase with size. Another important performance metric is the data movement within the cache hierarchy, which can be compared with the data movement between the LMStr and main memory. This metric is shown in Figure 11. MiniMD has more than 40% less data movement compared to cache alone, and all the other benchmarks also have significantly less data movement compared to cache. Our experiments show that, for the LMStr, data movement tends to increase as the number of cores increases.
Another important performance metric is the memory access time of the LMStr, which is shown in Figure 12. To measure it, we introduce a new performance metric, Cycles Per Memory Access (CPMA), defined as the average number of cycles required to resolve a memory request. All of the benchmarks show a significant reduction in memory access time compared to cache alone. However, the access time increases as the core count increases. The results show that the maximum gains in access time are achieved when the local variables are mapped to scratchpad memory. Figure 13 shows the estimated energy consumption of the LMStr and compares it with that of a conventional cache. The required energy (E_T) in the cache hierarchy comes from two sources. The first is the energy required by the CPU to read or write data in the cache/LMStr (E_o). The second is the energy required to write a cache line or allocate a data block as data moves through the cache hierarchy (E_d).
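Both metrics reduce to simple arithmetic, sketched below under stated assumptions. The sketch assumes that the two energy sources simply add up (E_T = E_o + E_d) and that each is an event count multiplied by a fixed per-event energy; the function names, the event counts, and the per-movement energy are hypothetical placeholders, not measured values.

```python
def cpma(total_memory_cycles, memory_accesses):
    """Cycles Per Memory Access: average cycles needed to resolve a request."""
    return total_memory_cycles / memory_accesses

def total_energy(accesses, e_access, movements, e_movement):
    """E_T = E_o + E_d, in the same unit as the per-event energies."""
    e_o = accesses * e_access        # reads and writes issued by the CPU
    e_d = movements * e_movement     # cache-line writes or block allocations
    return e_o + e_d

# Hypothetical counts. 0.029 nJ is the per-access LMStr energy quoted in the
# configuration section; the 0.5 nJ per-movement energy is a made-up placeholder.
print(cpma(total_memory_cycles=12_000, memory_accesses=4_000))
print(total_energy(accesses=4_000, e_access=0.029, movements=100, e_movement=0.5))
```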
Our estimates show that, for the Mantevo applications, the energy consumption of the LMStr is lower than 20% of that of the cache hierarchy. Furthermore, the LMStr energy consumption increases with the number of cores in the system. Table 5 summarizes the area overhead of the LMStr compared to cache only. The area increases when the system employs both the LMStr and a cache. However, the area decreases by more than 95% if the LMStr is used without a cache as the only on-chip memory.
The number of instructions executed on an LMStr multicore increases compared to a cache-only multicore because data is moved in blocks. Figure 14 shows the percentage increase in instructions compared to cache alone when storing specific variable types or the whole application's data. Local, temporary, and static variables cause only a very small increase in instructions with the LMStr, whereas storing the whole application's data causes a significant increase. Moreover, the instruction count increases with the core count.
RELATED WORK
In this section, we present the related work on SPM use for multithreaded applications. Most of the previous research on SPMs focused on efficient algorithms to identify critical variables or instruction groups to store in the scratchpad [3, 28, 31, 37, 41] . Some of the work focused on dynamic control of the scratchpad by hardware and algorithm extensions [9, 15, 20, 27, 28, 35] .
Some research proposes a complete software-only scratchpad management solution that stores only stack data. Shrivastava et al. [33] propose mapping the stack data to the SPM and the rest of the data to the cache. They insert stack management instructions in the application binary where needed and thus achieve a 32% energy reduction compared to hardware SPM.
Bai et al. [6, 7, 8] proposed limited local memory (LLM), which uses only SPM as on-chip memory. LLM consists of a private SPM per core, where one of the cores is the master and the rest are worker cores. The master core assigns jobs to the worker cores. At compile time, the program is divided into several jobs (function, stack, and heap), the SPM space required for the jobs is identified, and SPM movement instructions are inserted. At runtime, the targeted job (actually its data) is moved to the cores under the supervision of the master core. They achieve a 14% runtime reduction when the heap is targeted for SPM [6] and use 14.3% less space when the stack is mapped to SPM [8]. However, data not assigned to the SPM is sent directly to the processing units (cores), which incurs significant performance penalties.
SSDM is another no-cache architecture, proposed by Lu et al. [22]. They use weighted call graphs (WCG) to insert function data placement in the code. However, they evaluated the algorithm with stack data, where the whole stack data of a function is moved at once; this introduces unnecessary data movement but reduces the overhead by 13% compared to other software algorithms.
Significant research has also been pursued on heterogeneous memory architectures consisting of SPM and cache. Cong et al. [11] identify the optimum proportion between SPM and cache space. They dynamically move blocks from SPM to cache according to data utilization and achieve an 18-25% reduction in energy-runtime compared to non-adaptive heterogeneous memory. However, the movement between SPM and cache introduces consistency issues and needs coherence mechanisms [1, 2].
Alvarez et al. [2] propose generating code to update stale data, with the hardware responsible for diverting accesses to the updated data. This technique achieves a 1.14x speedup, a 17% energy reduction, and a 29% traffic reduction compared to a conventional cache system. Alvarez et al. [2] also propose another coherence mechanism for hybrid systems, which uses a directory and guarded instructions for inconsistent memory accesses. This mechanism achieves a 38% speedup and a 27% energy reduction compared to a cache system. On the other hand, Modified Integer Linear Programming SPM job mapping and scheduling mechanisms have been proposed to use SPMs efficiently in MPSoC architectures [38, 43]. However, they limit parallelism and keep a single copy of data to eliminate coherence problems.
Guo et al. [12] divide programs into regions that can run in parallel on multicores. They use a private SPM in each core that is virtually shared. This mechanism assigns tasks statically to fixed cores according to a task/region graph and reduces memory accesses by 17% and energy by 18%. Guo et al. [13] propose a data allocation strategy for multicore systems that achieves fast access and avoids coherence issues by storing only read data in multiple local SPMs in the requesting cores. Their profile-guided mechanism achieves speedups of around 33-40% compared to an SPM shared among cores.
Liu et al. [21] design three multi-level SPM architectures with different configurations based on cache and scratchpad placement. Their compiler-guided data allocation strategy reduces energy by 31% compared to single-level scratchpad memory. Lu et al. [23] propose a cost function to assign code to scratchpad memory to improve the speed of program execution. They improve the runtime by 20-80% compared to cache.
Komuravelli et al. [17] proposed a hybrid architecture (STASH) that tries to achieve the benefits of both SPM and cache. They keep the SPM globally visible by mapping the local SPM with a global directory. They test the performance with kernel and application data and achieve a 13-27% speedup as well as a 35-53% energy reduction compared to cache and SPM.
In our previous work, we estimated the performance of LMStr for a single core [34] and for multicores [35]. This work differs from our previous work by presenting power estimation and detailed data movement between memory hierarchies. Additionally, we present the compiler work in detail in this paper. Our estimates show that LMStr consumes less than 20% of the energy of a conventional cache hierarchy.
Our design has some fundamental differences from all of the previous work. It provides a new methodology for using SPMs. We do not focus on specific variables only; rather, we focus on storing the variables suggested by the programmer or compiler. Because the programmer suggests the variable or the type of the variable, the complexity of our approach is much lower than that of other approaches. Our design is a generalized solution for any variable type in a multithreaded application. We build blocks from variables of the same type, which eliminates the unnecessary insertion of coherence procedures for most data blocks (they are required only for static and global variables). Further, keeping a single copy of the same data eliminates the need for a critical coherence mechanism. Moreover, we propose pure hardware management of the storage space while retaining the need for compiler support. Our design keeps the storage virtually shared among the cores of the processor, which eliminates the need for any complex task assignment and scheduling procedure. Additionally, we suggest dividing the program at function or basic block boundaries, which reduces the complexity of the compiler.
CONCLUSIONS AND FUTURE WORK
In conclusion, there is a current pressing need for new on-chip memory architectures that can alleviate the problems of conventional caches. Current cache hierarchies suffer from many disadvantages such as poor utilization, high power consumption, and increased data movement to/from main memory. With technology moving from multicores to manycores, such disadvantages are further exacerbated. We propose a hardware-controlled software-assisted on-chip shared scratchpad memory (LMStr) alongside conventional caches. Our results show that LMStr can achieve 5-20% reduction in data movement while storing temporary or local variables in LMStr. Moreover, for almost all of the applications, LMStr achieves up to 40% reduction in cycles per memory access and 85% reduction in energy consumption compared to caches only.
The core focus of this paper is on using an SPM as shared on-chip storage for multicore processors. The performance of LMStr in a multicore depends particularly on managing the shared storage among multiple multithreaded processes. The compiler can work without programmer assistance, but the programmer can provide hints to the compiler, e.g., requesting that certain variables be allocated in the LMStr directly. Although a compiler is a fundamental component for successfully using LMStr, since it identifies and manages data blocks in the LMStr, we have only implemented a block generator that generates blocks on the basis of data access locality.
In the future, a full compiler implementation and a cycle-accurate simulator are the main thrusts moving forward. We also plan a distributed design and implementation of LMStr to avoid the serialization caused by the LMEngn being a central control unit. Furthermore, we plan to investigate the security features that LMStr can provide to modern architectures.
