Abstract: Processor hardware has been architected on the assumption that most data access patterns are linearly spatial in nature. However, most applications involve algorithms that are designed with optimal efficiency in mind, which results in non-spatial, multi-dimensional data access. Moreover, this data view or access pattern changes dynamically across program phases. The result is a mismatch between the processor hardware's view of data and the algorithmic view of data, leading to significant memory access bottlenecks. This variation in data views is especially pronounced in applications involving large datasets, leading to significantly increased latency and user response times. Previous attempts to tackle this problem primarily targeted execution time optimization. We present a dynamic technique, piggybacked on classical dynamic binary optimization (DBO), that shapes the data view differently for each program phase, reducing program execution time as well as access energy. Our implementation rearranges non-adjacent data into a contiguous dataview. It uses wrappers to replace irregular data access patterns with a spatially local dataview. HDTrans, a runtime dynamic binary optimization framework, is used to perform the runtime instrumentation and dynamic data optimization needed to achieve this goal. Some of the commonly used benchmarks from the SPEC 2006 suite were profiled to find irregular data accesses in procedures that contribute heavily to the overall execution time. Wrappers built to replace these accesses with spatially adjacent data led to a significant improvement in total execution time. On average, a 20% reduction in execution time was achieved along with a 5% reduction in energy.
I. INTRODUCTION
Non-spatial data access is a leading contributor to memory access latency in most applications, resulting in longer program execution times and slower response times. The primary contribution of this work is an effective approach to reduce data access time in commonly used applications. The negative impact of non-spatial data access is reduced by creating a Dynamic Dataview of spatially adjacent data, which replaces these irregular data access patterns at runtime. The dynamic binary optimization capabilities of HDTrans [1] [2] are leveraged to achieve this goal. This study also evaluates the performance of three data stores that host the dynamically shaped data: the Dynamic Data View Array (DDVA), the Tagless D-Cache and Scratchpad Memory. This implementation also helps reduce the energy consumed for data access. The applicability of this scheme to various common applications in the SPEC2006 benchmark suite [3] was studied.
Hardware implementation tends to be much simpler for a linear layout of address spaces. It is for this reason that most commercial processors have a spatially linear view of data. Applications, on the other hand, make extensive use of optimal algorithms, tailor-made for program efficiency. This often results in a spatially non-adjacent view of data from the application. This mismatch between the processor's view of data and the algorithm's view of data results in several performance bottlenecks, such as increased memory access latency, increased program execution time, increased memory bandwidth demand and greater power consumed for each data access. This performance loss is much more obvious in emerging applications, due to a magnified mismatch resulting from their highly non-spatial algorithmic data views.
II. RELATED WORK
Our Data Shaping approach seeks to reduce execution time and energy by dynamically emitting spatially adjacent data from runtime data stores. Earlier work focused on overlapping data access with computation by introducing newer software-based cache designs for non-blocking access, prefetching, identifying and storing frequent instructions, and managing spatial and temporal locality through independent parts. Some techniques were aimed at optimizing specific data structures in pointer-based recursive applications and in those with array references. Other related work targeted code transformations, runtime data and iteration reordering transformations, and interleaving schemes to reduce DRAM row-buffer conflicts. Even though some of these approaches may succeed in reducing memory access latency and execution time, they fail to address the relatively high energy use of these applications.
Non-blocking caches and prefetching caches [4] are two techniques for hiding memory latency by exploiting the overlap of processor computations with data accesses. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed, thus exploiting post-miss operations. A prefetching cache generates prefetch requests to bring data in the cache before it is actually needed, thus allowing overlap with pre-miss computations. There are also some hybrid approaches that combine the benefits of both these schemes.
Another work [5] proposes code transformations to increase parallelism in the memory system by overlapping multiple read misses within the same instruction window, while preserving cache locality. This approach claims to deliver execution time reductions averaging 20% in a multiprocessor and 30% in a uniprocessor due to significant increases in memory parallelism.
A software-controlled prefetching scheme targeted at pointer-based applications with recursive data structures has been proposed in [6]. This method claims a 45% improvement in execution time. A HotSpot instruction cache proposed in [7] dynamically identifies frequently accessed instructions and stores them in a smaller L0 cache. This approach achieves a 52% reduction in instruction cache energy without performance degradation.
III. BACKGROUND AND SIGNIFICANCE
The following subsections discuss the significance of Spatial and Temporal Locality in helping reduce access latency, Dynamic Binary Optimization as an effective solution to address these issues and the internals of the HDTrans framework as used in our implementation.
A. Spatial and Temporal Locality
The two classical attributes of data represented as linear memory-mapped data views are spatial locality and temporal locality. A processor pays a cost of 40-100 cycles to fetch a data item from memory the first time it is seen. However, all future accesses resulting from temporal locality cost just one to two cycles if the data item is cached in the on-chip L1 cache on the first access. Data is fetched into the L1 cache in a chunk of multiple data items, called a cache block, in order to amortize the fetch cost over multiple accesses.
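As a rough illustration of this amortization, consider a simplified cost model. The symbols $C_{miss}$, $C_{hit}$ and $B$ are introduced here for illustration only, and the block size of $B = 8$ items is an assumption; the cycle counts are taken from the ranges quoted above. If all $B$ items of a cache block are used, the first access misses and the remaining $B - 1$ accesses hit, so the average cost per item is

\[
C_{avg} = \frac{C_{miss} + (B - 1)\,C_{hit}}{B} = \frac{100 + 7 \cdot 2}{8} = 14.25 \text{ cycles}.
\]

If instead only one item of each fetched block is used, every access pays the full miss cost, $C_{avg} = C_{miss} = 100$ cycles. Under these assumptions, the gap between the two cases is what a spatially adjacent data view tries to recover.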
Memory bandwidth has always been a show-stopper in computer architecture, hence the popular term "memory wall". Exploitation of locality is the primary mechanism to overcome the memory wall. Dynamic data shaping takes this a step further and rearranges data along a view ideal for the current program context. This helps speed up program execution by a factor of 10 or more. The platform independence of these optimizations makes them very appealing for performance enhancement.
B. Dynamic Binary Optimization
Binary Translation (BT) is a technique to convert binaries available in one ISA into another ISA [8]. Binary translation can be either static or dynamic. Static Binary Translation [8] performs interpretation, which happens one instruction at a time. Dynamic Binary Optimization (DBO) [9] performs dynamic translation and execution of application binaries by actively carrying out runtime code instrumentation.
C. HDTrans -Architectural Overview
HDTrans [1] [2] performs IA-32 to IA-32 binary translation with very simple and effective translation techniques. It is a very lightweight system that also uses established optimizations such as trace linearization and code caching. HDTrans is the Dynamic Binary Optimization framework used in our implementation, primarily due to its modularity, simplicity, resourcefulness and open-source nature. HDTrans executes in a coroutine fashion with the binary image of the application to be translated. It maintains basic blocks, which are sequences of straight-line instructions bracketed by branches. These blocks are translated into a Basic Block Cache (BBCache). Other existing translators follow a complex translation strategy with intermediate code generation, trace optimizations and register reallocation. In these translators, most of the execution time is spent in the code cache, thereby significantly reducing the benefits gained from translation. HDTrans avoids these pitfalls by avoiding intermediate code generation, target code optimization and register reallocation.
IV. DESIGN AND IMPLEMENTATION
The code block in Figure 1 is a snippet of a linked list access function, where the data of all nodes is accessed repeatedly. Note that the accessed data in the DS0 linked list has no spatial locality. The high-level meta-wrapper, or datashaper specification, proposed in this work could take the form shown in Figure 2. The wrapper is a small piece of code which transforms the architecture-dependent data view (the irregular accesses of Figure 1) into an algorithm-amenable data view (potentially performance- and energy-optimized accesses) through temporary storage structures, referred to as the Dynamic Data View Array (DDVA). In Figure 2, a simple linear array (buf[j]) is used to coalesce data from the structure DS0, and any further reference to DS0 is piped to the DDVA structure, buf[j]. Accesses to more than one field of a complex data structure could be transformed to one or more DDVA structures.
/* Iterative structure for temporal locality */
for (i = 0; i < k; i = i + 1) {
    /* Linked List data access loop
Our work involves identifying procedures in benchmarks with significant non-spatially adjacent data access and with notable contributions to overall execution time, integrating the HDTrans dynamic binary optimization framework with the targeted applications, and building a contiguous dynamic dataview of such non-spatial data. Significant effort was also channeled towards modeling the performance of three efficient data stores. Table I lists these implementation phases.
/* Wrapped scope with the Data Shaper */
for (i = 0; i < k; i = i + 1) {
    // DATA WRAPPER
    /* Temporal Locality - First Epoch */
    if (i == 0) {
        /* Li
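Since the listings of Figures 1 and 2 are truncated here, the following is a minimal sketch of the transformation they describe, not the authors' code. Only the loop variables `i` and `k`, the array `buf[j]` and the first-epoch test `i == 0` come from the figures; the node type, the function names and the `cap` parameter are hypothetical. The sketch assumes the list is not modified while the loop runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type standing in for the DS0 linked list. */
struct node {
    int data;
    struct node *next;
};

/* Unshaped access: every pass chases pointers through scattered nodes. */
long sum_unshaped(const struct node *head, int k)
{
    long total = 0;
    for (int i = 0; i < k; i = i + 1)
        for (const struct node *p = head; p != NULL; p = p->next)
            total += p->data;
    return total;
}

/* Shaped access: the first epoch (i == 0) walks the list once and copies
 * the data field into the contiguous array buf; every later epoch reads
 * buf[j] instead of the list. Assumes at most cap nodes. */
long sum_shaped(const struct node *head, int k, int *buf, size_t cap)
{
    long total = 0;
    size_t n = 0;                       /* number of items gathered */
    for (int i = 0; i < k; i = i + 1) {
        if (i == 0) {
            for (const struct node *p = head; p != NULL; p = p->next) {
                assert(n < cap);        /* buf must be large enough */
                buf[n] = p->data;
                total += buf[n];
                n = n + 1;
            }
        } else {
            for (size_t j = 0; j < n; j = j + 1)
                total += buf[j];        /* spatially adjacent reads */
        }
    }
    return total;
}
```

Both functions return the same sum; the shaped version pays the pointer-chasing cost only in the first epoch, which is where the temporal locality of the outer loop is exploited.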
TABLE I
Phase | Implementation Activities
Profiling | SPEC2006 benchmarks were profiled to identify procedures with non-spatial data access and those which contribute significantly to the overall execution time.
DBO Framework Integration | The HDTrans Dynamic Binary Optimization framework was integrated with the selected benchmarks to enable basic block creation and program optimization at runtime.
Data Shaping | A Dynamic Data View Array (DDVA) was created to cache frequently accessed non-spatial data to emit spatially adjacent data replacing irregular access patterns at runtime.
Performance Evaluation Framework | An evaluation framework that provides an effective measure of execution time and data access energy for the original and data shaped benchmarks was designed.
The later sections describe these phases in more detail.
A. Profiling
The preliminary phase of our work involved profiling various applications of the SPEC2006 benchmark suite and identifying procedures or functions with a significant amount of non-spatially adjacent memory access in their computation. The target functions were also chosen to contribute significantly to the overall execution time of the application, so that subjecting them to the data shaping process yields a noticeable improvement in overall performance. Only gcc-based applications were chosen, to avoid possible compatibility issues with HDTrans, the gcc-targeted dynamic binary optimization framework used in our implementation. gprof [10], a widely used Linux-based profiling tool, was used to profile various applications from the benchmark suite.
After extensive profiling and analysis for non-spatially adjacent data access, the mcf and h264ref benchmarks were shortlisted for optimization using our data shaping approach. The chosen benchmarks, their targeted procedures, and the contribution of those procedures to the overall execution time are detailed in Table II.
B. Overall Architecture
The overall architecture used in our implementation in conjunction with Dynamic Binary Optimization is shown in Figure 3. Our implementation evaluates three models, namely the Dynamic Data View Array (DDVA), the Tagless D-Cache and Scratchpad Memory. All three models ensure that the original architectural storage locations do not change.
1) Dynamic Data View Array:
We introduce a new data structure called the Dynamic Data View Array (DDVA), which stores data in a linearly spatial manner. The DDVA caches frequently accessed data views. Addresses in the wrapped block of code are patched to point to the DDVA for future accesses. When the wrapper engine decides to allocate a DDVA for a wrapped code block, the index into the DDVA is stored in the wrapper state. The wrapped code block in Figure 4 generates data access addresses $A_{i1}, A_{i2}, \ldots, A_{iN}$, referred to as the algorithmic data view, which may have no spatial locality. A cache line fetched to service a cache miss for address $A_{ij}$ may incur the energy cost of a tag access and also the energy cost of wasted data bandwidth, since only a fraction of the data in the cache line may be accessed due to the lack of spatial locality. The DDVA in Figure 4 is a mapping of this algorithmic data view onto a linear array. This implementation models these dynamic data views using linear array accesses to enforce spatial locality in a data access pattern that otherwise has poor spatial locality. Data access instructions in the wrapped code are modified to access data from the DDVA and are emitted into the basic block code cache. Note that although the DDVA is mapped in the main memory address space, it can be cached through the levels of the cache hierarchy in a transparent manner.
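To make the address-to-slot mapping concrete, here is a minimal sketch of a DDVA-like structure. Everything in it is hypothetical: the fixed capacity, the type `struct ddva` and the functions `ddva_gather` and `ddva_scatter` are illustrative names, not part of HDTrans. The write-back step is an assumption about how copies might be kept consistent with the original storage locations, which the text says are left in place.

```c
#include <assert.h>
#include <stddef.h>

#define DDVA_CAPACITY 64            /* hypothetical fixed capacity */

/* Hypothetical DDVA: slot j holds a copy of the value stored at home[j]. */
struct ddva {
    int    value[DDVA_CAPACITY];    /* contiguous copies (the data view) */
    int   *home[DDVA_CAPACITY];     /* original addresses A_i1 .. A_iN   */
    size_t count;                   /* number of slots in use            */
};

/* Build the view: copy n scattered values into adjacent slots and
 * remember where each one came from. */
void ddva_gather(struct ddva *d, int *const addrs[], size_t n)
{
    assert(n <= DDVA_CAPACITY);
    for (size_t j = 0; j < n; j++) {
        d->home[j]  = addrs[j];
        d->value[j] = *addrs[j];
    }
    d->count = n;
}

/* Assumed write-back: push slot contents to the original locations,
 * which never move. */
void ddva_scatter(const struct ddva *d)
{
    for (size_t j = 0; j < d->count; j++)
        *d->home[j] = d->value[j];
}
```

In this sketch, the wrapped code would read and write `value[j]` between a gather and a scatter, so that the inner loop touches only adjacent memory.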
2) Tagless D-Cache:
Our implementation also models a Tagless D-Cache to serve as a source of spatially adjacent data. This implementation is loosely based on a similar approach targeting instruction fetch from a tagless I-cache, as detailed in [11]. That method sought to deal with spatially non-adjacent instruction references by replacing them with tagless I-cache references. A tagless D-cache design can therefore be expected to be similarly effective in dealing with non-spatially adjacent data references. Tagless cache design for data reduces cache tag comparison energy by exploiting the spatial and temporal locality of accesses. Data access locality at basic block granularity can be profiled, and frequently accessed basic blocks are aggregated into specially marked pages. Data in such pages can be accessed without tag comparison in the D-cache, thus reducing energy consumption. This D-cache approach for data accesses reduces the L1 cache access energy significantly.
The Tagless data cache (TDC) models these dynamic data views using linear array accesses to enforce spatial locality in an otherwise poor spatial data access pattern. The wrapper state is modified to include a pointer to index into the TDC as shown in Figure 5 . Data access instructions in the wrapped code are modified to access data from TDC and emitted into the basic block code cache. Architecturally, some banks of the cache can be flagged to be tagless. The cache controller then knows that the address mapping of the entire bank is guaranteed to contain a single address prefix (tag). The access time of such a bank is no different than the tagged cache access, but it consumes less energy. We maintain a virtual time counter to count the access time of these accesses based on a Cacti reported model.
3) Scratchpad Memory:
Scratchpad Memory (SPM) is a high-speed local memory store used for temporary storage and rapid retrieval of data. Scratchpads do not contain a copy of data stored in main memory and have non-uniform memory access latency.
In this implementation, Scratchpad Memory provides quick data access times and also reduces data access energy. Scratchpad sizes of up to 1 kB can be supported in this implementation for data access without overheads. Once again, a virtual counter maintains the access times for all scratchpad data view accesses based on a Cacti-derived model. Note that in reality these stores, the tagless cache and the scratchpad, are maintained in main memory within our DBO environment. However, the energy and time for these accesses are modeled as if they were implemented architecturally.
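The virtual counter idea can be sketched as follows. This is an illustrative model only: the type and function names are hypothetical, and the per-access costs are supplied by the caller (in practice they would come from a Cacti run, which is not reproduced here).

```c
#include <assert.h>
#include <stdint.h>

/* The three modeled data stores. */
enum store_kind { STORE_DDVA, STORE_TAGLESS, STORE_SPM, STORE_KINDS };

/* Hypothetical virtual counter: accumulates modeled access time while
 * the data itself physically lives in main memory. */
struct virtual_counter {
    uint64_t cost_per_access[STORE_KINDS]; /* caller-supplied time units */
    uint64_t accesses[STORE_KINDS];
    uint64_t total;                        /* accumulated modeled time   */
};

void vc_init(struct virtual_counter *vc, const uint64_t cost[STORE_KINDS])
{
    vc->total = 0;
    for (int s = 0; s < STORE_KINDS; s++) {
        vc->cost_per_access[s] = cost[s];
        vc->accesses[s] = 0;
    }
}

/* Charge one access to store s against the virtual clock. */
void vc_record(struct virtual_counter *vc, enum store_kind s)
{
    assert((unsigned)s < STORE_KINDS);
    vc->accesses[s] += 1;
    vc->total += vc->cost_per_access[s];
}
```

A wrapper would call `vc_record` once per modeled access, so the reported time reflects the architectural store being simulated rather than the main-memory access that actually took place.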
C. Data Shaping
Data Shaping involves replacing frequently accessed non-spatially adjacent data with data from a dynamically built spatial dataview, the DDVA, at runtime. This is achieved by identifying such non-spatial data access regions in the target procedure and placing special wrapped-region function calls around them to make the HDTrans system aware of the data access instructions to be replaced at runtime. A linear, spatially adjacent dataview, into which the non-spatially accessed data is copied, is defined between a set of wrapper-region function calls. Frequently, this wrapper region also incorporates the alternative logic that replaces the wrapped region logic. This logic takes the form of x86 instruction opcodes to be emitted at runtime, since the code is already compiled when this data access swapping occurs.
D. DBO Framework Integration
The HDTrans Dynamic Binary Optimization environment needs to be set up prior to invoking the wrapper and wrapped-region system calls. Once invoked, the HDTrans system remains active across multiple runs of the resident function. HDTrans provides APIs to copy the wrapped code and modify it according to the user's needs. HDTrans also provides dedicated control-transfer system calls, which transfer control either to the modified wrapped code or to the original unmodified wrapped code, whichever flow is more efficient. Special care needs to be taken when placing absolute jump instructions and system calls inside the wrapper code, because the jump offsets need to be relative to the current position inside the BBCache.
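The point about jump offsets can be illustrated with the IA-32 near jump `JMP rel32` (opcode `0xE9`), whose 32-bit displacement is measured from the end of the 5-byte instruction. The helper below is a hypothetical sketch, not an HDTrans API: it encodes such a jump for code that will execute at address `at`, so the same target needs a different displacement once the code is copied to a new position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_LEN 5u   /* opcode 0xE9 plus a 4-byte displacement */

/* Encode "jmp target" into buf, assuming the instruction will execute
 * at 32-bit address `at`. The displacement is relative to the address
 * of the next instruction and wraps modulo 2^32, as on IA-32. */
void emit_jmp_rel32(uint8_t *buf, uint32_t at, uint32_t target)
{
    assert(buf != NULL);
    uint32_t disp = target - (at + JMP_REL32_LEN);
    buf[0] = 0xE9;
    for (int i = 0; i < 4; i++)             /* little-endian encoding */
        buf[1 + i] = (uint8_t)(disp >> (8 * i));
}
```

Re-encoding a jump for its new location amounts to calling the helper again with the same `target` and the new value of `at`.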
E. Performance Evaluation Framework
This work also involved designing multiple performance modeling frameworks to enable accurate tracking and analysis of time and energy. Using these frameworks, average time and energy data were collected across multiple runs of the mcf and h264ref benchmarks for multiple sets of inputs. These frameworks are described in more detail in the following sections.
1) Execution Time Framework:
The execution time framework is based on the read timer system call [12] belonging to the PMU library. It internally uses the rdtsc primitive to get the running count of the number of clock cycles elapsed. The difference in the number of clock cycles was analyzed, both at a wrapped region granularity, and at an application-level granularity. This difference was used to compute execution time in terms of the number of seconds taken for a specific processor frequency.
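The cycle-to-seconds conversion described above can be sketched as follows. The function names are hypothetical and this is not the PMU library's interface; `__rdtsc()` is the compiler intrinsic for the `rdtsc` instruction in GCC and Clang, and it is guarded so the sketch still compiles on non-x86 targets. The processor frequency `hz` is assumed to be known and constant.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Read the time-stamp counter (x86 only). */
static inline uint64_t read_cycles(void)
{
    return __rdtsc();
}
#endif

/* Convert a cycle-count difference into seconds for a processor
 * running at hz cycles per second. */
double cycles_to_seconds(uint64_t start, uint64_t end, double hz)
{
    assert(hz > 0.0);
    return (double)(end - start) / hz;
}
```

On x86, a region would be timed by calling `read_cycles()` before and after it and passing both values to `cycles_to_seconds`; the same pair of readings can bracket a single wrapped region or the whole application.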
2) Energy Framework:
The energy framework is based on the Cacti [13] model of computing the energy consumed for accessing data referenced from different types of memory for specific cache attributes such as the Cache size, Block Size and Associativity. Cacti provides an accurate measure of the energy consumed for accessing the tag and data sections of a cache line. Our Data Shaper based implementation significantly reduces the need for tag comparisons. The energy needed for accessing such spatially adjacent data is effectively the same as that taken for accessing the data section of the cache line. The original benchmark, on the other hand, has much higher access energy due to significant contributions by both data and tag accesses.
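The accounting described above can be summarized in a small model. This is a sketch under stated assumptions: the struct and function names are hypothetical, and the per-access tag and data energies are inputs that would come from Cacti for a given cache size, block size and associativity. It treats every original access as paying both tag and data energy, and every shaped access as paying data energy only.

```c
#include <assert.h>

/* Per-access read energies for one cache configuration (caller-supplied,
 * for example in nanojoules). */
struct cache_energy {
    double tag;    /* energy of the tag lookup and comparison */
    double data;   /* energy of reading the data section      */
};

/* Original benchmark: each access pays for both tag and data. */
double tagged_energy(struct cache_energy e, unsigned long accesses)
{
    return (double)accesses * (e.tag + e.data);
}

/* Data-shaped benchmark: tag comparison is assumed to be eliminated. */
double shaped_energy(struct cache_energy e, unsigned long accesses)
{
    return (double)accesses * e.data;
}

/* Fraction of access energy saved per access under this model. */
double saving_fraction(struct cache_energy e)
{
    assert(e.tag + e.data > 0.0);
    return e.tag / (e.tag + e.data);
}
```

The model only covers the accesses that are actually redirected to the shaped store; accesses outside the wrapped regions are unaffected, so the saving over a whole application is smaller than the per-access fraction.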
V. RESULTS
The Performance Evaluation Framework was used to capture the execution time and energy statistics for both the original SPEC2006 benchmarks and those subjected to the Data Shaping process. The mcf and h264ref benchmarks meet the requirement of having significant non-spatially adjacent data access. The behavior of these benchmarks was studied extensively for varying inputs. The results observed are detailed in the following sections.
A. Execution Time
The execution times of the mcf and h264ref benchmarks before and after the data shaping process, for input datasets of different sizes, are shown in Figures 6 and 7 respectively. Data shaping was done using a DDVA-based data store. The execution time was obtained by computing the number of cycles elapsed using the read timer system call from the PMU library. An average reduction of 20% in execution time was observed.
Execution time across the three models, namely the default Dynamic Data View Array (DDVA), the Tagless D-Cache and Scratchpad Memory (SPM), was modeled for the h264ref benchmark for different input dataset sizes, as shown in Figure 8. This execution time was computed from Cacti for each read access to the various data stores and then modeled for the entire application.
B. Energy
The energy consumption of the mcf and h264ref benchmarks before and after the data shaping process, for input datasets of different sizes, is shown in Figures 9 and 10 respectively. Data shaping was done using a DDVA-based data store. The total dynamic read energy per access was obtained from Cacti and then used to model each benchmark's energy consumption. An average reduction of 5% in energy was observed.
Access energy across the three models, namely the default Dynamic Data View Array (DDVA), the Tagless D-Cache and Scratchpad Memory (SPM), was also modeled. The access energy for the h264ref benchmark for different input dataset sizes is shown in Figure 11. This overall energy was likewise computed by using the Cacti read access energy for the various data stores and then extending it to the entire application. The Tagless D-Cache model had the lowest energy consumption when compared to the DDVA and Scratchpad Memory. Elimination of tag comparisons and a lower latency in accessing data are the primary reasons for this reduced energy.
VI. CONCLUSION
This work has demonstrated the effectiveness of Data Shaping by utilizing data stores such as the DDVA, the Tagless D-Cache and Scratchpad Memory to cache the most frequently used non-spatially adjacent data accesses in a linearly adjacent manner. It has demonstrated significant reductions in execution time and data access energy in some commonly used SPEC2006 benchmarks. This implementation effectively eliminates the shortcomings of non-spatial data access by replacing such patterns in application hotspots with spatially adjacent data from the modeled data stores built at runtime. Execution time improvements of 20% and access energy improvements of 5% illustrate the efficiency of this approach over earlier work. This implementation would be very valuable in scenarios where runtime optimization is needed without adding any static overheads.
Future work could involve building a utility to dynamically identify regions of non-spatial access and temporal locality to serve as hotspots for optimization. The scalability of this data shaping process to newer spatially adjacent data stores proposed in the research literature could be studied. The positive contributions of Data Shaping towards improving other system parameters, such as memory bandwidth, could be explored. The effectiveness of the HDTrans framework in supporting performance enhancements in widely used applications, such as the Wikimedia suite, Amazon public data sets and various social networking applications, could also be studied.
