While SystemC models provide a promising solution to the complex problem of HW/SW co-design within the system-on-chip paradigm, they require detailed annotation of transaction-level energy and performance data within the model. While this data can be obtained through source code profiling of an application running on the target processor, doing so when the target CPU hardware is not available typically requires time-consuming CPU simulation, which is often too slow to be practical for large programs. Additionally, while SystemC modeling with the TLM 2.0 standard is widely adopted for SoC modeling, the process of transforming C/C++ code into SystemC code with TLM 2.0 functionality remains nontrivial. Herein we propose an automated framework that: 1. Enables high-speed code-specific CPU profiling support for both Sniper and gem5 using parallelized dynamic steady state phase convergence modeling, providing automatic annotation of energy and latency within source code. 2. Provides an automated C to SystemC TLM 2.0 code generation flow that utilizes the back-annotated source code to produce a SystemC module for seamless incorporation into the virtual prototype. Maximum speedups obtained using Sniper and gem5 are 105.78x and 562x respectively, while average speedups are 42.7x and 323.1x. Sniper results maintain an average error of 0.64% for latency and 0.10% for energy, while gem5 achieves average errors of 4.16% and 2.87% for latency and energy respectively.
INTRODUCTION
Determining an efficient partitioning between hardware and software generally requires accurate means of modeling the performance characteristics associated with running the customized software on the target platform [12]. While such modeling can occur at the RTL or even gate level, simulation at these levels is often too slow due to the complexity of the low-level design [1, 3]. For realistic applications, higher-level SystemC models of hardware provide a promising solution through the abstraction of low-level complexity while preserving the relative accuracy necessary to enable design decisions. However, running customized code on fully detailed CPU SystemC models incorporated into virtual prototypes (the high-level modeling stage of the SoC system before committing to a physical prototype) remains excessively time-consuming when considering large programs. A promising alternative involves the generation of application-specific SystemC modules which execute at host speed within a virtual prototype and require only the energy and performance information of code regions considered for HW/SW co-design.
The process of extracting performance data from a lower-level, accurate representation and returning this information for incorporation into the corresponding high-level model is known as back-annotation. Back-annotation enables a natively inaccurate higher-level model to inherit the effects of details available only at the lower level, thereby increasing the accuracy of the higher-level model while maintaining faster run times. Current means of extracting performance and energy data for back-annotation generally involve a detailed CPU simulation of the entire source code, which is often too slow to be practical for large programs.
ICCAD '16, November 07-10, 2016, Austin, TX, USA
We herein present dynamic phase convergence modeling in conjunction with parallelized region-of-interest (ROI) profiling as a technique for improving the efficiency of performance extraction for SystemC modules. While several techniques already exist for achieving significant speedup with regard to code profiling, they generally operate at a granularity that is inappropriate for the back-annotation process. Our technique specifically targets applications with iterative code constructs commonly considered as candidates for custom accelerators, which also account for a significant majority of simulation time. We further provide a C/C++ to SystemC framework whereby the automatically back-annotated source code is ported into usable SystemC modules with TLM 2.0 support. These modules can be integrated into virtual prototypes, providing an effective means of conducting early-stage hardware/software co-design. In summary, our work contains the following novel contributions:
1. An automated framework providing annotation of energy and performance within source code through code-specific CPU profiling support for both Sniper and gem5 using parallelized dynamic steady state phase convergence modeling, attaining average speedups of 42.7x and 323.1x respectively.
2. An automated SystemC TLM 2.0 code generation framework that generates SystemC from C code, enabling seamless incorporation of the software model into the SoC virtual platform.
BACKGROUND AND RELATED WORK

Software Modeling Strategies
We herein discuss several existing simulation strategies for the effective acquisition of energy and performance data.
Simpoint [5] utilizes advanced code analytics to achieve high-level phase modeling that enables accurate estimation of aggregate performance statistics of large programs. While useful in providing complete run times of large programs, the methodology is not well suited to determining performance results within fine-grained sub-regions, as required for meaningful SystemC back-annotation. Our target granularity is several orders of magnitude too small for Simpoint to apply; specifically, the minimum interval of a single Simpoint sample is generally larger than the entirety of the regions that we are attempting to classify.
Region-of-Interest (ROI) code profiling [6, 2] acquires performance data associated with only a target region of code while preserving correctness relative to the entire program. Detailed simulation or performance estimation of the entire program is often unnecessary, especially in instances of hardware/software co-design in which regions such as variable instantiation and assignment can be ignored because only the computational region of code is under consideration. Region-of-Interest code profiling relies on simulator capabilities for efficient fast-forwarding and cache warm-up, resulting in significant speedup only when large portions of the program do not require detailed simulation and can be skipped.
Sampling methodologies such as SMARTS [13] use statistical sampling over a provided range to characterize code regions. Within [10] the authors introduce a new hardware accelerator enabled method of quickly fast forwarding through non-meaningful regions of code. This drop in fast forwarding timing overhead enables the development of the pFSA sampling methodology, in which sequential code can be sampled in parallel at the expense of simulation redundancy, with the results aggregated into a single profile. However, results obtained remain sequential with respect to time while ultimate speedup is tightly bound by the degree of parallelism.
Our simulation methodology seeks to combine the techniques of pFSA simulation with dynamic phase modeling analysis through the use of parallelized non-adjacent region of interest profiling. Using active feedback, we seek to acquire simulation speedup through profiling only the selected region-of-interest required by our model to achieve relative phase convergence.
Simulation Tools
Within the academic community, Sniper [6] and gem5 [1] are widely adopted simulation tools. Sniper is a high-speed x86 simulator that provides modes of operation for fast-forwarding, cache-warming, and detailed profiling. Sniper is highly attractive due to its native support for iterative ROI simulation and incorporation of power characterization software (McPAT [7]). Gem5 offers cycle-accurate simulation for a wide range of ISAs and provides highly customizable hardware configurations. Gem5 also provides several simulation modes of operation, with a notable recent addition of hardware-enabled fast-forwarding [10]. Nevertheless, detailed profiling of large programs within both simulators remains a time-consuming process.
Transaction-level Modeling
Transaction-level modeling (TLM) [4] is a popular method to model system communication. In TLM, the communication blocks are decoupled from computation details, thereby providing a uniform interface for different components in the system and transparently addressing the coherency issues caused by concurrent communication. TLM abstraction models capture the high-level data transaction and associated latency information while neglecting lower-level implementation details, and hence enable high simulation speed with necessary accuracy for system performance modeling and architecture selection. In practice, TLM is usually implemented in SystemC. In system modeling, a TLM channel and communication block are generated for each component, in a manner that enables the component to communicate with other blocks while keeping the functionality transparent. To achieve full system integration within the context of such a SystemC model, any profiled CPU requires the generation of a TLM wrapper. However, by current standards, writing such a SystemC TLM wrapper that captures all possible CPU behavior remains non-trivial and requires tedious manual design effort.
Previous processor modeling frameworks primarily target the modeling of general purpose CPUs instead of targeted SoC applications. Since our goal is to design a processor modeling flow that can be used in an SoC platform, extending the software model with TLM wrappers becomes a necessity. Hence, within this work, we developed an automatic TLM block generator for the output of our software modeling, such that our back-annotated software models can be equipped with TLM wrappers, ready for direct integration into SystemC models used for hardware/software codesign.
METHOD
Our overall framework contains two stages. The first is software modeling: given C code as input, this stage estimates the energy and latency associated with the application running on a target platform and outputs back-annotated C code. This output serves as the input of the second stage, SystemC and TLM 2.0 code generation, which wraps the C code into a SystemC module that can be directly plugged into a SystemC system modeling environment such as a virtual prototype.
Software Modeling
The user provides C source code with directives specifying the regions to be profiled. The code is then parsed and analyzed, and each iterative region within the profiling area is further subdivided into multiple fine-grained regions of interest based on its iteration scheme. Non-iterative code blocks are arbitrarily assigned to a single region. Regions are independently isolated and prepared for simulation in a manner that depends upon the simulator used. The simulations of multiple ROI occur in parallel, with results passed through McPAT to enable power profiling. Simulation is managed using a dynamic simulation thread launcher, which operates based on feedback provided by an n-way convergence algorithm to dynamically determine the subset of ROI that must be run in order to accurately predict significant shifts in phase behavior. Upon achieving convergence, the simulation phase is complete and results are automatically back-annotated into the source code.
Simulator Dependent ROI division
As the input and control of each simulator is unique, we describe the process by which target regions are independently separated for both Sniper and gem5 environments.
Sniper: Sniper provides directives to specify target regions of interest within both purely sequential and iterative constructs. Utilizing these tools, we generate a control script that initializes the simulation in fast-forward mode, switches to cache-only mode at a designated code location or iteration number, enters a detailed simulation at the target region of interest, and terminates upon completing the profile of the region of interest. For each region of interest, a custom source file is generated and compiled in conjunction with specific simulator commands that govern the behavior of the ROI. This source code, together with control directives (in the form of a makefile), is sufficient to enable independent parallel simulation of the target ROI.
Gem5: In order to isolate regions of interest while utilizing the hardware enabled fast forwarding, it is necessary to translate target locations within high level code directly into instruction counts. However, obtaining these instruction counts without contaminating the original program can prove challenging. We accomplish this by translating ROI directives into in-line assembly labels. After statically compiling and disassembling the code, we search for the designated labels as a means of obtaining the instruction address associated with all target code positions without modification to the compiled code.
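The label search described above can be illustrated with a minimal Python sketch. It assumes the disassembly lists symbols in the form produced by objdump-style tools (an address followed by the label in angle brackets) and that ROI labels share a common prefix; the prefix `roi_`, the function name `find_roi_labels`, and the sample excerpt are all hypothetical and are not taken from our implementation.

```python
import re

# Matches objdump-style symbol lines such as "0000000000401136 <roi_begin_0>:".
SYMBOL_LINE = re.compile(r"^([0-9a-fA-F]+) <([^>]+)>:$")


def find_roi_labels(disassembly, prefix="roi_"):
    """Map each label starting with `prefix` to its instruction address."""
    labels = {}
    for line in disassembly.splitlines():
        match = SYMBOL_LINE.match(line.strip())
        if match and match.group(2).startswith(prefix):
            labels[match.group(2)] = int(match.group(1), 16)
    return labels


# Hypothetical disassembly excerpt; real output depends on compiler and flags.
SAMPLE = """
0000000000401106 <main>:
  401106: 55                    push   %rbp
0000000000401136 <roi_begin_0>:
  401136: 8b 45 fc              mov    -0x4(%rbp),%eax
0000000000401180 <roi_end_0>:
  401180: c9                    leave
"""

ROI_ADDRESSES = find_roi_labels(SAMPLE)
```

Because the labels occupy no instruction bytes, the addresses recovered this way refer to instructions already present in the compiled program.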
To translate between instruction addresses and instruction counts we utilize PIN [8] . We generate a custom pintool that dynamically outputs the instruction count whenever the instruction addresses associated with the designated labels are executed on the host system. Combined with the code structural information obtained from parsing the source code, this is sufficient to enable translation between the beginning and end of each ROI and their associated instruction counts.
As neither Sniper nor gem5 provides power data, power results are generated by McPAT. While exporting performance parameters from Sniper to McPAT is natively supported, no such support exists in gem5. Therefore, we utilize a script provided by R. Strong [11] to translate output statistics from gem5 to McPAT.
Phase Convergence
As a program executes, the dynamic state on the target platform can be categorized into phases. Significant phase changes can be caused by branch mis-predictions, misses within the various caches, or changes within the general control flow. When considering the effects of these phases upon performance characteristics with respect to a dependent marker of code progression, such as number of instructions or number of sequentially placed ROI, this results in a mathematically describable phase function. At fine granularities this plot can appear highly volatile and periodic in nature, but at coarser granularities the resulting graph is generally smoother as periodic or volatile events become averaged over the larger interval and the resulting plot typically resembles a piecewise function. So long as the sampling required to determine the function is less than the overall ROI space, a significant speedup can be attained by approximating intermediate regions.
N-way Convergence Sampling
N-way convergence modeling is the algorithmic process (utilizing N concurrent threads) whereby our predictive model approaches the true behavior of the system. To describe this process we first consider the simplest case of sampling using a single thread. Here we first evaluate the upper and lower bounds of the provided region and use a model to predict intermediary values. While convergence modeling is applicable to any arbitrary model, within the context of this paper we choose to consider a linear stepwise function. We then sample the midpoint of the region, compare the result to our predicted value, and update the model to reflect the data obtained through sampling the midpoint. If our model remains inaccurate beyond a given threshold, there are two regions of unknown values, one on each side of the midpoint, and the process repeats. Per iteration, the size of the target interval decreases by a factor of 2. Given that the number of ROI is finite, the algorithm will terminate either when the accuracy threshold (based on design tolerances) is reached or when the interval size is 1. Since we define both as convergence, the algorithm is guaranteed to terminate successfully.
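The single-thread case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `simulate` is a hypothetical stand-in for a detailed simulation of one ROI, the error is taken as relative to the sampled value, and the names `converge` and `_interpolate` are ours rather than part of the framework.

```python
from collections import deque


def _interpolate(samples, lb, ub, i):
    """Linear prediction for ROI i between two sampled bounds."""
    return samples[lb] + (samples[ub] - samples[lb]) * (i - lb) / (ub - lb)


def converge(simulate, lo, hi, tol):
    """Bisect [lo, hi] until a linear model predicts every sampled midpoint.

    Returns (model, samples): a value for every ROI index, and the subset
    of ROI that were actually simulated.
    """
    samples = {lo: simulate(lo), hi: simulate(hi)}
    pending = deque([(lo, hi)])
    while pending:
        lb, ub = pending.popleft()
        if ub - lb <= 1:
            continue  # interval size 1: no interior ROI left to sample
        mp = (lb + ub) // 2
        predicted = _interpolate(samples, lb, ub, mp)
        samples[mp] = simulate(mp)
        error = abs(predicted - samples[mp]) / max(abs(samples[mp]), 1e-12)
        if error > tol:
            # Model missed the midpoint: both halves remain unknown.
            pending.append((lb, mp))
            pending.append((mp, ub))
    # Fill every unsampled ROI from its two nearest sampled neighbours.
    points = sorted(samples)
    model = {hi: samples[hi]}
    for a, b in zip(points, points[1:]):
        for i in range(a, b):
            model[i] = _interpolate(samples, a, b, i)
    return model, samples
```

For a phase curve that is linear over the whole region, the sketch stops after three simulations (both bounds and one midpoint); sharp edges force further bisection only in the intervals that contain them.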
In extending the algorithm to n-way convergence, we introduce a constant n parallel threads all executing on the given initial region. While the classic divide-and-conquer approach might seem feasible, in practice this results in a significant waste of computation resources as some subregions converge much faster than other sub-regions, while other sub-regions require significantly more simulation time to obtain a single data point. To address this issue, we introduced the concept of asynchronous dependency thread scheduling, which is described in the next section.
Asynchronous Dependency Thread Scheduling
Parallel simulation requires the proper management of thread behavior on the host platform. Our initial implementation of n-way search convergence modeling employed the technique of synchronous batch scheduling, in which all threads were tasked with one region of interest per batch and proper ordering was maintained through synchronization points. Unfortunately, we found the load-balance variation could extend to several orders of magnitude, significantly limiting resource utilization and consequent speedup.
Our solution to the challenges associated with batch scheduling was the development of an asynchronous thread scheduler that tracked and maintained the dependencies inherent within our n-way convergence algorithm. By employing this algorithm, we ensure that thread updates to the convergence model must only wait upon threads with which they have a direct dependency. This is accomplished using a dynamic array of mutex semaphores that protect individualized access to the convergence models and scheduling queues associated with each iterative code block. It eliminates the need for global synchronization and instead allows for independent threads of various execution time to run concurrently without negative impact. Each iterative ROI block is assigned a dependency list that ensures that the next dependent instruction may, with no wait, update the model, receive a new assignment, and immediately begin execution of the new assignment.
The pseudo code is shown in Algorithm 1. Given error tolerance tol and the input set of all ROI, UnconvergedROI, the algorithm outputs a converged model containing approximated power and latency for each ROI. We first identify the block within the unconverged ROI that currently contains the least number of assigned threads (line 4). We then request the next available interval within that block, identified by the upper and lower bounds (ub, lb), and receive an assignment to evaluate the interval midpoint mp (line 5). We log our dependency chain to preserve order correctness (line 6) and execute simulation to acquire power and latency (line 7). Upon return from simulation, we verify that all points within our dependency chain have also completed (lines 8-10). We then determine model error (line 11), use the error to determine and mark convergence (lines 12-14), and calculate new approximate values of power and latency using the most recent simulation data, updating the model over the assigned interval (lines 15-17). The process repeats until no unconverged ROI remain.
As a further optimization, we recognize that while our asynchronous scheduling algorithm provides full resource utilization, it can also result in excess simulation in conditions where speculative assignments are provided to a region that converges soon thereafter. We address this by providing a means whereby model convergence can track and kill speculative simulation processes currently executing within the specified region. Special care is taken to ensure that while the simulation process is killed, the thread remains active and ready to receive a new assignment.
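The scheduling idea can be sketched in Python as follows. This is a simplified illustration, not a transcription of Algorithm 1: each block carries its own lock in place of the mutex array, an interval is queued only once both of its bounds have been simulated (so the explicit dependency wait is unnecessary), and the killing of speculative simulations is omitted. The `Block` class, its fields, and `simulate` are hypothetical names.

```python
import threading
import time
from collections import deque


class Block:
    """Convergence state of one iterative code block (hypothetical layout)."""

    def __init__(self, simulate, n_roi, tol):
        self.simulate = simulate      # stand-in for a detailed ROI simulation
        self.tol = tol
        self.lock = threading.Lock()  # protects samples, pending, and active
        self.samples = {0: simulate(0), n_roi - 1: simulate(n_roi - 1)}
        self.pending = deque([(0, n_roi - 1)])
        self.active = 0               # threads currently simulating this block

    def take(self):
        """Return the next interval with an interior ROI (lock must be held)."""
        while self.pending:
            lb, ub = self.pending.popleft()
            if ub - lb > 1:
                self.active += 1
                return lb, ub
        return None

    def commit(self, lb, ub, mp, value):
        """Store a midpoint result and split the interval if the model missed."""
        slope = (self.samples[ub] - self.samples[lb]) / (ub - lb)
        predicted = self.samples[lb] + slope * (mp - lb)
        self.samples[mp] = value
        self.active -= 1
        if abs(predicted - value) / max(abs(value), 1e-12) > self.tol:
            self.pending.extend([(lb, mp), (mp, ub)])


def worker(blocks):
    while True:
        task, busy = None, False
        # Visit the least-loaded block first (unlocked read, used only as a hint).
        for blk in sorted(blocks, key=lambda b: b.active):
            with blk.lock:
                interval = blk.take()
                busy = busy or blk.active > 0
            if interval is not None:
                task = (blk, interval)
                break
        if task is None:
            if not busy:
                return            # every block is drained and idle
            time.sleep(0.001)     # a running simulation may still add intervals
            continue
        blk, (lb, ub) = task
        mp = (lb + ub) // 2
        value = blk.simulate(mp)  # slow step, executed without holding any lock
        with blk.lock:
            blk.commit(lb, ub, mp, value)


def run(blocks, n_threads):
    threads = [threading.Thread(target=worker, args=(blocks,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

No barrier appears anywhere in the sketch: a thread that finishes a short simulation updates only the block it worked on and immediately picks up its next assignment, while longer simulations in other blocks continue undisturbed.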
Automated Back Annotation
Back annotation of performance data into the original source code is achieved through the automated generation of C++ header files. Results for each block of iterative ROI code are exported from our convergence model and assigned to designated arrays. Update functions are automatically incorporated into the original source code. The resulting source code when compiled will actively maintain current energy and latency of the system at ROI/iteration granularity. We note that in general, our flow considers energy and latency for the back-annotation process as both variables represent quantities that can be directly accumulated during the runtime of the target application. While power can be derived for any region through the division of energy by latency, power cannot be accumulated using direct summation and it is therefore preferable to consider energy instead. However, we note that within our flow, we intermediately consider power instead of energy in circumstances where normalization with respect to latency is useful. Specifically, we perform all phase curve analysis using normalized quantities, and, as a result, phase curve analysis is performed with respect to power. Prior to the back-annotation process, any units of power are converted into energy through multiplication with the corresponding latency. Nevertheless, at any time power and energy can be derived from one another, as the latency of the corresponding region is always known.
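The distinction between accumulating energy and accumulating power can be made concrete with a small Python sketch. The generated headers in our flow are C++; this sketch only mirrors the arithmetic, and the class name `AnnotationTracker` and its fields are hypothetical.

```python
class AnnotationTracker:
    """Accumulates back-annotated energy and latency per executed ROI."""

    def __init__(self, power_w, latency_s):
        # Convert per-ROI power to energy once, before any accumulation.
        self.energy_j = [p * t for p, t in zip(power_w, latency_s)]
        self.latency_s = list(latency_s)
        self.total_energy_j = 0.0
        self.total_latency_s = 0.0

    def update(self, roi):
        """Called each time ROI number `roi` executes."""
        self.total_energy_j += self.energy_j[roi]
        self.total_latency_s += self.latency_s[roi]

    def average_power_w(self):
        """Power is derived on demand; it is never summed directly."""
        return self.total_energy_j / self.total_latency_s
```

As a worked case, two ROI drawing 2 W for 1 s and 4 W for 3 s accumulate 14 J over 4 s, giving an average power of 3.5 W; neither the sum of the two powers (6 W) nor their unweighted mean (3 W) is correct.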
SystemC and TLM Generation
As described in Sec 1, SystemC together with TLM 2.0 is a widely adopted methodology for SoC modeling and virtual prototyping. However, manually translating C code into SystemC can be very tedious. Hence, to incorporate our software modeling output into the virtual platform automatically, we developed an automated SystemC and TLM 2.0 code generation framework, which takes C code as input and directly transforms it into a component module written in SystemC, equipped with TLM 2.0 communication channels and functions. In particular, it achieves three goals:
(1) Enable seamless incorporation of the software model into the virtual platform for effective system-level simulation and analysis.
(2) Consider both communication and computation cost. The latency and energy associated with computation are derived from the back-annotated software model, while the communication cost is obtained from the memory model and TLM channel.
(3) Maintain memory correctness in the SoC. In a real SoC, different components access memory and exchange data constantly. Therefore it is necessary to maintain memory consistency among all the components. As our original software model is pure C code with no system-wide communication, one major feature of our SystemC and TLM framework is to enable data movement between the component and the rest of the system.
The Hierarchy of the Generated Code
Figure 2 shows the hierarchy of the generated SystemC modules and simulation environment. We generate two components: the Processor module and the Memory module. The Processor is SystemC code embedded with the output of the software modeling flow. The Memory module is SystemC code used to model the memory behavior, annotated with the latency and energy associated with memory accesses. These two modules are connected via communication payloads. This environment can be extended to incorporate additional components to form a more comprehensive system. As the memory model is a separate (and standard) SystemC block used for simulation purposes, and as the main contribution of our framework is the automatic generation of the processor model, we herein explain only the processor model in detail.
In the computation part there are two major pieces. SC_MODULE() is a C++ class declaration which defines the interface and structure of the component. It generates and registers the communication socket with the transport function, declares the member variables and methods, and declares the thread function in the constructor and registers it with the SystemC scheduler kernel.
SC_THREAD computation_func() is the main function; once the simulation starts, this function continues running until sc_stop() is called. The function body contains three parts: a call to read_data(), the computation itself, and a call to writeData(). read_data() and writeData() are two communication functions, which are called before and after the computation to achieve atomic data movement. Note that when these two functions are called, the latency associated with the computation part is collected and passed to them as a parameter, which is used in updating and synchronizing with the SystemC global scheduler.
In the communication part, the framework creates two functions for data read and write, which are similar in structure. Considering writeData() as an example, it first generates the payload as a channel to pass the necessary TLM 2.0 communication parameters to the target socket; then it generates the transport function, which directly binds to the socket and passes the payload parameters to the memory for communication.
The Code Generation Flow
With the hierarchy of the generated code illustrated, it is easier to demonstrate the code generation framework flow. Figure 3 shows the block diagram. Given back-annotated C code as input, the first step in our flow utilizes compiler analysis to identify the function structures as well as the memory variables in the C code and transform them into C++ member functions and variables. Once this is completed, in the second step we generate the SC_MODULE using the information acquired in step 1. We then generate the TLM communication functions, which can be done independently of the original C code; these functions can be reused in different modules. The next step is to generate the body of the SC_THREAD. Here we insert the member methods transformed from the original C code, together with the required SystemC constructs such as the while(1) super loop and the sc_stop() call at the end of the function. The next step is to insert the communication function calls, which communicate with the memory module.
Finally, the memory model, top module, and main() function are generated to enable the complete simulation environment. This step can also be done once, and the generated files can be reused for different applications. In addition, we provide extensive script support, enabling the designer to perform all code generation steps as well as simulation with a single command.
EXPERIMENTS
Experimental results are collected on both Sniper and gem5. The primary purpose of our experiments is to demonstrate, through simulation, the compatibility of our flow with both simulators, the resulting speedup, and the limited loss of accuracy.
Experimental Setup
Benchmarks. We evaluate our described flow using benchmarks from the Polybench Suite [9]. From the available Polybench benchmarks, we have chosen to implement Atax, Correlation, Covariance, Gemver, Gemm, Jacobi-2d, and Lu. Dataset sizes for each benchmark were chosen to enable reasonable runtimes of the base simulations, such that results could be obtained within the time limitations of this paper. Sniper utilizes dataset sizes of 4000, 1000, 1000, 4000, 512, 2000, and 1024 for each benchmark respectively, while gem5 utilizes dataset sizes of 8000, 1000, 2000, 8000, 1024, 2000, and 1024 for each benchmark respectively.
Simulator Configuration. Sniper: We maintain the default simulation environment of native Sniper, which instantiates a Xeon (Gainestown) dual-core processor running at 2.66 GHz. Although we recognize that this processor configuration was not originally designed for SoC applications, it remains useful within our proof-of-concept verification. Sniper experiments are conducted by utilizing 4-way parallelism on a host system with an Intel i7-4770K processor and 16 GB RAM.
Gem5: In using gem5, we likewise configure several simulation parameters, such as cache sizes, to their simulator default settings. We specify the CPU frequency to be 3.4 GHz, which is consistent with the McPAT template we obtained. Cache and functional warming are fixed at 3 million instructions. As KVM fast-forwarding support is currently limited to AMD chipsets, gem5 experiments are conducted by utilizing 6-way parallelism on a host system with an AMD FX-6100 processor and 8 GB RAM.
Variations in ROI Granularity. In order to demonstrate the variations in speedup associated with parallel ROI simulation with convergence modeling, we implement each benchmark with multiple levels of ROI granularity. For simplicity, in benchmarks with multiple blocks, we chose to limit our exploration space to a single dimension, meaning that modifications to the number of ROI blocks are applied universally to all code blocks. An exception is one code block within the Gemver benchmark that is only a single nested for loop, for which the total number of instructions and the corresponding simulation time are insufficient to consider for fine-grained partitioning. For conciseness, results are provided for all benchmarks at ROI granularities of 1, 50, 100, and 500. While it would be ideal to provide results for all possible granularity divisions within the iteration space, this would prohibitively increase our experimental runtime by several orders of magnitude. Thus we have focused our experiments on variation of fine granularities, which is the target domain of both simulation parallelism and convergence modeling. In some cases we have included additional granularities to provide further insight into general simulation trends.
Guided ROI Granularity. Preliminary results indicated that ROI granularity has a dramatic effect on speedup and that, in general, increasing the ROI granularity results in increased speedup. However, in considering the results of Gemver as shown in Table 1 and Correlation as shown in Table 2, the increased granularity is beneficial only up to a point, after which further increasing the granularity degrades the initial speedup. An analysis of the resulting phase curves indicates that this slowdown can be primarily attributed to oscillations within the phase curve, which are graphically illustrated for Gemver within Fig 4. The effects of this oscillation can be removed by clustering the alternating regions into groups, resulting in a smooth curve that converges quickly. However, defining the ROI granularity corresponding to this clustering may be too abstract for the common user. Given differences in benchmarks, including variations in problem size and iterative structures, there is no one-size-fits-all granularity that is ideal across all benchmarks; nevertheless, our flow extracts enough information regarding the target application to produce an educated guess based on a minimum instruction threshold. We provide a simulation mode that automatically subdivides each iterative region into the maximum number of divisions while ensuring that each division maintains a minimum instruction count. This is made possible through the analysis of the dynamic host-profiling output, which provides a detailed map of ROI to instruction count correlations. This is particularly useful in instances where instruction counts between ROI can vary greatly, and results in a form of minimal load balancing. We have chosen a minimum sub-division instruction count of 4,000,000 instructions, which corresponds roughly to the point at which periodicity corresponding to system events becomes prevalent within the resulting phase curves.
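One way to realize such a guided subdivision is a greedy pass over the per-iteration instruction counts, closing a division as soon as it reaches the threshold. The Python sketch below is an assumption about how this could be done rather than a description of our implementation; `guided_divisions` and its argument names are hypothetical, and the instruction counts are presumed to come from the host-profiling output.

```python
def guided_divisions(instr_per_iter, min_instr=4_000_000):
    """Split iterations into the most contiguous divisions of >= min_instr.

    `instr_per_iter[i]` is the dynamic instruction count of iteration i.
    Returns half-open (start, end) iteration ranges covering every iteration.
    """
    divisions = []
    start, total = 0, 0
    for i, count in enumerate(instr_per_iter):
        total += count
        if total >= min_instr:
            # Closing as early as possible maximizes the number of divisions.
            divisions.append((start, i + 1))
            start, total = i + 1, 0
    if start < len(instr_per_iter):
        # Leftover iterations fall below the threshold: merge into the last
        # division, or form the only division if none was closed.
        if divisions:
            divisions[-1] = (divisions[-1][0], len(instr_per_iter))
        else:
            divisions.append((0, len(instr_per_iter)))
    return divisions
```

With ten iterations of 1,500,000 instructions each and the 4,000,000 threshold, the sketch yields three divisions covering iterations 0-2, 3-5, and 6-9, the last one absorbing the short tail.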
Variations in Warmup. Within Sniper, warmup of iterative code can be instantiated by specifying the iteration number corresponding to the point in code at which the processor transitions between the fast forward and warmup states. Given the size of the iterative constructs considered within the provided benchmarks, sufficient accuracy was achieved using a warmup phase consisting of a single iteration. As the number of instructions associated with a single iteration can vary across benchmarks, this implies that, within Sniper, the number of instructions associated with each warmup phase is benchmark dependent. Gem5 provides the ability to control warmup duration at the instruction level, enabling perfect consistency between benchmarks. Therefore, within the gem5 environment, we further consider the effect of warmup variation across multiple benchmarks. As combining variations in ROI granularity and warmup duration results in a multidimensional problem, we have considered expanding the warmup variation only in the instance of ROI granularity 500. For all other gem5 evaluations, default warmup duration is set at 3,000,000 instructions, which was decided on the basis of results obtained by [10] .
Results
Detailed results for our experiments can be seen in Table 1 for Sniper and Table 2 for gem5. Results show the max speedup as averaged across all benchmarks to be 42.7x for Sniper and 323.1x for gem5. Average latency error from Sniper is 0.64% while the average energy error from Sniper is 0.10%. Likewise, the average latency error from gem5 is 4.16% while the average energy error from gem5 is 2.87%.
Phase Curves. For illustration of the model output, we have provided phase curves associated with benchmarks simulated within the Sniper simulation flow. The phase curves corresponding to latency per ROI can be seen in Fig 4. In general, these phase curves demonstrate the benefits of linear step-wise convergence modeling, even though some benchmarks, namely Correlation and Covariance, contain nonlinear behavior. Sharp edges within the phase curve represent either changes in control flow or system events such as cache misses scattered throughout execution. We do, however, note the limitations of our methodology as captured within the region in Gemver between ROI numbers 500 and 1000. At this ROI granularity, the phase curve is periodic in nature, which is inherently difficult to capture using only linear step-wise convergence, limiting overall speedup at higher granularities as shown by Table 1.
Speedup. Results demonstrate that, in general, increasing the granularity of ROI partitioning results in a significant runtime speedup of the simulation profile. We note that the magnitude of the speedup obtained can vary across different benchmarks and simulators. Variation across benchmarks is due in part to the differing phase typologies, which directly affect the rate at which each benchmark converges to our model. Drastic changes in control flow or instruction composition result in sharp edges within the phase curve that require multiple iterations to resolve.
Within Sniper, the max speedup obtained was 105.78x, achieved using the benchmark Jacobi-2d at an ROI granularity of 100. While higher speedups may exist at granularities not tested, time constraints do not permit an exhaustive sweep of all possible granularity levels. The max speedup of Gemver, 48.76x, was achieved at a granularity of 50, while Atax achieved a similar max speedup of 44.62x at a granularity of 500. Gemm, Correlation, and Covariance achieved their max speedups of 50.69x, 14.17x, and 18.25x respectively at the guided granularities. In considering all benchmarks, it is significant to note that there is no consistent granularity at which optimal speedup is obtained. Nevertheless, our guided ROI division, by considering maximal granularity with a threshold minimum instruction count, effectively achieves speedup that is either optimal or near optimal across all benchmarks. Across the benchmarks considered with Sniper, the average max speedup achieved is 42.7x. Similar speedup trends were found within the results of the gem5 experiments. The overall maximum speedup of 562x was obtained from the Gemm benchmark using the guided granularity. We likewise note that within gem5 our guided ROI division was also able to achieve optimal or near optimal speedup across all benchmarks. The smallest maximum speedup, 30.26x, occurred in testing Jacobi-2d, with the overall average maximum speedup equal to 323.1x.
Table 2: Results obtained using the gem5 simulation platform in conjunction with our flow.
Speedup differences between simulators can be attributed in part to the magnitude of speedup provided by fast-forward and cache warming modes relative to the speed of detailed profiling. Sniper, although natively faster than gem5, does not offer the same speed advantage provided by gem5's hardware-accelerated fast forwarding. As a result, we see that the benefit of increased ROI granularity can decrease after a certain threshold resolution. To see why this occurs, we note that increasing the number of ROI partitions also increases the number of simulations required for reaching convergence. This in turn increases the number of times that prefix code (code preceding the ROI) must also be simulated. As the granularity of the ROI becomes finer, the time spent performing detailed simulation for each ROI decreases, and the ratio of simulation time spent executing this prefix code relative to the time spent profiling the ROI increases. Thus, increasing the granularity increases the cumulative overhead associated with partitioned profiling, which counteracts and eventually diminishes speedup.
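The growth of prefix overhead with granularity can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions: the prefix length is treated as fixed for every ROI, the region is divided into equally sized ROIs, and the simulation rates are invented placeholders rather than measured Sniper or gem5 figures.

```python
def prefix_overhead_ratio(prefix_instructions, region_instructions,
                          granularity, ff_rate, detailed_rate):
    """Ratio of prefix (fast-forward) time to detailed ROI time for one ROI.

    All arguments are hypothetical: rates are in instructions per second,
    and `granularity` is the number of equal ROIs the region is split into.
    """
    roi_instructions = region_instructions / granularity
    prefix_time = prefix_instructions / ff_rate
    detailed_time = roi_instructions / detailed_rate
    return prefix_time / detailed_time


if __name__ == "__main__":
    # Made-up workload: 1e9 prefix instructions, 1e9 region instructions,
    # fast-forward at 1e8 instr/s, detailed simulation at 1e5 instr/s.
    for g in (1, 10, 100, 1000):
        ratio = prefix_overhead_ratio(10**9, 10**9, g, 10**8, 10**5)
        print(f"granularity {g:5d}: prefix/detailed time ratio = {ratio:.3f}")
```

In this model the ratio scales linearly with granularity, so a slower fast-forward mode (a smaller `ff_rate`) reaches the point where prefix time dominates at a coarser granularity.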
Accuracy. The accuracy of both latency and energy results obtained from Sniper is shown in Table 1. Overall, the latency error ranges between 0.0% and 0.240%, with an average error of 0.10%. The energy error ranges between 0.0% and 1.369% with an average error of 0.64%. From Table 2, we see that within gem5, the overall latency error ranges between 0.22% and 8.75%, with an average error of 4.17%. The energy error ranges between 0.16% and 6.50% with an average error of 2.87%. In general, we note that the deviation of cumulative latency and energy relative to the baseline model is greater within the gem5 environment than within the Sniper simulation environment. We note, however, that the error reported within gem5 is consistent with the expected error from imperfect cache warming with a warmup phase duration of 3 million instructions, as reported by the pFSA model [10]. This error is directly observable in the reported error associated with the ROI granularity of 1. Specifically, for single kernel benchmarks such as Atax, Gemm, and Lu, in which no parallelism or convergence is exploited, the only functional difference in simulation between the base case and ROI granularity 1 is that the former utilizes a complete cache state, while the latter utilizes an imperfect cache state as determined by fast forwarding and warmup modes. Thus the error reported from ROI granularity of 1 directly corresponds with the error associated with the imperfect cache warming and warmup phases of gem5. The consistency of results between ROI granularity 1 and higher ROI granularity demonstrates that this error associated with imperfect cache warming remains the primary source of inaccuracy, even at higher granularities.
Warmup Variation. Results obtained from varying the duration of the warmup phase are shown in Table 3 . Results are provided with respect to the Correlation benchmark running on gem5 at an ROI granularity of 500. For each warmup configuration, we report the run-time associated with the convergence modeling process, the reported latency, and the reported power. Using the complete detailed simulation as our base case, we further report the speedup, latency error, and power error relative to the base.
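The derived quantities reported relative to the base case can be computed as in the sketch below. The definitions used here (speedup as the ratio of run-times, error as the absolute percentage deviation from the base value) are assumed to be the conventional ones, and the function names and sample numbers are hypothetical rather than values from Table 3.

```python
def speedup(base_runtime, runtime):
    """Speedup of the convergence-modeled run over the full detailed run."""
    if runtime <= 0:
        raise ValueError("runtime must be positive")
    return base_runtime / runtime


def percent_error(reported, base):
    """Absolute deviation of a reported value from the base value, in percent."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return abs(reported - base) / abs(base) * 100.0


if __name__ == "__main__":
    # Made-up measurements for a single warmup configuration.
    base = {"runtime_s": 1000.0, "latency": 100.0, "power": 2.00}
    run = {"runtime_s": 10.0, "latency": 104.0, "power": 1.95}
    print("speedup:", speedup(base["runtime_s"], run["runtime_s"]))
    print("latency error %:", percent_error(run["latency"], base["latency"]))
    print("power error %:", percent_error(run["power"], base["power"]))
```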
In general, increasing the duration of the warmup period decreases the overall speedup. This is to be expected, as cache profiled warmup modes are significantly slower than the corresponding fast forwarding mode. While increasing the warmup period does not decrease the amount of prefix code that must be simulated for each ROI, it modifies the distribution, causing less of the prefix code to be simulated in the fast forward mode and more of the prefix code to be simulated in the slower cache profiling mode. We further note that increasing the warmup duration decreases the error associated with both latency and power. This is also to be expected, as increasing warmup duration also increases the cache history, resulting in a more accurate cache profile at the beginning of detailed simulation. As cache behavior has a direct impact on CPU power and latency, accurate cache states result in decreased error of the overall profile.
CONCLUSION
We have herein developed and implemented a complete flow for application within the domain of HW/SW co-design. Beginning with source code, we identify and subdivide regions of highly iterative code into fine-grained regions of interest. Utilizing phase convergence modeling with fine granularity region-of-interest profiling, we achieve maximum speedups of 105.78x and 562x for Sniper and gem5 respectively, and average simulation speedups of 42.7x and 323.1x, with only minor losses of profile accuracy. We then automatically back-annotate performance data into our original executable code, which we then wrap using the TLM 2.0 framework. The result is a flow that efficiently converts host executed source code into TLM 2.0 compliant modules for direct incorporation into a virtual prototype design.
