Abstract-The scaling of integrated circuits into the nanometer regime has led to variations emerging as a primary design concern. Most efforts in the area of variation-tolerant design have focused on the physical, circuit, and logic levels of abstraction. However, inevitable increases in the magnitude of variations with scaling have elevated them to a design concern that must be addressed starting at the system level. We address the problem of analyzing the performance of system-on-chip (SoC) architectures in the presence of variations. A modern SoC is a complex ensemble of components that are organized into multiple voltage and frequency domains or islands. The impact of variations on the clock frequencies of individual SoC components may be analyzed using existing tools, such as circuit-level statistical timing analysis. However, the key challenge that needs to be addressed is how to translate these component-level clock frequency distributions into a system-level performance distribution. This task is particularly complex and challenging due to the interdependences between components' execution, indirect effects of shared resources, and interactions between multiple system-level execution paths. We argue that an accurate variation-aware performance analysis requires Monte Carlo-based repeated system execution. We describe a framework, variability emulation for SoC performance analysis (VESPA), that leverages emulation to significantly speed up the performance analysis without sacrificing the generality and accuracy achieved by Monte Carlo-based simulation. We further improve the efficiency of VESPA by utilizing correlated sampling to reduce the number of samples needed for Monte Carlo simulations. We demonstrate the utility of VESPA by applying it to design variation-tolerant architectures for three example SoCs.
Our experiments show performance improvements of ∼180× compared with state-of-the-art hardware-software cosimulation tools and also underscore the potential of VESPA to enable variation-aware design and exploration at the system level.
process, and have become more pronounced as devices scale into the nanometer regime and toward atomic dimensions [3]. Variation-aware design, the process of understanding the impact of variations on ICs and designing systems that are resilient to them, has therefore emerged as one of the most active research areas in circuits, architecture, and design automation.
Since manufacturing-induced variations are inherently a bottom-up phenomenon, most efforts on addressing them have understandably focused on the later stages of the design cycle, or equivalently the lower levels of design abstraction. These include mask-level techniques, such as resolution enhancement technologies and optical proximity correction, and circuit-level techniques, such as variation-aware transistor sizing [4], [5], variation-aware placement and routing [6], and statistical timing analysis and synthesis at the logic level [7]. Although these techniques have proved to be useful, inevitable increases in variations make it difficult to fully contain their effect at the later stages of the design cycle. Recognizing this, there have been several recent efforts on variation-aware design at the architectural and system levels. These efforts have shown great potential for effectively addressing variations with substantially lower design overhead and effort.
Developments in system-level specification, validation, and performance analysis techniques, including hardware (HW)/software (SW) cosimulation, modeling languages, and transaction-based modeling, have largely facilitated the advances in, and adoption of, system-level design. Similarly, a comprehensive approach to variation-aware design at the system level will require the development of methodologies to analyze the impact of variations at the system level.
Existing techniques, such as gate-level statistical timing analysis, allow a designer to compute the distributions of clock frequencies of system-on-chip (SoC) components. However, a modern SoC is a complex ensemble of components that are organized into multiple voltage and frequency islands. Moreover, these components interact with each other both directly and indirectly through contention for shared system resources. As a consequence, the clock frequency distributions of SoC components do not directly translate to a system-level performance distribution. For instance, the amount of time an application spends executing on a given core not only depends on the frequency of that particular core, but also has a strong dependence on the frequency of the bus and memory controller, which impact memory access time. A detailed analysis of various system-level factors, such as component interdependence, shared system resources, and multiple paths of execution, which make variation-aware system-level performance analysis challenging, is presented in Section III.
A commonly used paradigm for the variation-aware analysis at any level of abstraction is to repeatedly perform simulation or execution (typically in a Monte Carlo loop) while varying the performance/power characteristics of circuit components in each iteration. At the system level, this implies deriving statistical frequency and power distributions for SoC components, and iteratively performing system-level performance/power analysis. While this is the most accurate and general approach, and leverages existing simulation methodologies and tools, the need to run simulation in an iterative loop drastically reduces efficiency and scalability, limiting the possibilities for design space exploration.
A. Paper Overview and Contributions
In this paper, we address the problem of variation-aware system-level performance analysis, and specifically target the challenge of improving efficiency and scalability, while maintaining the generality and accuracy of the iterative simulation paradigm. Emulation is a widely used approach to drastically speed up system-level simulation, but it has not hitherto been applied to variation-aware performance analysis. A naive emulation strategy would require repeated synthesis of each of the SoC components at different frequencies, and the resulting overhead would considerably reduce the benefits attainable through emulation. We instead propose an emulation-based framework, variability emulation for SoC performance analysis (VESPA), that preserves the inherent efficiency of emulation while accurately capturing the impact of variations on performance at the system level. Fig. 1 shows a high-level overview of the proposed VESPA framework. VESPA comprises three main phases: 1) component variability characterization; 2) variation-aware emulation setup; and 3) Monte Carlo driven emulation. During the component variability characterization phase, we synthesize the different SoC components using a commercial design flow and use the Synopsys Primetime VX statistical timing analysis tool to obtain the frequency distribution of each component under variations. In the next phase, the VESPA framework instruments the SoC with the HW and SW components needed to embed the entire Monte Carlo analysis loop on the field-programmable gate array (FPGA) platform, thereby removing the need for multiple FPGA synthesis and reprogramming steps. Finally, in the last phase, the emulation platform carries out the required Monte Carlo analysis and utilizes an intelligent correlated sampling scheme to efficiently generate the application performance distribution with the desired level of accuracy.
The significant contributions of this paper are as follows.
1) We study the challenges involved in system-level performance analysis when the clock frequencies of components are statistical distributions rather than deterministic values. We demonstrate that translating component-level performance characteristics into a system-level performance distribution is a complex and challenging problem due to the interdependences between components' execution, indirect effects of shared resources, and interactions between multiple system-level execution paths. Our analysis establishes that accurate variation-aware system-level performance analysis requires repeated system execution, which is prohibitively slow when based on simulation.
2) We describe the VESPA framework that enhances and applies emulation to the problem of variation-aware SoC performance analysis. A key attribute of the proposed framework is that the mechanisms for the adaptation of component frequencies and the control loop for iterative (Monte Carlo) analysis are embedded within the emulation platform, eliminating the need for resynthesis of the design or FPGA reconfiguration within the iterative loop. The inherent speed of emulation is therefore fully exploited for variation-aware performance analysis.
3) The VESPA framework employs an intelligent variance reduction scheme, namely, correlated sampling, to help reduce the number of Monte Carlo simulations needed for a given target error.
4) We apply the VESPA framework to three example SoCs (an 802.11 media access control (MAC) processor, a JPEG decoder, and an MPEG encoder) and utilize it to design variation-aware architectures based on multiple frequency islands. For these SoCs, VESPA provides two orders of magnitude speedup compared with iterative HW/SW cosimulation, and four orders of magnitude speedup compared with register-transfer level (RTL) simulation. Our results also demonstrate the utility of the VESPA framework in driving variation-aware design at the system level.
The rest of this paper is organized as follows. Section II summarizes the prior work on variation-aware design at the system level. Section III describes the challenges involved in variation-aware system-level performance analysis by analyzing the complexity of translating component-level variation characteristics into system-level performance. Section IV describes the proposed VESPA framework. Section V presents the application of the framework to the example SoCs, compares the speed of the proposed framework to HW/SW cosimulation and RTL simulation, and shows how the proposed framework can be applied to drive variation-aware architectural exploration.
II. RELATED WORK
A large body of work over the last decade has analyzed and addressed the impact of variations on ICs at various levels of abstraction. We describe here some representative efforts that focus on the earlier stages of the design cycle.
In the context of SoCs, several efforts have demonstrated the strong potential of addressing variations at the system level. Variation-tolerant design techniques for SoC components, including processors, memories, and on-chip buses, have been proposed. The impact of variations on the architecture of microprocessors was discussed in [8], triggering a wave of research on variation-aware microarchitecture. A variation-aware adaptive architecture and management policy for on-chip caches was presented in [9]. An optimal pipelining strategy for reducing microprocessor power under variations was explored in [10]. Dynamically changing the pipeline width and speed by shutting down slower execution lanes in superscalar processors was proposed in [11]. Finally, Ndai et al. [12] proposed utilizing multiple cycles for rarely exercised critical paths to help improve microprocessor yield under variations.
Modern SoCs are frequently divided into domains or islands that operate at different clock frequencies, supply voltages, or body bias voltages, in order to obtain the best performance and power under varying workloads or usage scenarios [13], [14]. Several research efforts have exploited the flexibility offered by the multi-island design style to mitigate the effects of variations [15]-[18]. Multi-island designs allow each island in the SoC to operate at its maximum potential frequency under variations and thereby can significantly improve performance over a single island design, wherein the entire SoC is forced to operate at the frequency of its slowest component.
Design techniques at any level of abstraction require supporting analysis techniques. At the circuit level, statistical techniques have been developed that analyze the impact of variations on the clock frequency [7] and power consumption [19] of system components. Recent efforts have also looked at utilizing HW accelerators, such as FPGAs [20] and GPUs [21], to help accelerate these statistical techniques. Our use of FPGA platforms is very different in that we do not use them as HW accelerators for timing analysis but instead use them to emulate the SoC in a cycle-accurate manner. Moreover, our approach focuses on a completely different layer of abstraction, namely, the system level.
In the context of variation-aware system-level design, a limited body of work has explored incorporating the impact of variations into system-level performance and power analysis. An analytical framework for the performance analysis of multi-island systems under variations was presented in [22] and [23]. Given component clock frequency distributions and a component graph representing the interconnection of components, bounds on the system throughput and latency were derived. This was the first work to address variation-aware performance analysis at the system level. Techniques to perform variation-aware power analysis at the system level were presented in [24]. A simulation-based power estimation framework was enhanced to produce power traces of SoC components, which were repeatedly analyzed in a Monte Carlo loop to generate a system-level power distribution due to variations in leakage power of the components. An overview of various mechanisms to sense variations and expose them to different layers of the system stack so as to enable systems to opportunistically adapt to variations was presented in [25]. VAREMU [26] extends a virtual machine monitor-based framework to emulate variations in power consumption and fault characteristics. Other studies [27], [28] have also looked at accelerating reliability analysis by utilizing emulation-based fault-injection strategies.
This paper addresses the problem of variation-aware performance analysis, for which the only known prior work takes an analytical approach [22], [23]. These analytical approaches, however, assume a very restrictive system model in which components do not share any system resources (e.g., memory) and communicate with each other through direct point-to-point interfaces. Our example SoCs, as well as most practical SoCs, do not fit this model. Moreover, the model also fails to account for several common effects, such as data-dependent execution times, complex synchronization schemes, and increased communication latency due to cross-frequency-island interfaces. These models thus tend to be fairly inaccurate for performance estimation, even in the absence of variations. We note that the most widespread approaches to system-level performance estimation in practice (regardless of variations) are based on simulation. This is because simulation-based approaches are very general (no limiting assumptions are made regarding the system), offer the flexibility to trade off accuracy for efficiency (through the level of detail at which the model is specified), and are the most accurate in accounting for complex effects, such as shared system resources, contention, data-dependent execution times, and timing-dependent changes in execution paths. To the best of our knowledge, ours is the first proposal to apply emulation to the problem of system-level variation-aware performance analysis. The key benefit of our approach is that we significantly improve the speed of variation-aware performance analysis, without compromising either generality or accuracy.
III. COMPLEXITY OF SYSTEM-LEVEL VARIATION ANALYSIS
In this section, we illustrate the challenges involved in analyzing the impact of process variations on system-level performance using two example SoCs. The first SoC that we consider implements the 802.11b MAC protocol; its block diagram is shown in Fig. 2. The system takes incoming packets from an on-chip RAM (packet buffer) and performs a cyclic redundancy check (CRC) on each packet. The packet data are then encrypted using the wired equivalent privacy (WEP) encryption scheme, another CRC computation is performed on the encrypted packet data, and the packet is stored back to the packet buffer. We implemented both the WEP and CRC components as HW accelerators. The components are connected to each other using the Avalon interconnect fabric [29].
Since we are interested in designing variation-tolerant SoCs, we utilize the multi-island design style for the SoC: the system is partitioned into three islands that communicate through first-in-first-out (FIFO)-based cross-domain interfaces [14]. For SoCs that contain a single frequency island, we note that the performance analysis problem degenerates to simply determining the frequency distribution, which can be performed using conventional gate-level statistical static timing analysis (SSTA) tools without any system-level analysis. However, such a design would incur significant performance degradation or yield loss due to variations. For instance, in the context of an MPEG encoder SoC, Marculescu and Garg [22] showed significant improvements in yield by adopting a multiple frequency island design methodology. Furthermore, most modern SoCs already comprise multiple voltage-frequency islands in order to achieve a good balance between performance and power across varying workloads.
As a first step in understanding the effect of process variations on the above system, we find the delay (clock frequency) distribution of each of its four components. We synthesize each component using the IBM 45-nm technology library and use commercial timing analysis tools (Primetime VX [30]) to perform gate-level statistical timing analysis. Fig. 3 shows a box-whisker plot of the delay of the various components; the delay distributions differ considerably from one component to another. The delay distribution at the component level is mostly a function of its logic depth and the number of critical paths [31]. When a component has a large number of critical paths, calculating the statistical maximum of the individual path distributions tends to increase the mean and decrease the standard deviation of the component clock distribution. Similarly, higher logic depth contributes to decreasing the σ/μ ratio of the clock period distribution [31]. In general, different components in an SoC tend to have different circuit-level characteristics, such as logic depth or number of critical paths, and thus, we might find a large diversity in their delay or frequency profiles under variations.
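The effect of the statistical maximum on the component delay distribution can be illustrated with a small numerical experiment. The following is only a sketch under simplifying assumptions that are not taken from the paper: path delays are modeled as independent, identically distributed Gaussians (real paths are correlated), and the function name and parameter values are hypothetical.

```python
import random
import statistics

def component_delay_stats(num_paths, path_mean=1.0, path_sigma=0.05,
                          trials=4000, seed=0):
    """Return (mean, stdev) of the component delay, modeled as the maximum
    over num_paths independent Gaussian path delays (an assumption)."""
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(path_mean, path_sigma) for _ in range(num_paths))
        for _ in range(trials)
    ]
    return statistics.mean(maxima), statistics.stdev(maxima)

# More critical paths: the mean delay grows while the spread shrinks.
for n in (1, 10, 100):
    mean, sigma = component_delay_stats(n)
    print(f"{n:4d} critical paths: mean={mean:.3f}  sigma={sigma:.4f}")
```

Under these assumptions, the mean of the maximum rises and its standard deviation falls as the number of critical paths grows, which is the trend described above.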
Given the delay distribution profile of each component, we also wish to evaluate how sensitive a system's performance is to variations in its components' frequencies. We measured the system performance of the 802.11 MAC system under varying frequencies of two of its components, namely, WEP and CRC. The results shown in Fig. 4 demonstrate that the system performance is in general much more sensitive to variations in WEP frequency than those in the CRC block. This is due to the system critical path being dominated by the WEP component in most parts of the frequency space. However, there exists a region in this space (highlighted by a rectangle in Fig. 4 ), where the system is more sensitive to the CRC component than the WEP. In this region, the CRC component starts to dominate the system critical path. Therefore, variations in its frequency have a much larger impact on system performance.
To further quantify the sensitivity of system performance to component frequency variations, we consider two different instances of the MAC system with nominal frequencies corresponding to points A and B in Fig. 4 and measure the throughput for 95% yield with varying σ/μ of component frequencies. The nominal frequency of the CPU and the packet buffer components at both operating points A and B is fixed at 400 MHz. Fig. 5 shows the normalized throughput of the entire SoC versus increasing variations in the frequency of each of its components around the two operating points A and B. As shown in Fig. 5, the system operating at point A is most sensitive to variations in the WEP frequency, whereas the system at point B is most sensitive to the CRC block. In general, different components in an SoC will have different sensitivities, and even these sensitivities are a function of the operating frequencies of the other components of the system.
In general, calculating the sensitivity of system performance to component variations is a very complex problem due to several factors.
1) Component Interdependence: System performance sensitivity to a component's frequency is determined by the percentage of time that the component spends in the system critical path, which in turn depends on the component's interaction (synchronization and communication) with other SoC components.
2) Shared Resources: Resources such as a system bus that are shared by all components result in contention, which leads to variable latencies. These contention profiles, and thus the latencies, change in the presence of component variations.
3) Multiple Paths of Execution: In systems that have multiple paths of execution, variations in a component's frequency can cause a given path to speed up or slow down. As a result, the component may no longer even be a part of the system's critical path, causing discontinuities in the system's sensitivity to that component.
A. Quantifying the Factors Impacting Variation-Aware Performance Analysis
In this section, we quantify the aforementioned intricacies that complicate variation-aware performance analysis with the help of a JPEG decoder SoC. The system, shown in Fig. 6, performs the necessary operations involved in JPEG decoding, using a dual-core architecture. Both cores in the above SoC are designed to operate at a nominal frequency of 400 MHz, and the required tasks are divided between them such that their execution times are balanced. The JPEG decoder has been pipelined at the granularity of macroblocks of a JPEG image. The first core fetches a macroblock from memory, performs variable length decoding (VLD), dequantization (DQ), and inverse discrete cosine transform (IDCT) operations on it, and stores the result into a shared image buffer. The second core then fetches the pixel data from the image buffer and performs chroma upsampling and color conversion to produce an RGB representation of the image, which is then appropriately reordered and written back to memory.
1) Component Interdependence:
In order to illustrate the impact of component interdependence, we measure the impact of changing the frequency of the image buffer on the number of cycles spent by the JPEG application on CPU Core 2. As shown in Fig. 7 , with increasing the frequency of the image buffer, the number of cycles spent by the application on Core 2 decreases due to the reduced latency associated with memory accesses.
2) Shared Resources: In the JPEG SoC shown in Fig. 6 , the two CPU cores share the interconnect fabric and the underlying memory subsystem, leading to contention for these shared resources. To explore the impact of this contention, we vary the frequency of Core 2 and measure the total number of cycles spent by the application on Core 1. In the absence of contention, changing the frequency of Core 2 should not have any impact on Core 1. However, as shown in Fig. 8 , changing the frequency of Core 2 produces a nonnegligible impact on Core 1's performance. Decreasing the frequency of Core 2 below its nominal operating point decreases the number of memory accesses issued by it per unit time. This results in decreased contention for the memory and interconnect resources from Core 1's perspective, leading to a decrease in the number of cycles spent by the application on Core 1 (increased performance). Increasing Core 2's frequency beyond the nominal operating point causes the number of cycles spent on Core 1 to increase (albeit at a much slower rate than in the previous case). Upon increasing the frequency of Core 2, the number of memory accesses issued by it per unit time stays constant, as Core 2 can only proceed once Core 1 has finished its computation on the current macroblock. As a result, at higher frequencies, Core 2's memory accesses get clustered more and more, leading to some periods where Core 1 experiences high contention and other periods where it encounters virtually zero contention. These asymmetric contention patterns have a net impact of a very small increase in the number of cycles spent by the application on Core 1 (decreased performance).
3) Multiple Paths of Execution: In the JPEG decoder system, there exist two main paths of execution, one comprising the operations performed on Core 1 (VLD, DQ, and IDCT) and the other consisting of the operations performed on Core 2 (upsampling and color conversion). The net system performance of the JPEG application is determined by the larger of the execution times spent by the application on the two cores. This dependence can lead to abrupt discontinuities in the sensitivity of the JPEG SoC's overall performance to changes in the frequencies of the two cores. To illustrate this phenomenon, we plot the normalized system throughput (size of the JPEG image decoded per second) of the JPEG decoder application with varying frequencies of Cores 1 and 2 (with the other core operating at a fixed nominal frequency). As shown in Fig. 9, the sensitivity of system performance to the frequencies of Cores 1 and 2 has a large discontinuity around the nominal operating point. When Core 1's frequency is decreased below the nominal operating point, system performance is strongly dependent on it. However, increasing Core 1's frequency above the nominal point has very little impact on system performance because, in this frequency range, overall system performance is mostly dependent on Core 2's operating frequency and Core 1 is no longer a part of the system critical path (a small dependence, however, still exists due to the aforementioned impact of shared resources). A similar line of reasoning can be applied to explain the dependence of overall system performance on variations in Core 2's frequency.
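The discontinuity in sensitivity follows from the fact that throughput is limited by the slower of the two execution paths. The sketch below is a deliberately idealized model and not the measured behavior in Fig. 9: it assumes perfectly balanced per-macroblock cycle counts, ignores shared-resource contention, and uses hypothetical names and values.

```python
def throughput(f_core1, f_core2, cycles1=1.0e6, cycles2=1.0e6):
    """Macroblocks per second for a two-stage pipeline whose rate is set
    by its slower stage. Cycle counts are hypothetical and contention
    between the cores is ignored (both are assumptions)."""
    t1 = cycles1 / f_core1  # time per macroblock on Core 1
    t2 = cycles2 / f_core2  # time per macroblock on Core 2
    return 1.0 / max(t1, t2)

nominal = 400e6  # nominal core frequency of 400 MHz, as stated in the text
base = throughput(nominal, nominal)

# Sweep Core 1 around the nominal point while Core 2 stays at nominal.
for scale in (0.8, 0.9, 1.0, 1.1, 1.2):
    rel = throughput(scale * nominal, nominal) / base
    print(f"Core 1 at {scale:.1f}x nominal -> normalized throughput {rel:.2f}")
```

In this idealized model, slowing Core 1 reduces throughput proportionally, while speeding it up has no effect because Core 2 becomes the bottleneck. The real system additionally shows the small residual dependence caused by shared resources.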
The above discussion clearly illustrates that translating component-level delay distributions into a system-level performance distribution is indeed a challenging and complex task. One should also note that, in general, there is almost no correlation between the delay profiles and the sensitivity of system performance to variations in a component's frequency, as delay is mostly a circuit characteristic, whereas sensitivity depends mostly on the percentage of time that a component spends in the system's critical path. Any accurate and general system-level performance analysis technique must therefore consider the aforementioned intricacies. As noted earlier, analytical techniques [22], [23], while relatively fast, fail to account for these intricacies and can therefore lead to inaccurate estimates of the system's performance distribution. The only approach that is capable of capturing these effects is to perform repeated Monte Carlo simulations of the system for a sufficiently large number of samples, so as to have a high level of confidence in the resulting performance distribution. Using sampling theory [32], we estimate that the MAC system discussed above requires 481 samples for a confidence level of 99% on the estimated mean value with a margin of error less than 1%. Performing such a large number of system-level simulations would require a prohibitively large runtime and is not practical.
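A sample-count estimate of this kind can be sketched with the standard normal-approximation formula for the mean, n = (z σ / (E μ))², where z is the two-sided normal quantile for the chosen confidence level and E is the relative margin of error. The paper does not state which formula or which σ/μ it used, so the function below and its inputs are assumptions for illustration only.

```python
import math
from statistics import NormalDist

def required_samples(cv, margin, confidence):
    """Samples needed so the estimated mean lies within +/- margin
    (relative) of the true mean at the given confidence level.
    cv is sigma/mu of the quantity being estimated. Uses the normal
    approximation; this is an assumed formula, not the paper's."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    return math.ceil((z * cv / margin) ** 2)

# Hypothetical input: performance with sigma/mu = 10%, 1% margin of error.
print(required_samples(cv=0.10, margin=0.01, confidence=0.95))
print(required_samples(cv=0.10, margin=0.01, confidence=0.99))
```

The required count grows quadratically with σ/μ and with the inverse of the margin, which is why reducing the variance of the estimator, as correlated sampling aims to do, directly reduces the number of runs.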
Emulation is a well-known technique for improving the runtime of simulation. In general, it provides order-of-magnitude improvements in performance over system simulation at various levels of abstraction. However, using standard emulation for variation-aware analysis poses a number of challenges. Variation-aware analysis requires emulating the system with components operating at multiple sample frequencies. A naive emulation strategy would require repeated synthesis of each of the SoC components at the different sample frequencies selected during the Monte Carlo sampling process. The synthesized SoC would then be programmed onto the emulation platform and executed to capture the desired system performance. However, the considerable overheads involved in repeated synthesis and programming would significantly erode the benefits obtainable through emulation. In contrast, the VESPA framework performs these operations exactly once. This is accomplished by utilizing reconfigurable PLLs whose frequencies can be changed on the fly using an SW control routine. This process is discussed in detail in Section IV.
IV. VESPA FRAMEWORK
In this section, we present an overview of the proposed variation-aware emulation framework. The framework takes as its input the SoC architecture, the application SW that executes on the SoC, RTL component models, and variation-aware cell libraries, and outputs the system performance distribution. The flow, as shown in Fig. 10, can be divided into three distinct phases. The component variability characterization phase generates the delay distribution for each component in the SoC. The variation-aware emulation setup phase instruments the SoC with the HW components and SW routines essential for performing variation-aware performance analysis. The Monte Carlo driven emulation phase consists of the steps that the runtime control SW executes so as to obtain the desired performance distribution. We now provide a detailed description of each of these phases.
A. Component Variability Characterization
In this phase, we first synthesize different components of the system from their RTL models using commercial logic synthesis tools. A gate-level variation-aware technology library is fed to an SSTA engine along with the synthesized component netlist to compute the frequency distribution for each component of the SoC, considering the structural correlation that may exist between gates and paths within each component. At the end of this phase, we obtain the component-level delay distributions for each component in the SoC.
B. Variation-Aware Emulation Setup
In this phase, we perform the design-time operations required for setting up our emulation framework. The SoC architecture specifies how the various SoC components are to be interconnected with each other. Based on this model as well as designer input, the SoC is partitioned into multiple frequency islands. We then insert FIFOs and other asynchronous interface logic, so that the different frequency islands can communicate correctly and efficiently with each other. In order to perform the variation-aware analysis efficiently, we need to change the frequency of each island without performing repeated synthesis. This is accomplished by utilizing reconfigurable PLLs whose frequencies can be controlled at runtime by programming them from SW executing on a microprocessor.
The frequency distribution obtained from the first phase now needs to be downscaled by an appropriate factor before the SoC can be synthesized to the emulation platform. Choosing the smallest possible downscaling factor is desirable, as it maximally exploits the speedup potential of the emulation platform. The Monte Carlo analysis may require that each component in the SoC be configured to a wide range of frequencies, from μ − 3σ to μ + 3σ. Computing the downscaling factor based on the worst case μ + 3σ operating point would considerably reduce the achievable speedup from emulation, as most samples lie closer to the mean of the distribution than to the 3σ point. Thus, during synthesis, we choose a downscaling factor that optimizes for the average case (the mean of the component frequency distribution) instead of the worst case. Each frequency island in the SoC is then synthesized with the above computed downscaling factors, using a commercial FPGA synthesis flow, and the bitstream for the SoC is then downloaded onto the target emulation platform. The maximum frequencies at which each island can operate are utilized by the SW control flow to dynamically determine the optimal downscaling factor for each Monte Carlo sample.
On the SW front, the given SoC application program is encapsulated within an SW control routine and embedded with the component-level frequency distribution data obtained from the component variability characterization phase. The instrumented application code is then cross compiled, and the resulting SW image is downloaded for execution onto the emulation platform.
C. Monte Carlo Driven Emulation
In the Monte Carlo driven emulation phase, the SW control routine first samples the component distributions to obtain their operating frequencies for the current emulation run. It then needs to compute a sample-specific downscaling factor for the entire SoC so as to ensure correct operation. As shown in (1), the $i$th island's minimum downscaling factor $DF_i$ is determined by taking the ratio of its sampled frequency $SF_i$ to its maximum operating frequency on the emulation platform $EF_i$. The downscaling factor for the entire SoC, $DF_{SoC}$, is then computed by taking the maximum of all the individual island downscaling factors

$$DF_i = \frac{SF_i}{EF_i}, \qquad DF_{SoC} = \max_{i} DF_i. \tag{1}$$
The SW control loop then reconfigures each island's PLL to the downscaled frequency determined from the previous step. The SoC application is then executed and its performance is measured. This process is repeated for a fixed number of samples, or until a given level of confidence is achieved in the generated distribution. Once the SW control loop terminates, the desired performance distribution is obtained. In order to further increase the performance of the Monte Carlo driven emulation phase, we improve upon naive Monte Carlo-based simulation by utilizing variance reduction techniques [33] , [34] . These techniques incorporate various domain specific insights to help reduce the number of samples needed in a Monte Carlo simulation. In Sections IV-C1 and IV-C2, we first give an overview of the proposed variance reduction scheme, namely, correlated sampling using control variables. We then describe in detail our choice of control variable for estimating SoC performance as a function of component-level frequencies.
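The sampling and downscaling steps of this loop can be sketched in a few lines of Python. This is a minimal illustration, not the VESPA control routine itself: the function names, the Gaussian shape of the component distributions, and the example frequencies are all assumptions made for the sketch, and the PLL reconfiguration and on-board measurement are omitted.

```python
import random

def soc_downscaling_factor(sampled_mhz, emu_max_mhz):
    """Return DF_SoC: the largest per-island ratio SF_i / EF_i."""
    return max(sf / ef for sf, ef in zip(sampled_mhz, emu_max_mhz))

def run_monte_carlo(islands, emu_max_mhz, num_samples, seed=0):
    """islands: list of (mean_mhz, sigma_mhz) pairs, one per frequency island."""
    rng = random.Random(seed)
    runs = []
    for _ in range(num_samples):
        # Assumption: Gaussian component frequency distributions.
        sampled = [rng.gauss(mu, sigma) for mu, sigma in islands]
        df_soc = soc_downscaling_factor(sampled, emu_max_mhz)
        # Every island is slowed by the same factor, so the frequency
        # ratios between islands are preserved on the emulation platform.
        emulated = [sf / df_soc for sf in sampled]
        runs.append({"sampled": sampled, "df_soc": df_soc, "emulated": emulated})
    return runs

# Hypothetical two-island SoC: (mean, sigma) in MHz and emulation limits in MHz.
runs = run_monte_carlo([(400.0, 20.0), (250.0, 10.0)], [50.0, 40.0], num_samples=3)
for run in runs:
    print(round(run["df_soc"], 3), [round(f, 2) for f in run["emulated"]])
```

Because the SoC-wide factor is the maximum over all islands, the island that sets it runs exactly at its emulation limit and every other island runs below its own limit.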
1) Correlated Sampling Using Control Variables:
In the conventional Monte Carlo analysis, for a random variable $Y$, the estimated mean $E(Y)$ and the standard error in the estimated mean $SE(Y)$ are given by (2). As can be seen from (2), the standard error in the estimate grows with the variance of the sampled random variable $Y$ (it is proportional to the standard deviation of $Y$). Thus, the number of Monte Carlo samples $N$ needed to achieve a desired error bound $SE(Y)$ can be reduced if the variance of $Y$ is smaller

$$E(Y) = \frac{1}{N}\sum_{i=1}^{N} Y_i, \qquad SE(Y) = \sqrt{\frac{\mathrm{Var}(Y)}{N}}. \tag{2}$$
Now, if $Y$ is the output of a Monte Carlo simulation run, a random variable $\tilde{Y}$, obtained from the same simulation run, is called a control variable if $\tilde{Y}$ and $Y$ are correlated (negatively or positively) and the expectation of $\tilde{Y}$ is known. Control variables can be used for variance reduction because of Theorem 1 [33].
Theorem 1: Let $Y_1, \ldots, Y_N$ be the output of $N$ independent simulation runs, and let $\tilde{Y}_1, \ldots, \tilde{Y}_N$ be the corresponding control variables, with $E(\tilde{Y})$ known. Let $\rho_{Y\tilde{Y}}$ be the correlation coefficient between $Y$ and $\tilde{Y}$. For all $\alpha \in \mathbb{R}$, the linear estimator $L$ given by

$$L = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \alpha\left(\tilde{Y}_i - E(\tilde{Y})\right)\right) \tag{3}$$

is an unbiased estimator of $E(Y)$, i.e., $E(L) = E(Y)$, and the minimal variance of $L$ is given by

$$\mathrm{Var}(L) = \frac{1}{N}\left(1 - \rho_{Y\tilde{Y}}^{2}\right)\mathrm{Var}(Y) \tag{4}$$

which is obtained for $\alpha = \mathrm{Cov}(Y,\tilde{Y})/\mathrm{Var}(\tilde{Y}) = \rho_{Y\tilde{Y}}\sqrt{\mathrm{Var}(Y)/\mathrm{Var}(\tilde{Y})}$.

Thus, the key idea behind correlated sampling is to use an unbiased estimator $L$, which has the same expectation as $Y$ but a much smaller variance. Algorithm 1 outlines the key steps involved in estimating $E(Y)$ using control variables. At the start of each Monte Carlo run, we first sample the component-level frequency distributions to obtain the operating frequencies of each component in the SoC. We then utilize the VESPA framework to obtain the system performance $Y_N$ at these target frequencies. We also compute the value of the control variable $\tilde{Y}_N$ at these sampled frequencies (explained in detail in Section IV-C2). We now compute the correlation coefficient $\rho_{Y\tilde{Y}}$ as well as the standard error, SE, in our estimate of $E(Y)$. We repeat this process until the standard error is below the required error bound. Note that in Algorithm 1, the mean of the control variable, $E(\tilde{Y})$, is a constant and can be precomputed either through analytical methods or by fast SW simulations.
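As an illustration of the estimation step, the following Python sketch averages $Y - \alpha(\tilde{Y} - E(\tilde{Y}))$ over a batch of samples and reports the standard error with and without the control variable. It is a sketch under stated assumptions rather than a transcription of Algorithm 1: the function name is hypothetical, $\alpha$ is set to the textbook control-variate choice $\mathrm{Cov}(Y,\tilde{Y})/\mathrm{Var}(\tilde{Y})$ estimated from the same samples, and the data are synthetic.

```python
import math
import random
import statistics

def control_variate_estimate(y, y_ctrl, ctrl_mean):
    """Estimate E(Y) using control values y_ctrl whose true mean is ctrl_mean."""
    n = len(y)
    mean_y = statistics.mean(y)
    mean_c = statistics.mean(y_ctrl)
    var_y = statistics.variance(y)       # sample variance of Y
    var_c = statistics.variance(y_ctrl)  # sample variance of the control
    cov = sum((a - mean_y) * (b - mean_c) for a, b in zip(y, y_ctrl)) / (n - 1)
    alpha = cov / var_c                  # variance-minimizing coefficient
    rho = cov / math.sqrt(var_y * var_c)
    estimate = mean_y - alpha * (mean_c - ctrl_mean)
    se_plain = math.sqrt(var_y / n)
    se_cv = math.sqrt(max(0.0, 1.0 - rho * rho) * var_y / n)
    return estimate, se_plain, se_cv, rho

# Synthetic data (assumption): Y is a noisy linear function of a hidden input x,
# and the control variable is the noise-free linear part, whose mean is 0.
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [100.0 + 5.0 * x + rng.gauss(0.0, 1.0) for x in xs]
y_ctrl = [5.0 * x for x in xs]

estimate, se_plain, se_cv, rho = control_variate_estimate(y, y_ctrl, ctrl_mean=0.0)
print(f"estimate={estimate:.3f} rho={rho:.3f} SE plain={se_plain:.3f} SE cv={se_cv:.3f}")
```

In a stopping rule of the kind described above, sampling would continue until the control-variable standard error drops below the required bound.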
As can be seen from Algorithm 1, the standard error in the estimate of $E(Y)$ is given by

$$SE(L) = \sqrt{\frac{\left(1 - \rho_{Y\tilde{Y}}^{2}\right)\mathrm{Var}(Y)}{N}}. \tag{5}$$

2) Analytical Performance Model:
In Section IV-C1, we showed that for correlated sampling to be effective, the control variable should be well correlated with the output of the Monte Carlo simulation. In the VESPA framework, the output of a simulation run is the total system performance for a given input of sampled island frequencies. Thus, we need our control variable to also compute the total system performance as a function of island frequencies. As noted in Section III, this is a difficult problem and needs to account for various complex interdependences. However, for reducing the number of Monte Carlo samples, the control variable only needs to have some correlation to the measured performance.
We now show how we construct such a control variable for any given SoC that has been divided into $N$ frequency islands. A simple linear model, which computes system performance $P$ for a given set of island frequencies $F_1, F_2, \ldots, F_N$, is shown in

$$P = P_{nom} + \sum_{i=1}^{N} c_i\,\Delta F_i \tag{6}$$

where $P_{nom}$ is the nominal system performance with no variations in island frequencies, $\Delta F_i = F_i - F_{i,nom}$ is the change in operating frequency of the $i$th island, and $c_i$ captures the sensitivity of system performance to changes in the $i$th island's operating frequency. Equation (6) can be simplified as shown in

$$P = c_0 + \sum_{i=1}^{N} c_i F_i \tag{7}$$

where $c_0 = P_{nom} - \sum_{i=1}^{N} c_i F_{i,nom}$ collects the constant terms.
The coefficients $c_0, c_1, \ldots, c_N$ are calibrated dynamically from the actual emulation data. In the VESPA framework, we take the first few samples of the Monte Carlo driven emulation phase and perform a least-squares fit of the data to obtain the required coefficients. Since the above linear model assumes that the system sensitivity coefficients are constants, it fails to account for various complex factors, such as intercomponent dependences and shared resources. However, as will be shown in Section V-C, the linear model succeeds in reducing the number of samples required for Monte Carlo emulation.
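The calibration step can be illustrated with a small, dependency-free Python sketch that fits the linear model $P \approx c_0 + \sum_i c_i F_i$ to a handful of samples by solving the least-squares normal equations. The function names and the ten synthetic calibration samples are assumptions made for illustration; the sketch does not reproduce the fitting code used in VESPA.

```python
import random

def fit_linear_model(freqs, perf):
    """Least-squares fit of perf ~ c0 + sum_i c_i * F_i; returns [c0, c1, ..., cN]."""
    rows = [[1.0] + list(f) for f in freqs]
    k = len(rows[0])
    # Normal equations (A^T A) c = A^T p, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * p for r, p in zip(rows, perf))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (aug[i][k] - tail) / aug[i][i]
    return coeffs

def predict(coeffs, f):
    """Evaluate the fitted model at island frequencies f (the control variable)."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], f))

# Ten hypothetical calibration samples for a two-island SoC (frequencies in MHz).
rng = random.Random(7)
freqs = [(rng.gauss(400.0, 20.0), rng.gauss(250.0, 10.0)) for _ in range(10)]
perf = [10.0 + 0.05 * f1 + 0.02 * f2 for f1, f2 in freqs]

coeffs = fit_linear_model(freqs, perf)
print([round(c, 4) for c in coeffs])
```

On real emulation data the fit would not be exact; the residual is precisely the part of system behavior (shared resources, intercomponent dependences) that the linear model cannot capture.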
V. EXPERIMENTAL METHODOLOGY AND RESULTS
In this section, we first describe our experimental setup and show that considerable speedup can be achieved using the VESPA framework compared with the existing techniques. We then present a brief description of various insights obtained from performing the variation-aware analysis of three different SoCs, viz., an MPEG encoder, an 802.11b MAC processor, and a multicore JPEG decoder.
A. Experimental Methodology
In the component variability characterization phase, Synopsys Design Compiler was used to synthesize various SoC components using the IBM 45-nm technology library. Variations were modeled in accordance with [35], which combines both the within-die and die-to-die components and models variations of the order of σ/μ = 0.1 in L and σ/μ = 0.03 in V th based on ITRS projections for 90 nm. The SSTA analysis was performed using Synopsys Primetime-VX [30] to obtain the frequency distributions of different components. The access time of the memory components was estimated using CACTI 5.3 [36], and the variations in memory frequency were modeled in accordance with [37]. In the VESPA framework, we perform variability characterization at the coarse granularity of SoC components and, therefore, do not model the impact of spatial correlation in within-die variations across various SoC components. We note, however, that the VESPA framework is general and can easily model the impact of spatial correlations by appropriately incorporating this correlation during the Monte Carlo sampling process.
We used an Altera DE3 board equipped with a Stratix III EP3SL150 FPGA as our emulation platform. The Stratix III EP3SL150 FPGA comprises eight reconfigurable PLL units, the frequencies of which can be reconfigured at a granularity of 0.2 MHz. For compiling the RTL models of the SoC for the FPGA platform, we utilized the Quartus 10.0 design flow. The Nios2 IDE was then used to create the SW control loop that controls the runtime emulation flow. The runtime of the proposed framework was compared with the models of the SoCs simulated using Gezel [38], an HW/SW cosimulation tool, and ModelSim, a cycle-accurate RTL simulator. Both Gezel and ModelSim simulations were performed on an Intel Xeon 3.2-GHz processor.
We consider three example SoCs: an 802.11b MAC processor, a multicore JPEG decoder, and an MPEG encoder. The MAC and JPEG decoder systems were described in detail in Section III. Fig. 11 shows the block diagram of the MPEG encoder system. The input frames to be encoded are stored in the frame buffer. The MPEG controller coordinates the transfer of blocks from the frame buffer to the input buffer, and the motion estimation is performed by comparing the current and previous frames. The DCT block computes the DCT and stores the compressed data in the output buffer. The performance of the system is primarily determined by the time required to read packets from the frame buffer and the time taken to perform motion estimation. The CPU and the DCT have a relatively small contribution to the system critical path. Fig. 12 shows the box-whisker plot of the delay of the various components of the MPEG encoder SoC. As noted earlier, due to intrinsic variations in component-level properties, such as logic depth and number of critical paths, the delay distribution profiles of the different components vary significantly.
B. Speedup
In this section, we report the speedup offered by utilizing the VESPA framework compared with other simulation-based methodologies, viz., HW/SW cosimulation and cycle accurate RTL simulation. Table I compares the number of system cycles executed per second using ModelSim, Gezel, and VESPA. It also lists the reduction in number of Monte Carlo samples achieved by enhancing the VESPA framework with a correlated sampling scheme (VESPA-CS). The number of samples is obtained by continuing the sampling process until the standard error of the Monte Carlo analysis is below the target error bound of 0.5%. As shown in Table I , on average, our proposed framework (VESPA-CS) achieves a speedup of 250 000× compared with RTL simulation and 200× with respect to HW/SW cosimulation using Gezel. These results clearly demonstrate that the proposed framework can be used to efficiently model the effect of process variations on system performance.
C. Sample Reduction Achieved by Correlated Sampling
In this section, we demonstrate the benefits of incorporating the proposed correlated sampling technique into the Monte Carlo driven emulation phase of the VESPA framework. We quantify the sample reduction engendered by correlated sampling for all three example SoCs. As mentioned earlier in Section IV-C, for correlated sampling to be most effective, there needs to exist a high degree of correlation between the control variable and the actual measured system performance. Fig. 13 shows, for each of the three systems, the scatter plot of the actual measured system performance versus that predicted by the linear model. From the graphs, we observe a high degree of correlation does indeed exist, even though the worst case difference between the control variable and the actual system performance might be high (as much as 33% for the MAC SoC).
As noted earlier, in the conventional Monte Carlo analysis, the standard error in the estimate (SE) is given by $SE = (\sigma^2/N_{MC})^{1/2}$, where $N_{MC}$ is the number of samples used in the conventional Monte Carlo analysis. In the case of correlated sampling, the standard error is given by $SE = ((1-\rho^2)\,\sigma^2/N_{CS})^{1/2}$, where $N_{CS}$ is the number of samples used in the correlated sampling case and $\rho$ is the correlation coefficient between the actual performance and the predicted performance. Thus, for the same standard error, the number of samples in the correlated sampling case is given by $N_{CS} = N_{MC}(1-\rho^2)$. However, in this case, we additionally utilize the first ten samples from the Monte Carlo analysis for calibrating our performance model, and thus, the total number of samples utilized is $N_{CS} = N_{MC}(1-\rho^2) + 10$. Thus, correlated sampling reduces the number of samples by the factor

$$\frac{N_{MC}}{N_{CS}} = \frac{N_{MC}}{N_{MC}(1-\rho^2) + 10}.$$
The measured correlation coefficients are 0.85, 0.97, and 0.86 for the MAC, MPEG, and JPEG systems, respectively. This high correlation thus translates into a large reduction in the number of Monte Carlo samples needed for a given target error metric. The correlation coefficient is highest for SoCs in which system-level complications, such as multiple paths of execution and shared resources, are less dominant, and hence, the analytical performance model can more accurately predict the actual measured performance. Fig. 14 shows the standard error in Monte Carlo estimates versus the number of samples. In Fig. 14, we can clearly see that correlated sampling significantly reduces the number of samples required for any target error. For a target error bound of 0.5%, the reduction in the number of samples is 2.76×, 9.67×, and 2.81× for the MAC, MPEG, and JPEG systems, respectively. We can clearly see that systems whose control variable has a higher degree of correlation (MPEG > JPEG > MAC) achieve the most reduction in the number of samples.
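The dependence of the reduction factor on $\rho$ and on the fixed calibration cost can be explored with a short Python sketch. The function names and the baseline sample count of 1000 are hypothetical; only the correlation coefficients and the ten calibration samples are taken from the text above.

```python
import math

def correlated_sampling_samples(n_mc, rho, calibration_samples=10):
    """Total samples for correlated sampling: N_MC * (1 - rho^2) plus calibration."""
    return math.ceil(n_mc * (1.0 - rho ** 2)) + calibration_samples

def reduction_factor(n_mc, rho, calibration_samples=10):
    """Ratio N_MC / N_CS for the same target standard error."""
    return n_mc / (n_mc * (1.0 - rho ** 2) + calibration_samples)

# Hypothetical baseline of 1000 conventional samples for each reported rho.
for name, rho in [("MAC", 0.85), ("MPEG", 0.97), ("JPEG", 0.86)]:
    print(name, correlated_sampling_samples(1000, rho), round(reduction_factor(1000, rho), 2))
```

Because the calibration samples are a fixed cost, the reduction factor approaches its upper limit of $1/(1-\rho^2)$ only when the conventional sample count is large; for small sample counts the ten calibration samples consume a large share of the gain.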
D. Variation-Aware System Analysis
In this section, we describe the insights obtained from utilizing the VESPA framework to study the variation-aware system design.
1) MPEG Encoder SoC:
In this section, we focus on the problem of obtaining an island partitioning scheme for the MPEG system. We demonstrate that there exists a tradeoff between the inherent variation tolerance offered by larger numbers of islands and the overheads introduced by asynchronous communication interfaces. We also show that, for a given number of islands, the impact of variations on system performance greatly depends on the chosen component-to-island mapping. Fig. 15 shows six different partitions of the MPEG system considered in our analysis. Fig. 16 shows the corresponding system performance distribution for these configurations. In the case of a single-island configuration (1 island), the slower components end up determining the overall SoC operating frequency, resulting in suboptimal system performance. As the number of islands increases (2 island-1 and 3 island-1), more components can be operated at their best possible frequency, resulting in improved system performance. However, an overly fine-grained partition (4 island) results in performance degradation due to the increased latency overheads associated with interisland communication interfaces.
For a fixed number of islands, both the system performance and its sensitivity to components depend on the component-to-island mapping. For example, let us compare the 3 island-1 and 3 island-2 architectures shown in Fig. 15 . Fig. 17 shows the sensitivity of system performance to the variations in island frequencies for both configurations. In the 3 island-1 configuration, the frame buffer and the MPEG controller belong to the same frequency island. This obviates the need for high-latency asynchronous communication interfaces between these two components, thereby reducing the latency associated with fetching macroblocks. In this configuration, system performance is determined mainly by the ME block, which makes it more sensitive to the variations in the ME + DCT island than the FB + MC island. On the other hand, in the 3 island-2 configuration, the frame buffer and the MPEG controller are in different islands, and the memory read operations become a substantial part of the system critical path. This makes the system performance sensitive to the variations in frequency of the frame buffer and ME block.
2) 802.11b MAC Processor SoC:
In this section, we perform an exhaustive design space exploration for the MAC system described in Section III and demonstrate that the design schemes that appear to be optimal under nominal operating conditions can become suboptimal under the impact of variations. Fig. 18 shows the throughput that can guarantee 95% yield for all possible island partitions and component-to-island mappings. System performance again increases initially with increasing number of islands (until three islands). However, due to the high overheads involved in interisland communication, further increasing the number of islands deteriorates system performance. For a given number of islands, the component-to-island mapping also has a huge impact on the system performance, as shown by the vertical spread in Fig. 18.
[Figure caption: Performance distributions for two three-island designs for the 802.11 MAC processor.]
Consider the two points encircled in Fig. 18 , which correspond to two three-island configurations, one in which the packet buffer and the CRC component are mapped to the same island (CRC + PB) and another in which the packet buffer is grouped along with the WEP component (WEP + PB). Fig. 19 shows the system performance distribution profile for these two configurations. When variations are ignored, both configurations would yield nearly identical performance. In both the configurations, the individual component's throughput is limited by memory access times. In the WEP + PB configuration, the CRC component has to communicate across an island partition, leading to higher memory access times. As a result, system performance becomes highly sensitive to CRC variations. Similarly, in the CRC + PB configuration, system performance is more sensitive to the WEP component. From the SSTA analysis, it turns out that the WEP component has much larger σ in the clock period than the CRC block. Thus, as shown in Fig. 19 , the CRC + PB configuration is more severely impacted by variations than the WEP + PB configuration. As a result, for the same yield requirement, the WEP + PB mapping is clearly better than CRC + PB mapping. In general, architectures in which high variability components are not a large part of the system critical path tend to be more tolerant to variations.
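The yield-based comparison used here can be made concrete with a small Python sketch: for a set of Monte Carlo throughput samples, the throughput that guarantees a given yield is the value met or exceeded by at least that fraction of the samples. The function name and the two synthetic distributions (equal mean, different spread) are assumptions for illustration and are not the measured MAC data.

```python
import random

def throughput_at_yield(throughputs, target_yield=0.95):
    """Largest sampled throughput that at least target_yield of the samples reach."""
    ordered = sorted(throughputs)
    # Number of samples that are allowed to fall below the returned value.
    k = int((1.0 - target_yield) * len(ordered))
    k = min(k, len(ordered) - 1)
    return ordered[k]

# Two hypothetical configurations with the same nominal throughput but
# different sensitivity to variations (narrow versus wide distribution).
rng = random.Random(3)
narrow = [rng.gauss(100.0, 2.0) for _ in range(2000)]
wide = [rng.gauss(100.0, 8.0) for _ in range(2000)]

print(round(throughput_at_yield(narrow), 2), round(throughput_at_yield(wide), 2))
```

Two mappings with nearly identical nominal performance can therefore differ substantially in the throughput they can guarantee at a fixed yield, which is the effect the two encircled configurations illustrate.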
3) Multicore JPEG Decoder SoC:
In this section, we utilize the VESPA framework to study the nonintuitive impact of resource sharing and multiple system-level critical paths on determining the optimal island partitioning for the JPEG decoder SoC. We consider two island partitioning schemes for the JPEG decoder: a two-island scheme in which the two processing units (CPUs) are grouped together and a three-island scheme wherein each component is a separate frequency island.
In the three-island system, the two processors do not directly interact with each other and, hence, introduce no additional latency overheads. As a consequence, one would expect it to perform better than its two-island counterpart under variations. However, due to the existence of multiple critical paths in the system, either processor operating at a comparatively faster speed than the other does not lead to an improved overall system performance. In addition to the above considerations, there exist two other nonintuitive effects that play a role in determining the system performance under variations. As explained in Section III, when the processors' operating frequencies differ, the faster processor's memory accesses get clustered together, leading to an asymmetric bus contention profile for the slower processor and thereby degrading the overall system performance.
Another factor that impacts performance is the varying degree of compute intensiveness associated with the application code being executed on each processor. If a code segment involves a lot of memory accesses (less compute intensive), changing the frequency of the processor has a lesser impact on its performance as it spends a large proportion of its cycles waiting on memory accesses. Fig. 20 shows this effect by plotting the normalized time taken by each processor to execute its corresponding code segment for differing operating frequencies. As shown in Fig. 20, CPU 1's performance is more sensitive to the changes in its operating frequency as compared with CPU 2. Now consider a scenario wherein CPU 2 slows down due to variations. In the two-island scheme, CPU 1 is forced to operate at the same lower frequency as CPU 2, and the slowdown thus has a much larger impact on system performance (due to CPU 1's higher sensitivity to frequency variations). In the three-island scheme, by contrast, the operating frequencies of CPUs 1 and 2 are decoupled, and therefore, the degradation in system performance is mostly caused by CPU 2, which has a much smaller sensitivity to frequency changes.
All of the above effects interact, resulting in the system performance distribution shown in Fig. 21. As can be seen, the three-island scheme performs only marginally better than its two-island counterpart.
In summary, our experiments clearly establish the value of the VESPA framework in driving the variation-tolerant system design.
VI. CONCLUSION
In this paper, we presented VESPA, a framework that analyzes the impact of variations on system-level performance. The VESPA framework utilizes emulation to significantly speed up the performance analysis without sacrificing the generality and accuracy of simulation. The framework also employs an intelligent variance reduction scheme, namely, correlated sampling, to further reduce the number of samples needed for the Monte Carlo analysis. VESPA achieves, on average, two orders of magnitude speedup over the state-of-the-art HW/SW cosimulation tools and four orders of magnitude speedup over RTL simulation tools. We also demonstrated how the VESPA framework helps system designers incorporate the impact of variations into their design process and enables them to create variation-tolerant system architectures.
