This article studies the interplay between the performance, energy, and reliability (PER) of parallel-computing systems.
In digital CMOS circuits, for any specific hardware platform, a higher supply voltage (which we call V) usually permits a higher operating (clock) frequency (which we call F) and hence a higher throughput. Dynamic voltage and frequency scaling (DVFS) scales V and F together to obtain the best throughput under a given power budget or to save power for a given throughput requirement. 1 We can increase throughput for a given power limit or reduce power while maintaining throughput by combining DVFS with scaling the number of computation units if the computation is parallelizable. 2
A major challenge of precisely analyzing the effectiveness of using parallelization for these goals is to determine the parallelizability of any particular execution. This involves complex issues such as software and hardware architecture details and must be modeled on a per-execution basis. 3 Another challenge is that quantitative studies of power or throughput improvements for any DVFS decision also need complicated execution-dependent models. 4
This article explores the interplay between DVFS and parallelization scalability with respect to performance and power. To capture the interplay, we use the concept of a reliable operating region (ROR), which we establish from the knowledge of system reliability through experiments or simulations. The ROR therefore encapsulates platform and application specifics, thus helping to make any subsequent analysis steps generic.
In particular, we focus on the effectiveness of combining DVFS with parallelization scaling to improve energy efficiency. To facilitate cross-platform comparisons of this effectiveness, we define a quantitative metric η. Our ROR-based method can explore parallelization scaling across a platform's voltage range, from subthreshold to superthreshold regions. The explorations and models we present here confirm and explain the general view that combining DVFS and parallelization scaling produces the best advantage when V is scaled down to near-threshold voltages. This is called near-threshold computing. 5 However, current commercial platforms tend to avoid this region, as we show later.
To exploit our findings, we developed a PER (performance, energy, and reliability) modeling tool, which lets users find optimal operating points. The tool also has applications for academic teaching and research. Our research has shed light onto the permissible points of operation under various system design or implementation constraints. The models we derived are not for use in absolute-value predictions but are for exploring PER and scalability relationships through parallelization scaling by studying the effectiveness metric η.
THE RELIABLE OPERATING REGION
Operational reliability depends on the system platform, including various hardware-related design-time and runtime decisions as well as applications and their requirements. We can use different metrics to describe the degree of reliability. Our method is agnostic to the type of reliability metric as long as it facilitates a fair comparison.
A popular reliability metric for comparing systems and applications is mean time between failures, which assumes that failure can be fairly defined in each comparison. For instance, failure can be defined as losing accuracy, and accuracy metrics such as the signal-to-noise ratio (SNR), widely used in information engineering, can be much easier to measure in experiments but are application-specific.
This article does not try to study the relationships between different reliability metrics but assumes that, for any problem being studied, a metric or set of metrics can be agreed on. We use SNR here to demonstrate the execution-independent comparison of η applied to execution-dependent data (such as SNR), as we describe later. Furthermore, we focus on the effects of voltage, frequency, and parallelization scaling and do not address reliability's dependence on other (for example, microarchitecture and application) design-time and runtime decisions. For any given application design and microarchitecture choice, our proposed models and techniques apply to voltage-, frequency-, and parallelization-scaling decisions.
To achieve any particular value of any reliability metric, a system must operate within voltage and frequency constraints. For instance, reducing V might increase the soft error rate (SER). 6 Conversely, increasing V increases temperature, accelerating aging and the probability of breakdowns. 7 This leads to V min ≤ V ≤ V max.
An execution might require more than a certain level of throughput, θ, to be meaningful. 8 This leads to θ ≥ θ min .
We can address fault tolerance by requiring computational redundancy, which is reflected in an increased aggregated θ min . The parallelization scaling can then capture the tradeoff between spatial redundancy and time redundancy. We can increase throughput to give additional time for error handling to provide temporal redundancy. We can also use extra cores for spatial redundancy, running multiple copies of an execution on multiple cores. Both approaches increase θ min . The amount of available power, P max , limits the system's behavior. 2, 9 If any hardware runs with a clock too fast for its V, computations might not complete in time, leading to reductions in any reasonable reliability metric. With aging, to maintain the same θ, V must increase. 10 We explore the concept of the ROR in the context of a throughput-voltage (θ-V) space. The ROR for a platform in the θ-V space is bounded by V max , V min , θ min , P max , and the timing reliability limit (TRL) (see Figure 1 ). For a look at the ROR of a real platform and application, see Figure 2 .
The ROR boundaries are directly related to the physical causes of all types of reliability issues attributable to computation. It should be possible to express any reasonable reliability metric with these boundaries.
This method caters to both execution-specific RORs, which are more precise but require more effort to obtain, and conservative RORs (see Figure 1). Commercial systems typically provide RORs in the form of predefined sets of conservative DVFS points. This allows for execution-independent studies.
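The five ROR boundaries translate naturally into a membership test for an operating point in the θ-V space. The following is a minimal Python sketch of that idea, not part of our tool: the `ROR` class, its field names, the linear TRL, the power model, and every numeric value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ROR:
    """Hypothetical container for the five ROR boundaries in the theta-V space."""
    v_min: float                            # lower voltage limit (volts)
    v_max: float                            # upper voltage limit (volts)
    theta_min: float                        # minimum meaningful throughput
    p_max: float                            # power budget (watts)
    trl: Callable[[float], float]           # max reliable throughput at voltage v
    power: Callable[[float, float], float]  # power drawn at (v, theta)

    def contains(self, v: float, theta: float) -> bool:
        """Return True if the point (v, theta) satisfies every boundary."""
        return (
            self.v_min <= v <= self.v_max
            and theta >= self.theta_min
            and theta <= self.trl(v)
            and self.power(v, theta) <= self.p_max
        )


# Illustrative numbers only: a linear TRL and a V^2 * theta power model.
ror = ROR(
    v_min=0.6, v_max=1.2, theta_min=100.0, p_max=1.0,
    trl=lambda v: 1000.0 * (v - 0.4),
    power=lambda v, theta: 1e-3 * v * v * theta,
)

print(ror.contains(0.9, 400.0))   # satisfies all five boundaries
print(ror.contains(0.9, 600.0))   # lies above the TRL at 0.9 V
```

A conservative, execution-independent ROR would correspond to tightening each field, for example by lowering the `trl` callable or raising `v_min`.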
EXPLORING DVFS AND PARALLELIZATION SCALING
Because only P max and the TRL depend on both θ and V, only those boundaries affect DVFS and parallelization scaling.
Switching-power considerations
Switching (dynamic) power is related to F and V as
$$P = A V^{2} F, \tag{1}$$
where A is a coefficient influenced by the hardware area and switching activity.
Computational throughput is usually expressed in instructions per second (IPS); this metric is related to F through instructions per clock cycle (IPC or u); that is, θ = uF. Assuming that a certain execution has a constant IPC, we can plot constant P max curves in the θ-V space, following the equation labeled D in Figure 1 (see Figure 3a). For an execution with a variable IPC, we can use the average IPC to establish average P max curves, or use the maximum IPC to establish conservative P max curves.
Next, we explore parallelization scaling, initially assuming ideal scaling with θ k = kθ, where k is the number of computation units (we discuss nonideal parallelization scaling later). Under these conditions with a scaling factor of k (which we call k-scaling), A is scaled in the same way:
$$A_k = kA, \tag{2}$$
where A k is the hardware's A coefficient after k-scaling.
If the power budget does not change after k-scaling, for each computation unit in a scaled setup, Equation 1 becomes
$$P = A V^{2} F = \frac{P_{\max}}{k},$$
where P is the power dissipation of a single computation unit, because k units share the power budget. The per-unit power reduction by a factor of k usually reduces a unit's maximum throughput by less than k. This reduced maximum throughput multiplied across k units provides a net increase in usable throughput (see Figure 3a), which motivates combining DVFS with parallelization.
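To illustrate why sharing a fixed power budget across k units can raise total throughput, the sketch below finds, for several values of k, the voltage at which the k-scaled timing limit meets the power limit. It is an illustration under stated assumptions rather than a model of any measured platform: it assumes switching power of the form A·V²·F, ideal scaling, a hypothetical linear TRL (`trl_frequency`), and made-up values for A, u, and P max.

```python
def trl_frequency(v):
    """Hypothetical timing reliability limit: max reliable clock (Hz) at voltage v."""
    return 2.0e9 * (v - 0.4)


def total_throughput(v, k, a, u, p_max):
    """Total throughput of k identical units at voltage v under a shared budget.

    Each unit runs no faster than the TRL allows and no faster than the
    frequency at which its switching power a * v^2 * F reaches p_max / k.
    """
    f_power = p_max / (k * a * v * v)        # per-unit frequency allowed by power
    f_unit = min(trl_frequency(v), f_power)  # the tighter of the two limits
    return k * u * f_unit                    # ideal scaling: theta_k = k * u * F


def best_point(k, a, u, p_max, v_lo=0.6, v_hi=1.2, steps=600):
    """Grid-search the voltage range; return (best throughput, voltage)."""
    grid = [v_lo + (v_hi - v_lo) * i / steps for i in range(steps + 1)]
    return max((total_throughput(v, k, a, u, p_max), v) for v in grid)


# Illustrative coefficients: A in W/(V^2 Hz), u in instructions/cycle, P_max in W.
A, U, P_MAX = 1.0e-9, 1.0, 1.0

for k in (1, 2, 4):
    theta, v = best_point(k, A, U, P_MAX)
    print(f"k={k}: V={v:.3f} V, total throughput={theta:.3e} IPS")
```

With these assumed values, the best voltage falls as k grows while the total throughput rises by less than a factor of k, which is the qualitative behavior described above.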
As an example of determining the ROR from experimental data, we use the P, F, and V data from an asynchronous SRAM (static RAM) controller. 11 Consideration of switching power alone is valid only in the V range where the switching power dominates. This turns out to be 0.6 ≤ V ≤ 1.2 volts, where the experimental data shows A to be nearly constant. The SRAM controller is self-timed and hence always runs at the highest speed that maintains 100 percent reliability when operating within the aforementioned V range. It also has two computational actions, read and write, each with a constant IPC. This results in the TRL curves in Figure 3b. Although this hardware is only a memory controller and not a full processor core, it is a CMOS combinatorial-logic block, and we can explore core scaling with its curves without losing generality. Similar TRLs have been observed from experimental data on many combinatorial-logic computation units, including full cores running standard benchmarks. 12 The ROR (here without considering V max , V min , or θ min ) decreases when P max decreases.
Figure 1. The reliable operating region (ROR) is bounded by the high and low voltage limits (V max and V min ; see the equations labeled A and B), the throughput (θ) requirement line (see the equation labeled C), the power limit (P max ; see the equation labeled D), and the timing (clock) reliability limit (TRL; labeled E). The TRL is usually obtained through experiments or specified by the platform vendor. An exact ROR is application dependent and provides for the most efficient operation. Shrinking the ROR in the directions of the arrows will eventually provide conservative, application-independent operating points. Various reliability factors affect these boundaries. A more stringent soft error rate (SER) requirement pushes V min to the right (V must increase to reduce the SER). Aging decreases the TRL because a higher V is necessary to maintain the same θ. However, a higher V accelerates aging; hence, a more stringent aging-speed requirement pushes V max to the left, reducing how much V can be raised.
Additional power consumption considerations
Below approximately 0.6 volts in the previous example, Equation 1 no longer approximates the total power because leakage power becomes significant. Instead of using the complex power equations that take leakage power into account, we can use the observed power from experiments to draw the constant-power curves.
To determine the power boundary's shape for any P max , given Equation 2, for each point i for which experimental power data exists, we calculate the maximum scaling factor k maxi :
$$k_{\max i} = \frac{P_{\max}}{P_i},$$
where P i is the experimental single-core power observed at i.
Plotting θ = k maxi θ i produces the constant-power curve P = P max in the θ-V space. Figure 3b also shows the P = P max curves for the asynchronous SRAM controller. Researchers have observed similar constant-power curve shapes in other platforms. 12 Scaling's benefit decreases when leakage power becomes important. Scaling with a factor of four from 0.6 volts leads to slightly more than 0.4 volts with a roughly one-third throughput increase. Scaling further may actually reduce θ max .
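The construction of a constant-power curve from measured single-unit data can be sketched as follows. This is a minimal illustration that assumes the maximum scaling factor at each point is the power budget divided by the measured single-unit power, treated as a real number, and that scaling is ideal. The function name `constant_power_curve` and the measurements are hypothetical; they are not the SRAM controller data.

```python
def constant_power_curve(points, p_max):
    """Map measured single-unit points onto the P = p_max boundary.

    points: iterable of (voltage, throughput, power) for one computation unit.
    Returns a list of (voltage, k_max_i, scaled_throughput), where
    k_max_i = p_max / P_i and scaled_throughput = k_max_i * theta_i.
    """
    curve = []
    for v, theta_i, p_i in points:
        k_max_i = p_max / p_i
        curve.append((v, k_max_i, k_max_i * theta_i))
    return curve


# Hypothetical measurements (volts, IPS, watts); leakage is assumed to
# dominate at the lowest voltage, so power falls more slowly than throughput.
measured = [
    (0.4, 0.05e9, 0.010),
    (0.6, 0.40e9, 0.060),
    (0.9, 1.00e9, 0.300),
    (1.2, 1.60e9, 0.900),
]

for v, k, theta in constant_power_curve(measured, p_max=0.9):
    print(f"V={v:.1f} V  k_max={k:6.2f}  theta={theta:.2e} IPS")
```

In this made-up data set the scaled throughput peaks at 0.6 volts and drops at 0.4 volts, mirroring the observation that scaling further into the leakage-dominated region may reduce the maximum throughput.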
When P max increases, the scope for scaling further is enhanced. The k = 16 TRL intersects the one-fourth P max boundary at a lower V (worse scaling) than where it intersects the P max boundary.
Generally, a system's design may be limited by P max and the hardware availability limit k max . The k-scaling characteristics based on the ROR shown in Figure 3 help designers find the best P for a given k max and the best k for a given P max .
HOW THE TRL INFLUENCES THROUGHPUT AND POWER
Among the boundaries of the ROR, the TRL and its relationship with the constant-power curve have the most significant effects on throughput and power. This is because the most power-efficient operating point tends to be where these two curves meet.
The binormalized ROR
Whereas constant-power curves' qualitative shapes are dictated by CMOS fundamentals, the TRL curves' shapes result from platform design decisions. In the context of parallelization scaling, the TRL curve shapes influence the tradeoffs between throughput, power, and reliability.
Assuming k-scaling, we consider the general case in which k is a real number. Let
$$V_k = \alpha_V V, \qquad F_k = \alpha_F F, \tag{3}$$
where V k and F k are the k-scaled voltage and frequency and α V and α F are the voltage- and frequency-scaling ratios. The unscaled switching power follows Equation 1, and throughput is
$$\theta = uF.$$
Assuming ideal scaling, the k-scaled power and throughput are
$$P_k = k A V_k^{2} F_k = k \alpha_V^{2} \alpha_F P = \alpha_P P, \tag{4}$$
$$\theta_k = k u F_k = k \alpha_F \theta = \alpha_\theta \theta, \tag{5}$$
where α P is the power-scaling ratio and α θ is the throughput-scaling ratio. The scaling point 〈V k , F k 〉 must fall within the ROR. Additionally, k must not exceed k max .
Figure 2 (panels: power consumption; signal-to-noise ratio). The graphs show the ROR with the signal-to-noise ratio (SNR) as the accuracy metric, assuming a reliability metric related to accuracy. Generally, no matter what the reliability metric is, a more stringent requirement shrinks the ROR while a more relaxed one expands it. For instance, if the SNR requirement is relaxed in Figure 2b, the ROR increases upward. F = frequency.
Instead of having to work in a platform's specific θ-V space (see Figure 4a), ratios allow working in a binormalized α F -α V space (see Figures 4b through 4d). This leads to platform independence and better comparisons between multiple platforms and between different scaling regions of the same platform. These commercial systems display higher V min (much higher than their threshold voltages) and steeper TRLs, resulting in smaller RORs after binormalization (see Figure 4c), compared to the SRAM controller. Within these voltage ranges, switching power dominates.
Frequency vs. voltage in practical systems
For any platform, DVFS along different parts of the TRLs may result in different α V and α F values. Next, we look at the implications of this.
General k-scaling to reduce power or improve throughput
A popular measure for system efficiency is power-normalized throughput, which is the amount of computation per unit of energy. The unit for such a measure is instructions per second per watt (IPS/W). Here, this measure is given by θ/P before scaling and θ k /P k after scaling.
We can measure the effectiveness of k-scaling together with DVFS by comparing IPS/W before and after DVFS and k-scaling; that is, the larger η = (θ k /P k )/(θ/P) is, the better. From Equations 1, 3, 4, and 5, we derive
$$\eta = \frac{\theta_k / P_k}{\theta / P} = \frac{\alpha_\theta}{\alpha_P} = \frac{1}{\alpha_V^{2}}. \tag{6}$$
In other words, within the ROR, a smaller α V improves IPS/W when we consider only switching power and do not worry about specific throughput or power requirements. So, the tendency would be to scale voltage down as far as possible (α V → min). In Figure 4b, scaling to α V2 is better than scaling to α V1 in terms of IPS/W. Binormalization helps to eliminate IPC (u) from the equation to allow for further comparison across platforms. Different platforms might have different TRLs. It is important to investigate how these boundaries, and other limits of the RORs, influence the effectiveness of k-scaling with DVFS.
Figure 4 panel (axis: normalized frequency). How overclocking affects scalability, using the ARM big.LITTLE architecture as an example. The nonoverclocked F max for the A15 is 1.8 GHz; the corresponding binormalized curve matches that of the A7. When overclocked to 2.0 GHz, the A15 produces a shallower curve, indicating lower power efficiency at F max = 2.0 GHz. This results in greater scalability when scaling from this operating point rather than the nominal F max of 1.8 GHz.
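The effectiveness metric η can be checked numerically from the scaling ratios. The sketch below assumes switching power only (P proportional to V²F) and ideal scaling, so that the power-scaling ratio is k·α V²·α F and the throughput-scaling ratio is k·α F. The function name `scaling_ratios` and the input values are illustrative assumptions.

```python
def scaling_ratios(k, alpha_v, alpha_f):
    """Return (alpha_p, alpha_theta, eta) for ideal k-scaling combined with DVFS.

    Assumes switching power only (P proportional to V^2 * F) and
    theta_k = k * u * F_k, so the IPC u cancels out of every ratio.
    """
    alpha_p = k * alpha_v ** 2 * alpha_f   # power-scaling ratio P_k / P
    alpha_theta = k * alpha_f              # throughput-scaling ratio theta_k / theta
    eta = alpha_theta / alpha_p            # IPS/W after scaling over IPS/W before
    return alpha_p, alpha_theta, eta


# Illustrative values: four units, voltage scaled to 70%, frequency to 50%.
alpha_p, alpha_theta, eta = scaling_ratios(k=4, alpha_v=0.7, alpha_f=0.5)
print(f"alpha_P={alpha_p:.3f}  alpha_theta={alpha_theta:.3f}  eta={eta:.3f}")
```

Because k and α F cancel in the ratio, η under these assumptions depends only on α V, which is why the voltage-scaling ratio alone determines the IPS/W improvement.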
Scaling along different TRLs to the same α V
Here, we compare two systems scaled to the same α V -a situation shown in Figure 4b with both platforms scaled to α V1 . This happens when V min constrains the reduction of α V . The platforms' different TRLs lead to different α F values: α F1 and α F2 . Both systems achieve the same IPS/W improvements given that their α V is the same.
From Equations 4 and 5, we can find P 1 and P k1 for system 1 and P 2 and P k2 for system 2. These k-scaling operations result in the following power-scaling ratios:
$$\alpha_{P1} = k_1 \alpha_V^{2} \alpha_{F1}, \qquad \alpha_{P2} = k_2 \alpha_V^{2} \alpha_{F2}.$$
Because both systems have the same α V and the same α P , they can achieve the same α θ . However, the platform with the greater α F has a smaller k. With k 1 and k 2 below k max , a smaller k implies using fewer hardware resources. If a platform's k is greater than k max , that platform cannot consume its entire power budget before exhausting its resources. The conclusion is that, assuming α V stays constant, a higher α F is better.
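As a small worked illustration of this comparison, the sketch below computes how many units a second platform needs to match a reference platform's power-scaling ratio when both are scaled to the same α V. It assumes switching power only and ideal scaling, so equal α P requires equal products of k and α F. The function name `matching_k` and the numbers are hypothetical and are not the big.LITTLE measurements.

```python
def matching_k(k_ref, alpha_f_ref, alpha_f_other):
    """Units the other platform needs to match the reference platform's alpha_P.

    Assumes both platforms are scaled to the same alpha_V, so equal alpha_P
    requires k_ref * alpha_f_ref == k_other * alpha_f_other.
    """
    return k_ref * alpha_f_ref / alpha_f_other


# Hypothetical platforms scaled to the same alpha_V.
k_other = matching_k(k_ref=3.0, alpha_f_ref=0.8, alpha_f_other=0.6)
print(f"k_other = {k_other:.2f}")
```

In this example the platform with the lower α F needs more units to reach the same power and throughput ratios, so it reaches a hardware limit k max sooner.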
We confirmed this result by studying the A15 and A7 cores in an ARM big.LITTLE system. They both allow scaling the voltage down from their maximum V with a ratio of α V ≈ 0.80. At this point, the A7 cores have α F ≈ 0.57 and the A15 cores have α F ≈ 0.70. With k max = 4.00 for both core blocks, neither can use its entire power budget. But with the A15 scaled to k = 4.00, to get α P,A15 = α P,A7 , we must scale A7 to k = 3.60. The interpolated experimental data shows that (θ 4.00,A15 )/(θ 3.60,A7 ) = 0.70/0.57 ≈ 1.25.
Scaling along different TRLs to the same α F
We now compare two systems with different TRLs being scaled to the same frequency-scaling ratio α F , as shown in Figure 4b with both systems scaled to α F1 . In practice, this is related to systems that cannot scale below certain frequency values. In this case,
$$\alpha_{P1} = k_1 \alpha_{V1}^{2} \alpha_F, \qquad \alpha_{P2} = k_2 \alpha_{V2}^{2} \alpha_F.$$
When scaling to the same power-scaling ratio, we have
$$\frac{k_1}{k_2} = \frac{\alpha_{V2}^{2}}{\alpha_{V1}^{2}}.$$
This means that the system with the greater α V will have a smaller k. Because both systems have the same α F , this leads to a smaller α θ and smaller throughput gain (that is, the smaller α V is, the better). When scaling to the same throughput-scaling ratio α θ , we have
$$\frac{\alpha_{P1}}{\alpha_{P2}} = \frac{\alpha_{V1}^{2}}{\alpha_{V2}^{2}}. \tag{7}$$
This means that the system with the greater α V will have a greater α P and therefore show more power dissipation after k-scaling. So, the smaller α V is, the better.
To verify this observation, we compared the A15 and A7 experimental data. Both core blocks allow scaling from 1,400 MHz to 800 MHz, but the A7 block gives a smaller α V (0.815 versus 0.881). The power advantage of this (predicted by Equation 7) was confirmed approximately from the experimental data with an error of 6 percent.
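The size of the predicted power advantage can be estimated from the two voltage-scaling ratios quoted above. The short sketch below assumes switching power only, equal α F, and equal α θ (hence equal k), under which the ratio of the two power-scaling ratios equals the square of the ratio of the voltage-scaling ratios. The variable names are illustrative.

```python
alpha_v_a7 = 0.815    # voltage-scaling ratio quoted for the A7 block
alpha_v_a15 = 0.881   # voltage-scaling ratio quoted for the A15 block

# With equal alpha_F and equal k, switching power scales with alpha_V squared,
# so the ratio of the two power-scaling ratios is the squared voltage ratio.
power_ratio = (alpha_v_a7 / alpha_v_a15) ** 2
saving_percent = (1.0 - power_ratio) * 100.0

print(f"alpha_P(A7) / alpha_P(A15) = {power_ratio:.3f}")
print(f"predicted relative power saving = {saving_percent:.1f} percent")
```

Under these assumptions the prediction is a power-scaling ratio for the A7 block of roughly 0.86 times that of the A15 block.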
If we combine the findings from these investigations, considering only switching power and ideal scaling and from the point of view of extracting either power or performance benefits from k-scaling, we conclude that scaling should be done to the point at which α V is as small as possible and α F is as large as possible. The latter requirement means we should always scale along the TRL for any given system.
Nonideal scaling and heterogeneity
In real-world scenarios, ideal scaling with θ k = kθ is almost never achievable. A software execution might not be entirely parallelizable, and manycore hardware might suffer from a number of bottlenecks, most notably shared-memory and communication overheads.
To find the actual throughput, we use this speedup function:
$$\theta_k = S(k)\,\theta. \tag{8}$$
Substituting Equation 8 for θ in the ideal-scaling equations in the previous sections expands them to cover general execution cases. Figure 5 shows three models for S(k). 3 Amdahl's law (see Figure 5a) computes the speedup with k cores, assuming a fixed-size workload. The parallelization factor (p) is the fraction of the workload executed in parallel; p = 1 is the ideal-scaling case. The law is famous for predicting that even a small drop in p causes the throughput to quickly saturate. 3 John Gustafson argued that speedup can scale linearly if the workload size increases with the number of cores, with the parallelizable portion increasing but the sequential portion staying the same (see Figure 5b). 3 Xian-He Sun and Lionel Ni expanded this idea toward a general metric g(k), showing how an algorithm's memory requirement scales relative to the computation requirement (see Figure 5c). Sun and Ni confirmed that for g(k) ≥ k, linear or better-than-linear many-core scaling is achievable. Ideally, p should be a property of the algorithm. However, real-life devices also affect p owing to hardware-specific critical sections. We can use performance profiling to characterize nonideal scaling (see Table 1).
Figure 5. Three models for S(k): (a) Amdahl's law, (b) Gustafson's model, and (c) Sun and Ni's model. S(k) is the speedup function, p is the parallelization factor, and g(k) is a general metric showing how an algorithm's memory requirement scales relative to the computation requirement.
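The three speedup models can be written compactly in code. The sketch below uses the commonly cited textbook forms of Amdahl's, Gustafson's, and Sun and Ni's laws, which may differ in notation from Figure 5. The function names and the example choice g(k) = k^1.5 are illustrative assumptions.

```python
def amdahl(k, p):
    """Fixed-size workload: S(k) = 1 / ((1 - p) + p / k)."""
    return 1.0 / ((1.0 - p) + p / k)


def gustafson(k, p):
    """Workload grows with k: S(k) = (1 - p) + p * k."""
    return (1.0 - p) + p * k


def sun_ni(k, p, g):
    """Memory-bounded: S(k) = ((1 - p) + p * g(k)) / ((1 - p) + p * g(k) / k)."""
    gk = g(k)
    return ((1.0 - p) + p * gk) / ((1.0 - p) + p * gk / k)


p = 0.9  # illustrative parallelization factor
for k in (1, 2, 4, 8, 16):
    print(
        k,
        round(amdahl(k, p), 2),
        round(gustafson(k, p), 2),
        round(sun_ni(k, p, lambda n: n ** 1.5), 2),
    )
```

In this form, Sun and Ni's model reduces to Amdahl's law for g(k) = 1 and to Gustafson's model for g(k) = k, and it exceeds Gustafson's linear scaling when g(k) grows faster than k.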
For systems with heterogeneous computation units (for example, the ARM big.LITTLE with different core types), k-scaling becomes a multidimensional optimization with the vector K = 〈k 1 , …, k X 〉 for X types of cores. 3

THE PER MODELING TOOL
The interplay models we have presented led to the development of our tool (see Figure 6 ), which is useful for reasoning in the ROR.
Idle-power consideration
In real systems, the power budget usually must cover not only switching power but also leakage power. To complicate the issue, not all switching in a system is attributable to any particular computation we want to study. For example, the power used by an OS that stays relatively constant no matter what application computation executes is switching power, but not related to any particular computation.
It is sometimes more convenient to group leakage power and any extra switching power not directly related to the computation as idle power, which affects the power budget as shown in Figure 7 . In this view,
where P comp is the power a single core uses for computation and P idle is the idle power. We can obtain P comp and P idle through the curve fitting of experimental data.
Tool description
Scaling and binormalized analytical models that include P comp and P idle can be derived similarly to the approach we described in the section "How the TRL Influences Throughput and Power." However, such models tend to be large and impractical. A practical solution is to determine the optimal operating point using discrete numerical solutions, for example, searching through a limited number of fixed DVFS points and integer k values. Our tool employs this method.
Figure 6 (workflow labels). Characterization data from experiments and simulations; in the tool: specify the power or throughput constraints; specify the scaling law (for example, Amdahl's); the tool plots a PER diagram; the tool finds the optimal operating point (V, F, or the number of cores).
Figure 7 (panels a and b; labels: DVFS points, computation power, idle power).
The tool is equipped with experimentally measured data from the platforms we mentioned in the section "Frequency vs. Voltage in Practical Systems." Users can also provide their own data in the CSV (comma-separated values) format.
The tool plots the data in the θ-V space and solves one of these problems:
› For the user-specified throughput requirement θ k , find k, V k , and F k such that P k is minimal while the total throughput still satisfies θ k .
› For the user-specified P max , find k, V k , and F k such that θ k is maximal while the total power stays under P max .
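The two optimization problems above can be approached with an exhaustive search over fixed DVFS points and integer k values. The following is a minimal sketch of such a search and should not be read as the tool's actual implementation. The function names, the power model (k times per-core computation power plus a constant idle power), the Amdahl parameter, and all data values are assumptions made for illustration.

```python
def amdahl(k, p=0.95):
    """Amdahl speedup with an assumed parallelization factor p."""
    return 1.0 / ((1.0 - p) + p / k)


def candidates(points, k_max, speedup, u=1.0, p_idle=0.0):
    """Enumerate every (DVFS point, k) pair as (V, F, k, theta_k, P_k).

    Assumed models: theta_k = S(k) * u * F and P_k = k * P_comp + P_idle.
    """
    for v, f, p_comp in points:
        for k in range(1, k_max + 1):
            yield v, f, k, speedup(k) * u * f, k * p_comp + p_idle


def max_throughput(points, k_max, speedup, p_max, **kw):
    """Highest-throughput candidate whose total power stays within p_max."""
    ok = [c for c in candidates(points, k_max, speedup, **kw) if c[4] <= p_max]
    return max(ok, key=lambda c: c[3]) if ok else None


def min_power(points, k_max, speedup, theta_req, **kw):
    """Lowest-power candidate whose throughput meets theta_req."""
    ok = [c for c in candidates(points, k_max, speedup, **kw) if c[3] >= theta_req]
    return min(ok, key=lambda c: c[4]) if ok else None


# Hypothetical DVFS table: (volts, hertz, per-core computation power in watts).
dvfs = [(0.9, 0.6e9, 0.15), (1.0, 1.0e9, 0.35), (1.1, 1.4e9, 0.70), (1.2, 1.8e9, 1.20)]

print(max_throughput(dvfs, k_max=4, speedup=amdahl, p_max=2.0, p_idle=0.2))
print(min_power(dvfs, k_max=4, speedup=amdahl, theta_req=2.0e9, p_idle=0.2))
```

With this made-up table, both searches select four cores at a lower-voltage DVFS point rather than fewer cores at the highest voltage, which matches the qualitative conclusions of the analytical models.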
The tool supports the three models of nonideal scaling and allows the tweaking of all relevant parameters, such as p.
Our tool is available at www.async.org.uk/prime/PER/per.html. A video about the tool and the main engineering principles we described in this article can be found at www.computer.org/computer-multimedia.
We studied the effectiveness of DVFS together with parallelization scaling using power-normalized performance as an example criterion. Other criteria exist such as the energy-delay product, which has additional emphasis on throughput. The effectiveness metric η is different for different criteria. However, the method of comparing a criterion after k-scaling and DVFS to its previous value in the context of the binormalized F-V space is generally applicable. We plan to extend our results to other energy-performance criteria leveraging the binormalization method. 
