Estimation-theoretic framework for robust and energy-efficient system design by Narayanan, Sriram
© 2010 Sriram Narayanan
ESTIMATION-THEORETIC FRAMEWORK FOR ROBUST AND
ENERGY-EFFICIENT SYSTEM DESIGN
BY
SRIRAM NARAYANAN
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2010
Urbana, Illinois
Doctoral Committee:
Professor Douglas L. Jones, Chair
Assistant Professor Rakesh Kumar
Professor Naresh R. Shanbhag
Professor Venugopal V. Veeravalli
ABSTRACT
A fundamental hurdle to realizing the exciting future applications of
embedded computing is the lack of an adequate power supply. Unlike the
exponential growth in computing capability, the improvements in power
sources have been lackluster. Technology scaling, driven by Moore’s law,
has produced smaller devices that can operate on lower supply voltages;
but as a side effect, nanoscale devices are becoming increasingly unreliable.
The resulting increase in transistor density further exacerbates the power
problem. Therefore, the computing industry faces a pressing need to
aggressively reduce power consumption and efficiently address error
resiliency.
Conventional approaches to error resiliency using redundant
computations have incurred the associated overheads of power and silicon
area. Traditional power reduction techniques scale supply voltage or clock
frequency to adapt to changing demands of the application, while being
limited to ranges where computation is free of error. Addressing in isolation
the related problems of power reduction and error tolerance may fail to
produce the gains required by future systems. It may be desirable to allow
occasional hardware errors for the sake of power savings; however, this
trade-off must be done without adversely impacting the end-user
experience.
Many applications in signal processing, communications, and multimedia
already allow several forms of noise, such as additive environmental noise,
interference, and quantization. This research views hardware error as a new
source of noise that is analogous to traditional forms of noise. In so doing,
it enables dynamically trading off reliability for power savings while
meeting application performance requirements.
Our estimation-theoretic framework is a mathematical formalization that
allows us to state system-on-chip (SoC) design problems as constrained
optimization problems. The engineering constraints, such as hardware
availability and cost, are explicitly captured as design constraints. By
accounting for application-level performance requirements, the framework
provides a notion of power, reliability, and performance optimality of the
design. The mathematical abstraction of the framework results in different
particular design techniques depending on the nature of the application.
We have identified four classes on the basis of these design techniques, and
described applications typical of each class.
For parallel and heterogeneous systems, an estimation-theoretic redesign
resulted in a 30%–40% power reduction in wireless and video systems. The
application-awareness characteristic of estimation-theoretic SoC design can
also be adopted in designing general-purpose processors. By exposing
architectural diversity and controlled hardware errors in logic, the
stochastic processor proposed here allows dynamic power reduction of about
20%–60% in the motion-estimation block of a video communication system.
In addressing power/reliability problems of general parallel SoCs, we have
also identified an important robust estimation problem that has remained
largely unaddressed within the robust statistics community. To address this
need, new methods for robust estimation with correlated observations were
developed that could be applicable to more general estimation problems.
To my parents, for their love and support
ACKNOWLEDGMENTS
I would like to express my gratitude to my adviser, Professor Douglas L.
Jones, for his unwavering support and encouragement over the years. I have
been additionally fortunate to receive mentorship and guidance from
Professor Naresh R. Shanbhag and Professor Rakesh Kumar. I am grateful
to Professor Venugopal V. Veeravalli for his insightful feedback on this
work. This work would not have been possible without close collaborations
with Dr. Girish Varatkar, Galen A. Lyle, and John Sartori.
I would like to thank Karan Bhatia, Kiran Lakkaraju, Roberto Lavarello,
Pavan Sannuti, and Shreyas Sundaram for their friendship and
participation in numerous, often tangential, and extrapolated discussions.
I’m deeply thankful to my parents and my sister Anu for their love,
support, and encouragement. Many thanks go to my wife Shyama for her
understanding and patience.
This work was sponsored by the Gigascale Systems Research Center
(GSRC), one of five research centers funded under the Focus Center
Research Program, which is a Semiconductor Research Corporation
program, and Texas Instruments, Inc. Special thanks are due to the
excellent technical support provided by the Coordinated Science
Laboratory and, in particular, Paritosh Garg at the Image Formation and
Processing group IT services. Finally, I would like to thank the ECE
Publications Office for their help with proofreading this document.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 Related Work
1.2 Dissertation Organization
CHAPTER 2 ESTIMATION THEORY FOR COMPUTATION
2.1 Estimation Theory for Computation
2.2 Computation Cost Function
2.3 Architectural Design Space and Constraints
2.4 Canonical Problems of the Estimation-Theoretic Framework
2.5 Applications of the Estimation-Theoretic Design Framework
CHAPTER 3 CLASS I: STATISTICALLY SIMILAR PARALLEL SYSTEMS
3.1 Class Overview
3.2 Application: PN-Code Correlator
3.3 System and Noise Model
3.4 Estimation-Theoretic Design
3.5 Solution Method: Robust Statistics
3.6 Hardware Implementation
3.7 Simulation Results
3.8 Conclusion
CHAPTER 4 CLASS II: GENERAL PARALLEL SYSTEMS
4.1 Class Overview
4.2 Application: Multiantenna Communication Receiver
4.3 System and Noise Model
4.4 Estimation-Theoretic Design
4.5 Solution Method
4.6 Results: Symbol Estimation
4.7 Future Work
4.8 Conclusion
CHAPTER 5 CLASS III: HETEROGENEOUS SYSTEMS
5.1 Class Overview
5.2 Application: Mobile Video Communication Systems
5.3 System and Noise Model
5.4 Estimation-Theoretic Design
5.5 Solution Method
5.6 Results: Power Savings
5.7 Conclusion
CHAPTER 6 CLASS IV: REDUNDANCY-AIDED SYSTEMS
6.1 Class Overview
6.2 Application: Word-Length Optimized ANT Systems
6.3 System and Noise Model
6.4 Estimation-Theoretic Design
6.5 Solution Method
6.6 Results: Optimized Word Lengths
6.7 Conclusion
CHAPTER 7 SCALABLE STOCHASTIC PROCESSOR
7.1 Introduction
7.2 Scalable Architectures
7.3 Functional-Unit Architectures
7.4 Stochastic Applications
7.5 Conclusion
CHAPTER 8 CONCLUSIONS
8.1 Future Work
APPENDIX A PROOF OF THE EQUIPARTITION THEOREM
REFERENCES
LIST OF TABLES
3.1 Gate-level complexity of various components of the robust
PN acquisition system for M = 8 and 13-bit filter-bank outputs.
4.1 Example observation with outliers. The second, third, and
eighth observations contain gross outliers.
4.2 The correlation matrix was chosen to include some highly
correlated observations (i.e., 1–9) and some observations
that are uncorrelated (i.e., 10–16). This was meant to
study the effectiveness of our algorithm in both extreme
scenarios.
4.3 The variance of the estimate for the different methods.
6.1 Optimal word-length choices for an ANT system resulting
from the estimation-theoretic design.
LIST OF FIGURES
1.1 As the supply voltage is decreased, the gate delays of the
combinational logic elements increase. This delay increase
depletes any design slack time. The timing diagram shows
a negative slack time for the critical path. A timing error
occurs whenever this path is exercised.
2.1 (a) Traditional NMR systems replicate computation a number
of times and discard erroneous outputs using a majority
voter. (b) ANT systems employ a lower-complexity estimator
in place of replicated computation. (c) Our novel
view of complex SoCs identifies heterogeneous subsystems,
parallel subsystems (main computation decomposed into a
set of estimators), and subsystems with explicitly built-in
redundancy blocks (similar to ANT). In this way, this approach
generalizes previous error-tolerance mechanisms to
fully exploit statistical correlations present within the SoC.
3.1 Class I systems benefit from multiple parallel estimates.
(a) shows the desired system that is free of any computational
errors. (b) shows a system that is prone to errors
due to statistical variations. (c) shows a parallelized
implementation of (b) that takes advantage of multiple estimates.
These estimates are fused to produce a robust final
output.
3.2 Polyphase decomposition of the matched filter yields multiple
statistically similar estimates that are fused to obtain
a robust result. (a) Conventional matched filter sums the
sensor outputs. (b) Robust matched filter controls the
influence of erroneous sensor outputs.
3.3 For ε > 0, the mixture distribution is no longer Gaussian.
3.4 The 16-bit ripple-carry adder has a relatively graceful
increase in the probability of error as (a) the voltage
overscaling factor is increased, or (b) the standard deviation
of the gate delay under process variations is increased.
3.5 (a) For ε = 0, the ψ function corresponding to the
maximum-likelihood estimate is linear. (b) For ε > 0, the ψ function
for the maximum-likelihood estimate is a nonlinear function
that has a clipping effect.
3.6 Matched filter for PN acquisition: (a) direct form and (b)
an SNC-based architecture.
3.7 The one-step Huber fusion-block architecture.
3.8 Simulation setup showing (a) various process and voltage
conditions for the process variations scenario and the VOS
scenario and (b) normalized delay distributions of various
gates for a 3σ_g slow die with variations at V_dd = 1.2 V.
3.9 ROCs of the PN-code detector at K_vos = 0.75 for (a) M = 8
and (b) M = 32 show that the SNC-based architecture
offers better P_Det at all values of P_F.
3.10 For approximately equal P_Det, the SNC-based architecture
reduces power consumption by 36% with M = 8 in (a) and
by 34% with M = 32 in (b).
3.11 Under process variations, the ROC plots in (a) for M = 8
and (c) for M = 32 show that the SNC-based architecture
offers better P_Det at all values of P_F. The histograms of
P_Det for a fixed P_F in (b) and (d) show an improvement in
the mean P_Det by approximately three orders of magnitude
in (b) and by approximately one order of magnitude in (d).
3.12 The SNC-based architecture reduces power consumption
by 31% for M = 8 in (a), and by 39% for M = 32 in (b)
at a fixed P_F = 0.05.
4.1 The r receive antennas exploit spatial multiplexing of t
simultaneous transmissions. The observation vector {y_1, ..., y_r}
contains noise that is usually modeled as a multivariate
Gaussian random vector.
4.2 The robustified cost function (a) has a minimum closer
to the underlying parameter (10) than (b) the Huber cost
function that ignores correlation information, and (c) the
traditional nonrobust least-squares cost function.
4.3 The initial estimate (e.g., sample median) is used to
robustify the covariance matrix. This matrix is used as a fixed
approximation to the actual robust covariance matrix, Σ̃.
4.4 Random variables drawn from a large-variance Gaussian
distribution represent outliers. With probability ε an outlier
is drawn according to this model.
4.5 Convergence to within 1e-3 of the result of the iterative
method usually occurs in fewer than two steps. The trial counts
on top of the bars are included for readability.
4.6 Multiple antenna receiver provides r observations of the
desired symbol. The weights {w_1, ..., w_r} are computed
by accounting for the correlation among the noise terms.
4.7 In this fusion block, the median is used as a robust initial
estimate. An update to the median is computed by
robustifying the observations and robustifying the covariance
matrix.
4.8 With increasing voltage overscaling (top-left to bottom-right),
the timing errors become more frequent. The bit
errors also begin to occur in locations with lower significance.
4.9 A comparison of the various robust techniques for correlated
observations. The technique developed here outperforms
other known methods by a factor of 18 in some
voltage-overscaling regimes.
5.1 Class III systems contain heterogeneous subsystems.
Computational errors in the different subsystems affect the overall
system performance by different amounts.
5.2 This wireless video communication system consists of a
video encoder (A), a Reed-Solomon channel encoder (B),
and a wireless transmitter (C). These subsystems exhibit
different power/performance trade-offs, and computational
errors in these subsystems impact the overall system
performance by different amounts.
5.3 Encoder output rate grows as the motion-estimation supply
voltage is scaled.
5.4 The power consumed by the video encoder, the channel
encoder, and the wireless transmitter, and the total system power
consumption, are compared in (a), (b), (c), and
(d), respectively. Relative power allocation for the different
subsystems varies depending on the range of communication.
6.1 Class IV systems contain built-in redundancy. The decision
block computes the final output based on outputs of the
main block and the redundancy block. The redundancy
block may be designed to suffer fewer computational errors
by compromising fidelity (e.g., precision).
7.1 The scalable architecture introduces alternative functional
units at three levels. In (a), all the functional units of the
core are replaced by scaling-friendly versions; (b) shows
two different functional-unit architectures that can be
selectively used; and (c) shows a reliability-defined heterogeneous
multi-core system.
7.2 (a) The RCA allows power/reliability trade-offs so that
power is reduced as the error rate is allowed to increase. (b)
The KSA, on the other hand, consumes less power for reliable
operation, but does not allow power/reliability trade-offs.
7.3 Razor error recovery can provide some power savings for
gracefully failing designs (RCA) after the point of first error.
However, these benefits are limited, since only a small
number of errors can be gainfully tolerated before recovery
overhead outweighs voltage-scaling power reduction. Note
that the quantity on the x axis is the error rate prior to
recovery.
7.4 Module selection based on target error rate allows for
power-efficient operation in both reliable and unreliable operational
phases. In (a), the power consumption vs. error-rate
profiles for the two modules are very different when no error
correction is applied. In (b), an error-correction mechanism
such as Razor is applied, and the resulting power
consumption of the two modules is shown as a function of
the error rate prior to correction.
7.5 The RCA is able to significantly lower the power consumed
(per adder) without compromising bit rate of the output.
But the KSA is able to offer about 20% lower power consumption
when no bit-rate degradation can be allowed.
CHAPTER 1
INTRODUCTION
The last several decades have witnessed a dramatic increase in computing
capabilities. Moore’s law, postulated in 1965 and revised in 1975, predicts that the number of
transistors in digital integrated circuits (ICs) doubles approximately
every two years. The ability to simultaneously scale transistor feature sizes,
switching speed, and power dissipation [1] has been responsible for this
exponential growth.
After more than 40 years of following this trend, the semiconductor
industry is now faced with some significant challenges. Battery capacity has
not improved enough to meet the aggressive demands of modern IC
systems. Power reduction remains an issue even in mains-powered systems
because of cooling and packaging costs associated with highly dense IC
systems. We can continue to reap the benefits of Moore’s law only if we are
able to address the associated power challenges.
Moreover, continued technology scaling has caused several reliability
issues. With shrinking feature sizes, process and environmental variations
are beginning to have an increased impact. Process variations are the
fluctuations of the parameters of a transistor, such as its dimensions and
threshold voltage, across the different parts of a chip. These fluctuations
result from variations in the manufacturing process parameters (such as
impurity concentration densities and oxide thicknesses) and limited
resolution of the photolithographic process [2]. Process variations are
present both from one die to another and within a die, making it difficult to
accurately create statistical models. Environmental variations refer to the
changing ambient conditions to which a chip in operation is exposed.
Changes in temperature, fluctuations in power supply, and variability in the
activity factor, all contribute to environmental variations.
Figure 1.1: As the supply voltage is decreased, the gate delays of the
combinational logic elements increase. This delay increase depletes any
design slack time. The timing diagram shows a negative slack time for the
critical path. A timing error occurs whenever this path is exercised.
Variations can cause actual gate delays to differ from the design
specifications. The delays of the constituent gates greatly influence the
choice of clock frequency when designing sequential circuits. Deviations of
gate delays from specification can result in violation of timing constraints.
Figure 1.1 illustrates how an increase in gate delays can result in timing
violations in critical paths. Since commonly implemented architectures
perform computations in a least-significant-bit-first manner, timing-related
errors tend to occur in the most significant bits. This will cause the
computation result to be grossly incorrect. Because they depend heavily on
the particular logic path that is exercised, timing errors depend on the
inputs.
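The input dependence and the large magnitude of timing errors can be illustrated with a minimal Python sketch of a ripple-carry adder. This is an assumed toy model, not the simulation setup used in this dissertation: the hypothetical parameter max_carry_chain stands in for the clock period, and a carry that has rippled through more stages than this limit is assumed to arrive too late and is treated as zero.

```python
def ripple_carry_add(a, b, width=16, max_carry_chain=None):
    """Add two unsigned integers bit by bit, least significant bit first.

    max_carry_chain is a hypothetical stand-in for the clock period: a
    carry that has rippled through more than this many stages is assumed
    to miss the clock edge and is seen as 0 by the later bit positions.
    With max_carry_chain=None the adder is error free.
    """
    result, carry, age = 0, 0, 0  # age = stages the live carry has crossed
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        if max_carry_chain is not None and age > max_carry_chain:
            carry, age = 0, 0  # the late carry is missed (timing error)
        result |= (x ^ y ^ carry) << i
        if x & y:                   # carry generated at this position
            carry, age = 1, 1
        elif (x ^ y) and carry:     # carry propagated one more stage
            age += 1
        else:                       # carry absorbed or absent
            carry, age = 0, 0
    return result


# A long carry chain is exercised: the missed carry corrupts high-order bits.
exact = ripple_carry_add(0x7FFF, 0x0001)
faulty = ripple_carry_add(0x7FFF, 0x0001, max_carry_chain=12)
print(hex(exact), hex(faulty), exact - faulty)

# A short carry chain under the same limit: no error occurs.
print(ripple_carry_add(0x0003, 0x0001, max_carry_chain=12))
```

In this model an error appears only when the inputs exercise a carry chain longer than the limit, and the resulting error has a weight of at least 2^(max_carry_chain + 1), which is why such errors are large rather than small perturbations.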
High transistor densities and low supply voltages have also increased the
rate of soft errors induced by atmospheric particle strikes [3]. Unlike timing
errors, soft errors are input independent, and may be modeled as random
bit-flips. Tolerating soft errors is another challenge of nanometer system
design.
The reliability issues discussed above relate to silicon-based devices.
Research in new materials and device technologies has yielded some
promising alternatives to silicon-based devices. As an example, carbon
nanotubes have emerged as a potential alternative device fabric. However,
current research indicates that carbon-based devices will also be prone to
many reliability issues [4]. Therefore, it will be important to address
reliability issues even in these future technologies.
The traditional approach to mitigating the impact of variations has been
to design for the worst-case corner and allocate design guard-bands by
overprovisioning the clock period or the supply voltage. As the magnitude of
variations continues to increase, this pessimistic approach is becoming
unaffordable.
The future of computing relies on efficiently addressing these power and
reliability problems. It is important to realize that power and reliability are
closely coupled issues. In the absence of a power budget, arbitrarily high
reliability can be achieved by applying fault-tolerance techniques at several
levels, ranging from the circuit fabric to the architecture and the
application. In fact, many mission-critical applications routinely adopt this
strategy. Likewise, low power consumption can be achieved by
compromising system reliability. An example of this trade-off is aggressive
voltage or technology scaling, which can result in timing errors. Conversely,
operating slow transistors (caused by process variations) at a higher supply
voltage can alleviate their delay issues at the cost of power. By jointly
addressing power and reliability, the unifying framework presented in this
dissertation allows the system designer to strike optimal trade-offs.
The work presented in this dissertation applies to classes of applications
common in signal processing, wireless communication, and multimedia.
Applications in these classes are characterized by the presence of
measurement and environmental noise, and therefore provide an
opportunity to additionally allow hardware errors. The design philosophy
adopted here is to use existing computational blocks and develop
post-processing methods to overcome reliability issues. The methods of this
dissertation introduce few modifications to the computational blocks of the
conventional system and add no overhead. They either modify
the post-processing blocks that already exist in the conventional design, or
expose the application to hardware errors. We consider this a strength
of the framework: because it operates on existing computational blocks, it is
not limited to any particular error-resilient architecture, such as Razor [5],
BISER [6], or ANT [7]. This architecture neutrality means that the benefits
highlighted here can be extended to new systems by simply characterizing
the power/reliability trade-offs of the subcomponents. The framework
presented in this dissertation allows the designer to globally optimize over a
number of possible architectural choices.
Another aspect of the generality of this work is related to the source of
hardware errors. Voltage overscaling is often used here as a means of
altering the power/reliability behavior of various architectural blocks,
typically leading to timing errors, as shown in Figure 1.1. The system-level
techniques developed here are not limited to any particular source of timing
errors, whether process variations, clock and supply-voltage variations, or
aggressive scaling.
This work is distinct from other research in fault-tolerant computing in
an important way. This work demonstrates the usefulness of exposing
timing errors occurring in hardware to several important classes of
applications. In contrast, traditional fault-tolerant computing seeks to
prevent timing errors from percolating up to the application.
1.1 Related Work
Error-resilient computing has long been a topic of extensive research [8].
However, the dual problems of particularly severe power constraints and
variations-related reliability issues are unique to the present decade. As
such, traditional methods tend not to apply well to this context. This
section surveys prior work in reliability and power management and
presents a case for new approaches.
1.1.1 Dynamic voltage/frequency scaling
Traditional dynamic voltage/frequency scaling (DVFS) techniques offer
power savings by adjusting the supply voltage or clock frequency according
to computational demand [9]. DVFS research exploits the fact that the
supply voltage can be reduced by small amounts without
excessively increasing gate delays.
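As a rough illustration of this trade-off, the following Python sketch uses two first-order textbook models that are assumptions here and are not taken from this dissertation: dynamic power proportional to C·V²·f, and the alpha-power-law gate delay proportional to V/(V − V_th)^α. All parameter values (effective capacitance, threshold voltage, α) are hypothetical.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order dynamic power model: activity * C_eff * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq


def gate_delay(vdd, vth=0.3, alpha=1.3, k=1.0):
    """Alpha-power-law delay model: k * Vdd / (Vdd - Vth)^alpha.

    vth, alpha, and k are hypothetical values chosen for illustration.
    """
    if vdd <= vth:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * vdd / (vdd - vth) ** alpha


# Scale the supply from a nominal 1.2 V down to 1.0 V at a fixed clock.
nominal, scaled = 1.2, 1.0
power_ratio = (dynamic_power(1e-9, scaled, 1e9)
               / dynamic_power(1e-9, nominal, 1e9))
delay_ratio = gate_delay(scaled) / gate_delay(nominal)
print(f"power ratio: {power_ratio:.3f}, delay ratio: {delay_ratio:.3f}")
```

Under these assumed models the power falls quadratically with the supply voltage while the delay grows more slowly, so modest scaling saves power as long as the longer delays still fit within the clock period.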
The work in [10] addresses the problem of selecting operating parameters
such as supply voltage and adaptive body bias to minimize system energy
consumption. But the power and performance constraints of modern IC
systems may require scaling close to critical limits where occasional timing
errors may occur.
1.1.2 Circuit-level techniques
Mechanisms to tolerate errors can operate at the circuit level, the
architecture level, or the system level. Circuit-level approaches are
independent of particular architectures and applications and offer a reliable
computing platform. The design approach in [11] is based on neural
networks and is able to tolerate multiple, simultaneous design failures. The
probabilistic computation approach in [12] proposes using highly unreliable
devices by applying the theory of Markov random fields. The work in [13]
seeks to benefit from the statistical behavior of gates by mapping
probabilistic algorithms onto them.
These methods suffer from significant area overhead that may have the
effect of undoing the benefits of Moore’s law.
1.1.3 Architecture-level techniques
Architecture-level techniques seek to benefit from timing slack that may be
available in some logic paths. The most prominent among these is the
concept of better-than-worst-case design [14, 15]. The task of checking the
correctness of computations is assigned to a separate checker block that is
designed to be small, error free, and capable of operating as fast as the
main block. In [5], this design philosophy is applied by operating a set of
parallel latches, called shadow latches, at a slight delay offset from the main
latches. This creates an opportunity to detect and handle timing errors.
The work in [16] uses a priori characterization of process variations to
develop microarchitectural techniques to handle variations-related errors.
The heavy dependence on particular models for power and variations tends
to limit the benefits of these approaches. In addition, because these
techniques depend on errors, they are limited to regimes with moderate
error rates.
Designing architectures for soft errors has also been a topic of extensive
research [17]. Researchers in this area have proposed several device-level
and circuit-level techniques of modeling and mitigating soft errors, and they
have developed tools to analyze the vulnerability of architectures to these
errors.
Research on fault-tolerant architectures focuses on general-purpose
computing systems in which the application cannot allow any hardware
errors. But many signal processing, wireless communication, and
multimedia applications may be able to tolerate some amount of hardware
error, and exploiting this additional slack can result in better power and
reliability trade-offs.
1.1.4 System-level techniques
The algorithm-based fault-tolerance approach in [18] encodes the input
data, redesigns the algorithm to operate on the encoded data, and
distributes the computation among multiple units. This technique is
specific to matrix-based computations. By exploiting
statistics of the noise and hardware, the methods in this dissertation are
applicable to larger, statistically characterizable classes of applications.
More recently, the application-level methods in [19] use knowledge of the
application to selectively protect critical variables in a software
implementation. They derive the properties of these variables and use this
information to build checks for detection of errors. A runtime checker
(typically a hardware appendage) executes these checks. The work in this
dissertation is fundamentally different in that it demonstrates the ability of
broad classes of applications to allow hardware timing errors alongside
other sources of system or measurement noise. In many SoC contexts, the
fraction of critical variables as proposed in [19] may be large, and allowing
controlled amounts of timing errors in such applications may prove to be
more effective. In contrast to this related work on error correction, the
work in this dissertation seeks to allow moderate amounts of hardware
errors and jointly optimize with other sources of noise.
System-level error-tolerance approaches have the potential to minimize
correction overhead by exactly meeting the reliability needs of a given
application. The earliest approaches in this class include N-modular
redundancy systems. Voting among a set of repeated computations can
help overcome some types of random errors, such as soft errors. However,
this strategy may fail with input-dependent errors. Recent extensions to
this idea, termed soft NMR [20], use explicit knowledge of the statistics of
hardware timing errors to choose between the redundant computations.
Other extensions to traditional N-modular redundancy include fluid NMR
[21] techniques that dynamically reconfigure a multi-core processor to meet
changing power/reliability constraints of the application.
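The strength and the weakness of majority voting can be seen in a minimal Python sketch. The voter below is a generic illustration with hypothetical values, not an implementation from the cited works: it recovers from an independent error in one replica, but returns a wrong value when all replicas see the same input and fail in the same way.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no strict majority among replica outputs")
    return value


correct = 42

# Soft error: a random bit flip hits only one of three replicas.
soft_error_outputs = [correct, correct, correct ^ (1 << 2)]
print(majority_vote(soft_error_outputs))      # the voter recovers 42

# Input-dependent timing error: identical replicas given the same input
# exercise the same critical path and produce the same wrong value.
wrong = correct ^ (1 << 2)
timing_error_outputs = [wrong, wrong, wrong]
print(majority_vote(timing_error_outputs))    # the voter returns 46
```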
Despite these recent advancements, recomputation represents significant
overhead. The algorithmic noise tolerance (ANT) [7] techniques were an
early means of reducing recomputation overhead. Here a lower-complexity
estimator block is operated in parallel with the main block. The main block
is designed for the average case and is prone to errors, but the estimator
block remains immune to errors because of its lower complexity. Some
examples of lower-complexity estimator blocks include reduced-precision
replicas and linear predictors. A decision step compares the outputs of the
main block and the estimator block. If the main block is found to be in
error, the estimator’s output is used in its place. While ANT techniques
have reduced the amount of redundant computation, they have not
eliminated it. The direction pursued in this work is to investigate if the
elements of a given computation can themselves be used to gain sufficient
robustness to meet the application demands.
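The ANT decision step described above can be sketched as follows. This is only a minimal illustration: the function name `ant_select`, the scalar outputs, and the fixed threshold are assumptions made here for clarity and are not taken from [7].

```python
def ant_select(y_main, y_est, threshold):
    """Return the main block's output unless it disagrees with the
    estimator by more than `threshold`; otherwise fall back to the
    estimator's output. The threshold value is a hypothetical design knob."""
    if abs(y_main - y_est) > threshold:
        return y_est
    return y_main


# Hypothetical values: the estimator is a coarse replica of the main block.
small_deviation = ant_select(1003, 1000, threshold=16)   # main output kept
gross_error = ant_select(33768, 1000, threshold=16)      # estimator used
```

The sketch makes the residual redundancy visible: the estimator must still be computed for every output, which is the overhead this work seeks to avoid.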
1.2 Dissertation Organization
The work presented in this dissertation seeks to exploit statistical relations
that may already exist within a computational block. The ability to
integrate extremely high densities of devices on a chip has led to the
phenomenon of systems-on-chip (SoCs). The outputs of the subcomponents
of an SoC contribute to the final result. An efficient design will extract all
available statistical relationships among these subcomputations and avoid
explicit recomputation of the final result. A novel view of hardware errors
as a new source of noise enables us to draw from conventional estimation
theory to address this important power/reliability issue. Stating SoC design
problems in an estimation-theoretic framework also defines a notion of
optimal system design that earlier techniques such as [7] lacked. The
estimation-theoretic framework is a mathematical abstraction that helps
system designers to make use of powerful techniques from estimation
theory. The different general techniques that result from this framework
define various classes of applications. The specific classes presented here are
chosen to demonstrate some important methods and are not meant to be a
complete taxonomy of applications.
General-purpose embedded processors represent an important and
growing category of computing platforms. Their ease of programmability
makes them especially well suited for multimedia, wireless, and other
applications in which the standards and protocols change frequently.
Another research direction addressed in this dissertation is the applicability
of system-level error-tolerance approaches in general-purpose processors.
This dissertation is organized as follows:
• Chapter 2 presents the estimation-theoretic framework. It defines
the necessary elements of the framework that capture engineering
constraints such as power, cost, and impact of errors. The general
concepts presented in this chapter are not limited to any particular
class of applications.
• Chapter 3 addresses the class of statistically similar parallel
systems. The subcomponents of these systems are architecturally
identical to each other and to the overall system. A particular
limitation of these systems is that measurement noise present in the
subcomponent outputs must be independent. Examples of systems in
Class I include parallel implementations and single-instruction
multiple-data architectures.
• Chapter 4 extends the class of statistically similar systems to
include more general parallel systems in which the subcomponent
noise samples may be correlated. Relaxing this restriction of Class I
systems required new extensions to the methods of classical
parameter estimation theory. The estimation methods presented in
this chapter apply to the broader range of signal-processing
applications in Class II.
• Chapter 5 addresses estimation-theoretic design of systems
containing heterogeneous subcomponents. The methods presented
in Class I and Class II required multiple simultaneous estimates of the
final result. Heterogeneous systems of Class III, such as the mobile
video communication system presented here, do not offer such
estimates. The methods developed here exploit the different
sensitivities of the different sub-components to power reduction
techniques in order to minimize overall power consumption.
• Chapter 6 demonstrates the usefulness of the estimation-theoretic
framework in applications that may already contain explicit
redundancy. In particular, this chapter shows how the
estimation-theoretic framework applies to existing techniques such as
ANT.
• Chapter 7 presents a dynamically scalable processor that has the
ability to expose different functional units based on the changing
performance and reliability demands of the application.
• Chapter 8 provides a summary, concludes this dissertation, and
discusses important future extensions of this work.
CHAPTER 2
ESTIMATION THEORY FOR
COMPUTATION
Driven by Moore’s law, technology scaling has continued to increase the
density of integrated circuits (ICs) leading to the realization of
systems-on-chip (SoCs). Increased transistor density, limited improvements
in battery technology, and demand for high performance have made power
reduction an important concern. At the same time, scaling in the
nanometer regimes has introduced numerous sources of non-idealities such
as process variations and soft errors [3]. Reliability and power reduction are
often related problems because the ill effects of some of these non-idealities
may be overcome by overprovisioning power or operating at slower clock
frequencies. But such techniques are often unacceptable in applications
with severe power constraints. Therefore, we need novel approaches to
striking an optimal power/reliability trade-off.
Traditional dynamic voltage and frequency scaling techniques offer power
savings by adjusting the supply voltage or clock frequency according to
computational demand [9]. These methods exploit the fact that moderate
amounts of supply voltage scaling can be performed with only modest
controlled performance degradation in terms of increased gate delays.
Future SoCs demand more power reduction, and hence such error-avoidance
techniques are inadequate.
Error-tolerant approaches target more aggressive power savings and
robustness by allowing occasional hardware errors and introducing either
circuit-level or algorithm-level redundancy. Redundant residue number
systems provide circuit-level error tolerance [22] by representing an integer
by a redundant set of its remainders. Special modulo-arithmetic-based
error detection and correction circuitry exploits this redundancy to
overcome computational errors. The work in [22] applied this technique to
error-tolerant finite impulse response (FIR) filtering. The computational
and memory overhead of the error detection and correction subsystems is
often substantial. At the algorithm level, approaches such as N-modular
redundancy (shown in Figure 2.1(a)) impose substantial power, area, and
cost overheads.
Figure 2.1: (a) Traditional NMR systems replicate computation a number
of times and discard erroneous outputs using a majority voter. (b) ANT
systems employ a lower-complexity estimator in place of replicated
computation. (c) Our novel view of complex SoCs identifies heterogeneous
subsystems, parallel subsystems (main computation decomposed into a set
of estimators), and subsystems with explicitly built-in redundancy blocks
(similar to ANT). In this way, this approach generalizes previous
error-tolerance mechanisms to fully exploit statistical correlations present
within the SoC.
Recent approaches reduce system power consumption by designing for
the average case and employing special mechanisms to handle occasional
hardware errors. The better-than-worst-case design techniques [14]
introduce a lower-complexity checker block to recompute a main block’s
result. If the main block is found to be in error, either the result computed
by the checker block is used [23], or the system is rolled back to a
previously stored safe-state [5]. Algorithmic noise-tolerance (ANT) [24] is
another related average-case design technique; such systems (see
Figure 2.1(b)) exploit prior knowledge regarding the statistics of the inputs
and hardware errors. While ANT systems in [24, 25] do exploit statistics
and employ estimation theory for signal-processing kernels, they lack a
notion of system-level optimality. The estimation-theoretic framework
presented in this work defines a notion of optimal design for robust
systems. This mathematical formalization points the system designer to
optimal robust designs that fully exploit all available statistical information
regarding the application and the system.
In order to reap the full benefits of aggressive technology scaling, it is
desirable that modern SoCs are designed to tolerate the maximum number
of hardware errors and incur the least amount of overhead while
guaranteeing minimum acceptable performance specifications of the
application (e.g., bit-error rate). The framework presented in this chapter
allows the system designer to define and engineer optimal SoCs by fully
exploiting any available statistical information regarding the application
and nature of hardware errors.
The problem of maintaining system robustness in the presence of
component failures occurs in other contexts. Sensor networks have
traditionally been used to achieve robust estimation of physical phenomena,
even when some of the constituent nodes may fail. By allowing node-level
failures, sensor networks have proven to be energy efficient [26]. Estimation
algorithms that tolerate noisy measurements at the sensor nodes are the
enabling technology for sensor networks.
Complex SoCs typically comprise many smaller components, and one
may view these subsystems as collaborating nodes that collectively produce
the final output. The notion of on-chip networks is becoming increasingly
commonplace in the literature [27]. Viewing SoCs as networks leads to a
novel link to estimation theory and enables the system designer to identify
and exploit redundancy available within the system and to minimize or
eliminate the need for explicit redundant computation. Figure 2.1(c) shows
our novel view of modern SoCs as a network of subsystems. This network
view exposes new opportunities for improving system robustness. By
treating hardware errors as computational noise that is analogous to system
or measurement noise, we may apply results from estimation theory to
design robust SoCs. The proposed estimation-theoretic framework provides
the necessary formalization for practical design optimization.
This chapter is organized as follows. Section 2.1 presents an
estimation-theoretic view of computation that helps design systems that are
robust to hardware errors. Section 2.5 identifies three distinct classes of
applications that can benefit from this estimation-theoretic view.
Section 3.2, Section 5.2, and Section 6.2 each detail a specific application
example from one of the identified classes that demonstrates concrete
improvements in system robustness and power reduction: Section 3.2
presents a low-power and robust design of a PN code correlator commonly
used in spread-spectrum applications, Section 5.2 presents an optimized
end-to-end mobile video communication system, and Section 6.2 presents
an application-specific optimized design of an ANT system.
2.1 Estimation Theory for Computation
When viewing computation as a special case of estimation, soft errors and
hardware errors due to process variations and voltage/frequency scaling are
analogous to measurement or system noise. This analogy enables us to
leverage many well-known algorithms in estimation theory. Traditional
estimation theory deals with the problem of optimally determining an
underlying parameter or signal from a set of noisy measurements. In the
proposed view of computation, we treat the subsystems of a complex SoC as
providing noisy estimates of the overall computation. The problem is to
optimally find the final result based on these estimates. Our design
philosophy is to gain maximum robustness by using only these already
available subcomputation results. Traditional estimation theory
formalization is unconcerned with engineering constraints of arriving at an
estimate. However, in the present context, we are interested in the
complexity of the estimator and, therefore, require a non-trivial extension
of estimation theory to robust SoC design [28].
2.2 Computation Cost Function
Bayesian estimation theory seeks to minimize the average risk of
misestimating a signal or parameter, given a set of its noisy measurements.
In the context of SoC design, this translates to minimizing the risk of
miscomputing the final result based on a vector of error-prone
subcomputation results, $\vec{Y} = \{Y_1, \cdots, Y_N\}$. The cost function used to
compute this computational risk usually depends on the specific
application. For a given computation, let $\theta$ denote the desired result of
computation, and $\hat{\theta}(\vec{Y})$ denote the error-prone output of the SoC. The
application-specific cost is determined by $\theta$ and $\hat{\theta}$. The risk function is the
expectation of this cost function over the random variable $\theta$ and the
observed variables $\vec{Y}$.
Different computation problems may call for different cost functions.
Squared-error cost functions may be appropriate for the many
signal-processing applications in which it is common to minimize mean
squared error (MSE), as given by
\[ C[\hat{\theta}(\vec{Y}), \theta] = (\hat{\theta}(\vec{Y}) - \theta)^2 \]
General-purpose computing systems may follow a model in which only
errors up to some limit, $\Delta$, may be tolerated. For such systems, we suggest
the following 0-1 cost function:
\[ C[\hat{\theta}(\vec{Y}), \theta] =
\begin{cases}
1, & |\hat{\theta}(\vec{Y}) - \theta| \geq \Delta \\
0, & |\hat{\theta}(\vec{Y}) - \theta| < \Delta
\end{cases} \]
Another common cost function is the absolute error:
\[ C[\hat{\theta}(\vec{Y}), \theta] = |\hat{\theta}(\vec{Y}) - \theta| \]
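The three cost functions above, and the risk as their average, can be illustrated with a short sketch. The function names and the sample values are hypothetical; the empirical average over a finite set of (estimate, true value) pairs is only a stand-in for the expectation that defines the risk.

```python
def squared_error_cost(theta_hat, theta):
    # Squared-error cost, as used for MSE-oriented signal processing.
    return (theta_hat - theta) ** 2


def zero_one_cost(theta_hat, theta, delta):
    # 0-1 cost: errors of magnitude delta or more cost 1, smaller errors cost 0.
    return 1 if abs(theta_hat - theta) >= delta else 0


def absolute_error_cost(theta_hat, theta):
    # Absolute-error cost.
    return abs(theta_hat - theta)


def empirical_risk(cost, pairs):
    """Average `cost` over (theta_hat, theta) pairs; a finite-sample
    approximation of the expected cost (the risk)."""
    pairs = list(pairs)
    return sum(cost(th_hat, th) for th_hat, th in pairs) / len(pairs)


# Hypothetical outcomes of four computations whose desired result is 0.
pairs = [(1.0, 0.0), (0.0, 0.0), (-3.0, 0.0), (0.5, 0.0)]
mse = empirical_risk(squared_error_cost, pairs)
mae = empirical_risk(absolute_error_cost, pairs)
p_fail = empirical_risk(lambda a, b: zero_one_cost(a, b, delta=1.0), pairs)
```

Note how the single large error dominates the squared-error risk far more than the absolute-error risk, which is one reason the choice of cost function is application specific.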
2.3 Architectural Design Space and Constraints
Designing SoCs presents unique issues that need to be addressed. Cost and
technology considerations may limit the system designer to a finite number
of architectures or technology choices. For a given architecture, the choices
for various operating parameters such as supply voltage, clock frequency,
and register word lengths may also be constrained. Because these
parameters have a direct impact on both the system power consumption
and hardware error rate, it is important to optimally choose them.
Therefore, we need a general estimation-theoretic framework that optimizes
system performance or power consumption while accounting for design
constraints. The following example scenario illustrates a way of quantifying
the system design space.
Example 1. The application developer has available a system
consisting of a set of N identical processing elements, e.g., a
multi-core system. (This may be a common scenario since such
highly parallel designs have recently become rather inexpensive.)
The circuit may be operated at adjustable supply voltage Vdd,
with each processing element operating at a clock frequency
chosen from the set F = {fi}.
The architectural constraint of the estimation-theoretic framework is
specified by means of a discrete set of choices, $\mathcal{A}$. Each element of this set
can be configured by a continuously variable parameter vector, $\vec{\lambda}$ (for
example, the supply voltage, $V_{dd}$, in Example 1); that is,
\[ \mathcal{A} = \{A_1(\vec{\lambda}), A_2(\vec{\lambda}), \ldots\} \]
where the $A_i$ in Example 1 would be $F \times N_{on}$, where $N_{on}$ is the number of
active processing elements.
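For Example 1, the discrete part of the design space can be enumerated directly. The sketch below is illustrative only: the frequency values and the helper name `design_space` are assumptions, and the continuous parameter (the supply voltage) is left to be chosen separately for each discrete choice.

```python
from itertools import product


def design_space(frequencies, n_total):
    """Enumerate the discrete architectural choices (f, n_on): every clock
    frequency in F paired with every count of active processing elements."""
    return [(f, n_on) for f, n_on in product(frequencies, range(1, n_total + 1))]


# Hypothetical clock frequencies (Hz) for a four-element system.
F = [100e6, 200e6, 400e6]
choices = design_space(F, n_total=4)
```

Each tuple in `choices` plays the role of one $A_i$; an optimizer would then tune the supply voltage for each tuple.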
2.4 Canonical Problems of the Estimation-Theoretic
Framework
The estimation-theoretic computational system design optimization may be
stated in two canonical problems. The first problem seeks to minimize
power consumed in arriving at a computational result (analogous to
estimate), while constraining the average system accuracy (analogous to
average risk). Let $\theta$ be the result being computed that belongs to some set
$\Lambda$, and let $\hat{\theta}$ be the estimator that operates on the input, $\vec{Y}$. We can state
this problem as follows:
Problem 1: Performance-constrained system.
\[
\begin{aligned}
& \hat{\theta}(\vec{Y}) = \arg\left\{ \min_{\theta \in \Lambda,\, A \in \mathcal{A}} P(\hat{\theta}(\vec{Y})) \right\} \\
& \text{subject to} \\
& \quad E_\theta\{C[\hat{\theta}(\vec{Y}), \theta]\} \leq C_{Target} \\
& \quad \mathcal{A} = \{A_1(\vec{\lambda}), A_2(\vec{\lambda}), \ldots\}
\end{aligned}
\tag{2.1}
\]
where $\mathcal{A}$ is the set of architectural choices, the vector $\vec{\lambda}$ defines
the tunable parameters of an architecture (e.g., supply voltage
or clock frequency), and the function, $P(\cdot)$, computes the power
consumed in arriving at an estimate.
The second problem seeks to minimize the average risk incurred in misestimating
the result while constraining the power consumed to be within budget. For
a battery-operated system, for example, the amount of stored energy may
be used to arrive at a power budget. Cooling and packaging costs may
define this budget for tethered systems.
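Problem 2: Power-constrained system.
\[
\begin{aligned}
& \hat{\theta}(\vec{Y}) = \arg\left\{ \min_{\theta \in \Lambda,\, A \in \mathcal{A}} E_\theta\{C[\hat{\theta}(\vec{Y}), \theta]\} \right\} \\
& \text{subject to} \\
& \quad P(\hat{\theta}(\vec{Y})) \leq P_{Budget} \\
& \quad \mathcal{A} = \{A_1(\vec{\lambda}), A_2(\vec{\lambda}), \ldots\}
\end{aligned}
\tag{2.2}
\]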
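When the design space is small and discretized, both canonical problems can be approached by exhaustive search. The sketch below is a toy illustration under strong assumptions: the estimator is held fixed and only the architecture is searched, the power model is the familiar dynamic-power form (proportional to $V_{dd}^2 f$ per active element) with a hypothetical capacitance constant, and the risk model is an invented placeholder rather than a model from this work.

```python
from itertools import product


def power(f, n_on, vdd, c_eff=1e-9):
    # Dynamic power of n_on elements; c_eff is a hypothetical constant (farads).
    return n_on * c_eff * vdd ** 2 * f


def risk(f, n_on, vdd):
    # Placeholder risk model (assumption): risk grows with frequency and
    # shrinks with supply voltage and with the number of active elements.
    return (f / 1e8) / (n_on * vdd ** 4)


def candidates(freqs, n_total, vdds):
    # Discrete choices (f, n_on) combined with a grid over the supply voltage.
    return list(product(freqs, range(1, n_total + 1), vdds))


def solve_problem_1(cands, c_target):
    """Minimize power subject to risk <= c_target (performance-constrained)."""
    feasible = [c for c in cands if risk(*c) <= c_target]
    return min(feasible, key=lambda c: power(*c)) if feasible else None


def solve_problem_2(cands, p_budget):
    """Minimize risk subject to power <= p_budget (power-constrained)."""
    feasible = [c for c in cands if power(*c) <= p_budget]
    return min(feasible, key=lambda c: risk(*c)) if feasible else None


cands = candidates([1e8, 2e8], n_total=4, vdds=[0.8, 1.0, 1.2])
best_1 = solve_problem_1(cands, c_target=0.5)
best_2 = solve_problem_2(cands, p_budget=0.3)
```

Either solver returns `None` when no configuration satisfies its constraint, which corresponds to an infeasible specification.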
2.5 Applications of the Estimation-Theoretic Design
Framework
The canonical problems of the estimation-theoretic framework are very
general, and different applications may lead to very different estimation
problems. The value of the abstraction presented through the canonical
problems is that it allows the system designer to quickly recognize relevant
estimation-theory results and apply them to many important system-design
problems. We present four classes of systems in which this framework offers
marked improvements in robustness. These classes are defined by the
interrelationships among the subsystems of a system.
CHAPTER 3
CLASS I: STATISTICALLY SIMILAR
PARALLEL SYSTEMS
The estimation-theoretic framework of the previous chapter is a
mathematical abstraction. This and subsequent chapters demonstrate this
framework in real applications by presenting specific design techniques.
This chapter focuses on an important class of parallel systems containing
identical subblocks. A particular restriction of this class is that under
hardware-error-free operation, the subcomputations exhibit identical and
independent additive noise statistics.
3.1 Class Overview
The first class of systems is implemented as a set of parallel, statistically
similar subsystems. Figure 3.1 shows such a decomposition of a traditional
system. In contrast to the traditional system that produces a single output
based on its inputs, the decomposed system produces a set of parallel
results that are combined to yield the final result. In such systems, the
input may be divided into subsets that may be processed by a set of
parallel processing elements (PEs), $\{PE_1, \cdots, PE_m\}$. The results,
$\{\theta_i, i \in \{1, \cdots, m\}\}$, act as error-prone measurements, $\vec{Y}$, of the final
output, $\theta$. By parallelizing the original system, we can gain redundancy
from the multiple statistically similar estimates and overcome most of the
computational errors without incurring the penalty of explicit redundancy.
The framework presented in this chapter produces the optimal design of a
fusion center that combines the estimates while guaranteeing that the final
output meets the application performance specification (such as BER or
mean-squared error). The PEs are subject to the same application cost
function, $C(\theta, \hat{\theta}(\vec{Y}))$. The system-level application cost function, $C(\theta, \hat{\theta})$, is
given by
\[ C(\theta, \hat{\theta}) = \frac{1}{m} \sum_{i=1}^{m} C(\theta_i, \hat{\theta}_i) \]
where $C(\theta_i, \hat{\theta}_i)$ is the application cost function of the i-th PE.
The system power consumption, $P_{sys}$, is given by
\[ P_{sys} = \sum_{i=1}^{m} P_{PE_i} + P_{fusion} \]
where $P_{PE_i}$ is the power consumed by the i-th processing element and
$P_{fusion}$ is the power consumed by the fusion center.
Figure 3.1: Class I systems benefit from multiple parallel estimates. (a)
shows the desired system that is free of any computational errors. (b)
shows a system that is prone to errors due to statistical variations. (c)
shows a parallelized implementation of (b) that takes advantage of multiple
estimates. These estimates are fused to produce a robust final output.
Parallel systems belonging to this class are common in many
signal-processing applications. Finite impulse response (FIR) filters are
often parallelized using polyphase decomposition [29]. Parallel
divide-and-conquer techniques concurrently perform a set of
subcomputations and later merge them to produce the final result. These
techniques are commonly used to solve problems in computational
geometry in a variety of fields such as pattern recognition, image
processing, computer graphics, and VLSI design [30].
3.2 Application: PN-Code Correlator
Spread-spectrum communication systems use FIR filters as matched filters
for pseudo-noise (PN) codes when identifying a user in a multiple access
channel. The receiver correlates the noisy received signal with a local code,
and this output is then processed by a detector [31]. The peaks in the
output of the matched filters are used for detection and synchronization of
PN sequences, a procedure called code acquisition. This code acquisition is
a computationally critical block in a spread-spectrum communication
receiver [31].
3.3 System and Noise Model
Matched filters arise in many circuit designs and often consume most of the
chip area and power. Parallel implementations of matched filters through
polyphase decomposition are often used to increase throughput or reduce
power consumption [29]. Figure 3.2 shows an implementation of a parallel
matched filter. In the event of a successful match, the signal components of
the outputs of the filter banks of a polyphase matched filter are completely
correlated. Such a statistical relationship between the subcomputations of
the filtering operation can be exploited to gain error robustness and power
reduction without the need for redundant computation.
For this application example, we use voltage overscaling (VOS) [24] as an
error-prone power-reduction technique. Because the slowed computations may
fail to meet timing constraints, VOS results in increased errors. Figure 1.1
shows how an increase in gate delays can result in timing violations in
critical paths. These violations cause timing errors when those paths are
exercised. If the critical paths are exercised only infrequently, then the
timing errors will occur only occasionally. While voltage overscaling is one
cause of timing errors, the methods presented in this chapter are generally
applicable to timing errors caused by other phenomena, such as process
variations and soft errors.
Figure 3.2: Polyphase decomposition of the matched filter yields multiple
statistically similar estimates that are fused to obtain a robust result. (a)
Conventional matched filter sums the sensor outputs. (b) Robust matched
filter controls the influence of erroneous sensor outputs.
The relationship between supply voltage and timing errors has been
previously studied in [32]. The VOS errors tend to be large in magnitude
because timing errors affect the most significant bits of computations that
are performed in an LSB-first manner. The output is therefore
contaminated by a mixture of the noise already present in the input and
the large-magnitude VOS errors. In the presence of hardware errors that
occur with probability $\epsilon$, the output can be modeled as random variables
drawn from a class of distributions that is Gaussian with probability $(1 - \epsilon)$
and some unknown distribution with probability $\epsilon$ for some $0 < \epsilon < 1$, as
shown here,
\[ \mathcal{P} = \{F \mid F = (1 - \epsilon)\Phi + \epsilon H,\ H \in \mathcal{R}\} \tag{3.1} \]
where $\Phi$ is the class of standard normal distributions (a good model for
input or measurement noise), $H$ is the class of arbitrary densities with zero
mean and finite but unbounded variance (used to model hardware errors
such as those caused by timing violations), and $\mathcal{R}$ is the set of all
probability measures on the real line. An element $F$ of the set $\mathcal{P}$ is the
probability distribution of the computational noise (i.e., a combination of
system or input noise and hardware errors caused by power-reduction
schemes). Figure 3.3 shows example distributions for different values of $\epsilon$.
Figure 3.3: For $\epsilon > 0$, the mixture distribution is no longer Gaussian.
(The figure plots $f(x)$ against $x$ for $\epsilon$ = 0, 0.25, 0.5, 0.75, and 1.)
It is important to note that because the exact probability model of the
hardware errors may be unknown and time varying, this mixture model
allows us to design for the worst-case error model. Adopting the mixture
model in (3.1) makes our design methodology indifferent to any specific
error model.
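To build intuition for the mixture model in (3.1), one can draw samples from a particular member of the class. The sketch below is an assumption-laden illustration: the contaminating distribution is unknown in the model, so a wide zero-mean uniform distribution is used here purely as a stand-in, and the function name and `outlier_scale` parameter are hypothetical.

```python
import random


def sample_contaminated(eps, n, outlier_scale=50.0, seed=0):
    """Draw n samples of computational noise: standard normal with
    probability 1 - eps, and a large zero-mean error with probability eps.
    The uniform error distribution is an arbitrary stand-in for H."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        if rng.random() < eps:
            samples.append(rng.uniform(-outlier_scale, outlier_scale))
        else:
            samples.append(rng.gauss(0.0, 1.0))
    return samples


clean = sample_contaminated(eps=0.0, n=1000)
noisy = sample_contaminated(eps=0.1, n=1000)
```

Because the design targets the worst case over the whole class, any conclusions drawn from one stand-in distribution such as this should be treated as illustrative only.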
Architecture-level timing errors
Timing errors may be caused by process variations or by voltage overscaling
and depend on the input. The probability of the occurrence of timing errors
will depend on the path-delay distribution of the architecture and the pdf
of the input. Figure 3.4(a) shows the probability of VOS errors as a
function of the VOS factor for a 16-bit ripple-carry adder implemented in
the IBM 130 nm process technology. The probability of error due to process
variations ($\epsilon_{process}$) for the ripple-carry adder is shown in Figure 3.4(b) as
a function of the standard deviation of the gate delay due to variations, $\sigma_g$.
Here the supply voltage is chosen to avoid errors at the nominal process
corner.
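The input dependence of timing errors can be illustrated with a simplified model of a ripple-carry adder. The sketch below assumes uniformly distributed inputs and assumes that a timing error occurs exactly when the longest carry-propagation chain exceeds the number of stages that fit in the clock period (the hypothetical parameter `max_chain`). It is not the circuit-level simulation behind Figure 3.4 and will not reproduce those numbers.

```python
import random


def longest_carry_chain(a, b, width=16):
    """Length of the longest run of bit positions through which a single
    carry ripples when adding a and b."""
    longest = run = 0
    carry = 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:                 # carry generated here: a new chain starts
            run, carry = 1, 1
        elif (x ^ y) and carry:   # incoming carry propagates through this bit
            run += 1
        else:                     # carry is absorbed or absent
            run, carry = 0, 0
        longest = max(longest, run)
    return longest


def timing_error_rate(max_chain, trials=20000, width=16, seed=0):
    """Fraction of random additions whose carry chain exceeds max_chain."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a, b = rng.getrandbits(width), rng.getrandbits(width)
        if longest_carry_chain(a, b, width) > max_chain:
            errors += 1
    return errors / trials
```

Under these assumptions, long carry chains are rare for random inputs, which is consistent with the observation that critical paths are exercised only infrequently.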
Figure 3.4: The 16-bit ripple-carry adder has a relatively graceful increase
in the probability of error as (a) the voltage overscaling factor is increased,
or (b) as the standard deviation of the gate delay under process variations
is increased. (Panel (a) plots $\epsilon_{vos}$ against $K_{VOS}$; panel (b) plots
$\epsilon_{process}$ against the slow process corner in multiples of $\sigma_g$.)
3.4 Estimation-Theoretic Design
Because this is a filtering application, we adopt a mean squared error cost
for this system. Since mean squared error can be decomposed into a bias
term and a variance term, and minimizing bias may lead to unrealizable
estimators, we restrict ourselves to unbiased estimators [33]. Therefore, the
cost function for this application minimizes the variance of the estimate. In
this polyphase implementation of the matched filter, the outputs of
individual PEs (i.e., the filter banks in Figure 3.2) are summed to produce
the final output. Thus, for each PEi,
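\[ C(\theta_i, \hat{\theta}_i) = E_{\theta_i}\{\hat{\theta}_i - \theta_i\}^2 \]
The total system cost function is given by
\[ C(\theta, \hat{\theta}) = \frac{1}{m} \sum_{i=1}^{m} C(\theta_i, \hat{\theta}_i) \]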
We identify the architectural space to consist of N identical processors
because of the optimality proven in Corollary 1 (see Section 3.5) and the
low design cost and commercial availability of such multiprocessors. The
architectural constraint set, A, is the discrete set of voltages, V, at which
the N identical processing elements may be operated.
The problem in this application corresponds to Problem 2 of the
estimation-theoretic framework. The estimation-theoretic risk function in
this application is given by
\[ C(\theta, \hat{\theta}) = E_\theta\{\hat{\theta}(\vec{Y}) - \theta\}^2 \]
where $\vec{Y}$ denotes the subcomputation results used to arrive at the final
estimate, $\hat{\theta}$, and $\theta$ is the desired output. We seek to minimize the maximum
asymptotic variance of the computation result while constraining the power
consumption of the robust PN-code correlator to a given value less than
that of a traditional PN-code correlator. This can be stated as
\[
\begin{aligned}
& \hat{\theta}(\vec{Y}) = \arg\left\{ \min_{\theta \in \Lambda} E_\theta\{\hat{\theta}(\vec{Y}) - \theta\}^2 \right\} \\
& \text{subject to} \\
& \quad P(\hat{\theta}) \leq P_{trad} - P_{savings} \\
& \quad \mathcal{A} = \mathcal{V}
\end{aligned}
\tag{3.2}
\]
where $P_{trad}$ is the power consumed by a traditional system, $P_{savings}$ is the
targeted power savings, and $\mathcal{V}$ is the set of supply voltages at which the
system may be operated.
3.5 Solution Method: Robust Statistics
The following model can be used to describe the outputs of the
subcomputations:
\[ Y_i = \theta + \eta_i \quad \text{for } 1 \leq i \leq M \tag{3.3} \]
where $M$ is the number of filter banks, $Y_i$ is the i-th filter-bank output, $\theta$ is
the actual output, and $\eta_i$ is the additive noise that includes hardware
errors. Hardware errors due to process variations or voltage overscaling are
assumed to occur with known probability, $\epsilon$. With this model, $\eta_i$ is assumed
to be drawn from the class of mixture distributions in (3.1).
In order to recover from timing errors, we need to appropriately process
the outputs of the filter banks. If the outputs are merely summed, as is
common in polyphase decompositions, timing errors will cause gross errors in
the system output. Therefore, we need a more sophisticated fusion method.
A common problem in statistics is that of inferring a parameter based on
a set of its observations that may be corrupted by random noise drawn
from some probability distribution. In practice, this probability distribution
is seldom known a priori; one is forced to work with an assumed
distribution. The quality of the statistical inference often suffers when the
actual pdf deviates from the assumption. Robust statistics is the science of
building in insensitivity of the estimate to small deviations of the noise
statistics from its assumed distribution [34]. An inference method is said to
be robust if it exhibits optimal or near-optimal performance when the
assumed model is correct, the performance worsens only slightly when the
deviation is mild, and large deviations from the assumed model do not
cause drastic performance losses. For such problems, it is common to use
the asymptotic variance of the estimate as a performance measure [34].
In the absence of timing errors, a maximum-likelihood estimate is
optimal in the sense of minimum asymptotic variance [33]. This optimality
breaks down once the estimators are allowed to make errors with
unbounded variance; the asymptotic variance of the maximum-likelihood
estimate now becomes unbounded [33]. In light of this fact, we seek an
estimator that minimizes the worst-case variance for timing errors drawn
from probability distributions belonging to P in (3.1).
The work in [34] considers a class of estimators known as M-estimators
that are of the form
\[ \sum_{k=1}^{n} \psi[Y_k - \theta] = 0 \tag{3.4} \]
where $\psi$ is a general odd-symmetric function known as the influence
function, as shown in Figure 3.5. The quantities $Y_k$ are the noisy
observations, and $\theta$ is the parameter to be estimated. (Note: $\psi(x) = x$
gives the least squares estimate, and $\psi(x) = -f'(x)/f(x)$ gives the
maximum-likelihood estimate for the pdf $f(x)$ [33].) Under certain
regularity conditions on $\psi$ and $f$, such M-estimates are consistent and
asymptotically normal [33]. Huber in [34] derives a robust M-estimator by
first identifying the least informative distribution in the class of
$\epsilon$-contaminated distributions; this gives the worst-case variance. The
maximum-likelihood estimate for this least informative distribution is the
desired robust estimate. For the case of $\epsilon$-contaminated $N(0, 1)$
distributions, the influence function, $\psi$, is given by
\[ \psi(x) =
\begin{cases}
x, & \text{if } |x| \leq k \\
k \, \mathrm{sgn}(x), & \text{else}
\end{cases}
\tag{3.5} \]
where $k$ is a constant that depends only on $\epsilon$ and the nominal distribution,
$N(0, 1)$ [34].
Figure 3.5: (a) For $\epsilon = 0$, the $\psi$ function corresponding to the
maximum-likelihood estimate is linear. (b) For $\epsilon > 0$, the $\psi$ function for the
maximum-likelihood estimate is a nonlinear function that has a clipping
effect. (Both panels plot $\psi(x)$ against $x$ with the normalized residuals
$(y_i - \theta)/\sigma$ marked; in panel (b) the function is clipped at $\pm k$.)
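The M-estimate defined by (3.4) with the clipping influence function in (3.5) can be computed numerically. The sketch below is one simple way to do so and is not the procedure used in this work: it assumes a known noise scale `sigma`, uses a hypothetical clipping constant `k = 1.5` rather than one derived from $\epsilon$, and solves (3.4) by a basic fixed-point iteration started at a median.

```python
def huber_psi(x, k):
    # Influence function of (3.5): linear inside [-k, k], clipped outside.
    if abs(x) <= k:
        return x
    return k if x > 0 else -k


def m_estimate(ys, k=1.5, sigma=1.0, iters=50):
    """Solve sum_k psi((Y_k - theta) / sigma) = 0 for theta by fixed-point
    iteration. Each step moves theta by the average clipped residual."""
    theta = sorted(ys)[len(ys) // 2]  # start from the (upper) median
    for _ in range(iters):
        step = sigma * sum(huber_psi((y - theta) / sigma, k) for y in ys) / len(ys)
        theta += step
        if abs(step) < 1e-12:
            break
    return theta


# Hypothetical filter-bank outputs: five good estimates near 1 and one
# gross error of the kind a timing violation in the MSBs might cause.
ys = [0.9, 1.1, 1.0, 0.95, 1.05, 500.0]
robust = m_estimate(ys)                       # clipped influence
plain_mean = m_estimate(ys, k=float("inf"))   # psi(x) = x recovers the mean
```

In this example the gross error contributes at most `k * sigma` to the estimating equation, so the robust estimate stays near the good values while the plain mean is pulled far away.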
The preceding discussion uses a standard normal distribution for the
nominal distribution. While the zero-mean assumption may be admissible,
often the variance is different from unity. In such cases, it becomes
necessary to estimate both the mean and the variance. See [34] for a simple
iterative scheme, called the one-step Huber algorithm, that takes a
preliminary estimate of the variance and then estimates the mean.
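A one-step scheme of this general kind can be sketched as follows. The details here are assumptions and may differ from the algorithm in [34]: the preliminary location is the sample median, the preliminary scale is the median absolute deviation divided by 0.6745 (a common normal-consistent scale estimate), and the mean is then refined by a single Newton-type step on the clipped residuals with a hypothetical clipping constant `k = 1.5`.

```python
import statistics


def one_step_huber(ys, k=1.5):
    """Preliminary scale from the MAD, then one Newton-type update of the
    location estimate using the clipping influence function."""
    theta0 = statistics.median(ys)
    mad = statistics.median([abs(y - theta0) for y in ys])
    sigma = mad / 0.6745            # normal-consistent scale estimate
    if sigma == 0:
        return theta0               # degenerate data: keep the median
    residuals = [(y - theta0) / sigma for y in ys]
    clipped = [max(-k, min(k, r)) for r in residuals]
    n_inside = sum(1 for r in residuals if abs(r) <= k)
    if n_inside == 0:
        return theta0
    return theta0 + sigma * sum(clipped) / n_inside


# Same hypothetical filter-bank outputs as before, with one gross error.
estimate = one_step_huber([0.9, 1.1, 1.0, 0.95, 1.05, 500.0])
```

A single update avoids the cost of iterating to convergence, which is relevant given the tight computational budgets of SoC implementations.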
Having identified the optimal estimator for use in this application, it is
possible to derive guidelines for designing the filter banks of the PN
correlator. A robust matched filter uses polyphase decomposition to yield
multiple estimates that are fused to generate the final output. The robust
PN-code correlator is a specific instance of a robust matched filter. The
following theorem, termed the Equipartition Theorem, establishes a
sufficient condition for designing robust matched filters that will result in
minimum variance of the estimate.
Theorem 1. Equipartition Theorem: A robust k-phase matched-filter
implementation produces an estimate with minimum variance if the filter
banks are of equal energy.
Proof. See Appendix A.
Corollary 1 provides the corresponding design guideline for the specific
case of a PN-code correlator.
Corollary 1. The robust k-phase PN-code correlator produces an estimate
with minimum variance if the filter banks are of equal length.
Proof. See Appendix A.
Consequently, we choose an equally partitioned implementation as shown
in Figure 3.2.
3.6 Hardware Implementation
For the class of parallel and statistically similar systems, the adoption of
the estimation-theoretic framework results in the application of the theory
of robust statistics. However, in the context of SoCs, the severe constraints
on computational resources require that we approximate the results of
classical estimation theory. The overhead of computing the optimal
M -estimate may prove to be too expensive in many practical applications.
This section describes the details of hardware implementation of a PN
acquisition block that is commonly used in wireless CDMA systems.
(This work was performed in close collaboration with Dr. Girish
Varatkar while he was a graduate student at the University of Illinois. My
particular contribution was in identifying and developing the algorithms
used here, and Dr. Varatkar performed the hardware simulations. All
results are included here for completeness of the exposition, and were
previously published in [35].)
Figure 3.6: Matched filter for PN acquisition: (a) direct form and (b) an
SNC-based architecture.
3.6.1 Stochastic Networked Computation
From the estimation-theoretic view of computation, SoCs resemble networks
of computational elements that are susceptible to hardware errors. The
notion of the outputs of networked processing elements being stochastic has
been presented in the author’s prior publication [35] as stochastic networked
computation (SNC). The results presented in this section are reproduced
from this publication and use this term. In this section, the power
consumption and performance of SNC-based architectures are compared to
those of conventional architectures through hardware simulations.
Conventional architecture
Figure 3.6(a) shows the direct-form implementation of the matched filter
with N taps. Multiply-accumulate (MAC) units are commonly employed to
compute the correlation of the received signal with the PN code [36]. In the
conventional architecture, the MACs are designed and operated at a critical
supply voltage, Vdd−crit, such that the worst-case critical path (with respect
to process, voltage, and temperature corners and the input) satisfies the
timing constraint determined by the clock period. The direct-form
structure in Figure 3.6(a) can be parallelized through a polyphase
decomposition. The critical path delay of the MAC units in this
implementation will be equal to that of the MAC units in the SNC-based
architecture presented next.
SNC-based, robust, low-power architecture
The polyphase filter-banks of the SNC-based architecture provide
statistically similar estimates of the final computation result, as shown in
Figure 3.6(b). In the absence of timing errors, the outputs of the M filter
banks are contaminated by Gaussian noise due to subsampling of the input.
If the computations in the filter banks are prone to timing errors, caused
either by process variations or by voltage overscaling, their outputs will be
contaminated by a large-variance distribution that is non-Gaussian and
hard to model. The M -operand adder in the conventional architecture is
replaced by a fusion block that is capable of producing a final output that
is robust to timing errors.
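The difference between the conventional M-operand adder and a robust fusion block can be illustrated with a minimal sketch. It assumes an interleaved (subsampled) polyphase split, a noise-free received signal, a single hypothetical large-magnitude timing error in one filter bank, and a median fusion rule scaled by M so that it is comparable to the sum. None of the function names or numeric values come from the original design.

```python
import numpy as np

def polyphase_bank_outputs(received, code, M):
    """Split an N-tap correlation into M interleaved sub-correlations."""
    received = np.asarray(received, dtype=float)
    code = np.asarray(code, dtype=float)
    return np.array([np.dot(received[m::M], code[m::M]) for m in range(M)])

def fuse_sum(bank_outputs):
    """Conventional fusion: M-operand adder."""
    return np.sum(bank_outputs)

def fuse_median(bank_outputs):
    """Robust fusion (assumed scaling): M times the sample median."""
    return len(bank_outputs) * np.median(bank_outputs)

rng = np.random.default_rng(0)
N, M = 256, 8
code = rng.choice([-1.0, 1.0], size=N)          # stand-in for a PN code
banks = polyphase_bank_outputs(code, code, M)   # noise-free: each bank gives N/M

faulty = banks.copy()
faulty[3] += 4096.0   # hypothetical timing error in one bank

print(fuse_sum(banks), fuse_sum(faulty), fuse_median(faulty))
```

In this sketch the adder output is pulled far from the true correlation by the single faulty bank, whereas the scaled median is unaffected.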
3.6.2 Architecture of the fusion block
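In order to apply (3.5), we need to estimate the scale (standard deviation), σ, of the
observations. The one-step Huber algorithm [34] can be used to estimate σ
and the parameter θ, as shown below:
1. Compute scale estimate $\hat{\sigma}$ (median absolute deviation):
\[ \hat{\theta}_0 = \mathrm{median}\{y_i\} \]
\[ \hat{\sigma} = 1.4826 \cdot \mathrm{median}\{|y_i - \hat{\theta}_0|\} \]
2. Compute location estimate $\hat{\theta}$:
\[ \hat{\theta}_1 = \hat{\theta}_0 + \frac{\frac{1}{M}\sum_i \psi(y_i - \hat{\theta}_0,\ \hat{\sigma} \cdot k_{\mathrm{table}})}{0.5} \]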
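The two steps above can be sketched in a few lines of Python. This is an illustrative reading of the listing, not the hardware implementation: it assumes that ψ(·, c) denotes the clipping function of (3.5) with threshold c, that the constant 0.5 is the denominator of the correction term, and that k_table = 1.5 is a reasonable stand-in for the ROM value. The function names are hypothetical.

```python
import numpy as np

def huber_psi(x, c):
    """Clip x to [-c, c]; assumed form of psi(., c) from (3.5)."""
    return np.clip(x, -c, c)

def one_step_huber(y, k_table=1.5, denom=0.5):
    """One-step location estimate starting from the sample median."""
    y = np.asarray(y, dtype=float)
    theta0 = np.median(y)                                # preliminary location
    sigma_hat = 1.4826 * np.median(np.abs(y - theta0))   # MAD scale estimate
    correction = np.mean(huber_psi(y - theta0, sigma_hat * k_table)) / denom
    return theta0 + correction, sigma_hat

# Seven consistent filter-bank outputs and one gross error (made-up values).
y = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 50.0]
theta1, sigma_hat = one_step_huber(y)
print(theta1, np.mean(y))
```

In this example the one-step estimate stays close to the bulk of the observations, while the sample mean is dominated by the single outlier.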
An implementation of the one-step Huber algorithm is shown in
Figure 3.7. The median filter for the robust fusion algorithm is
implemented using the architecture described in [37] by replacing the
analog majority gate with a digital version. The value of ktable used in the
one-step Huber algorithm depends only on  and, hence, is predetermined
for a few values of  and stored in a ROM. Since the sample median is
another robust order statistic, we consider it as a practical approximation
of the one-step Huber algorithm. In the following, SNC (median) and SNC
(one-step) denote SNC-based architectures that employ the sample median
and the one-step Huber fusion algorithms, respectively. The gate
complexity of the sample median block and the one-step Huber architecture
(M = 8) is tabulated in Table 3.1.
Figure 3.7: The one-step Huber fusion-block architecture.
Table 3.1: Gate-level complexity of various components of the robust PN
acquisition system for M = 8 and 13-bit filter-bank outputs.
Block                              Full adders   Registers   Other gates
Filter bank (multiply-accumulate)  104           280         72
Median                             104           -           416
One-step Huber                     772           -           1168
3.7 Simulation Results
This section evaluates the performance and power consumption of the
fusion-block designs presented in the previous section, namely the one-step
Huber algorithm and the median block as a practical but approximate
algorithm.
Figure 3.8: Simulation setup showing (a) various process and voltage
conditions for the process variations scenario and the VOS scenario and (b)
normalized delay distributions of various gates for a 3σg slow die with
variations at Vdd = 1.2 V.
3.7.1 Algorithmic setup
The system-level throughput of the PN-code acquisition system was chosen
to be 12.5 Mchips/s. We chose a PN code of length N = 256 from a subset
of the length-2^15 PN sequence specified in the CDMA2000 standard. The
received signal was assumed to be an 8-bit length-1000 subsequence of the
same PN sequence corrupted by AWGN channel noise to yield an
SNR = −12 dB. The fusion algorithms were implemented in MATLAB.
We assume that the fusion blocks operate in an error-free manner. The
performance of the PN-code acquisition is measured by its receiver
operating characteristic (ROC) that plots the probability of detection
(PDet) versus probability of false-alarm (PF ). A false-alarm event occurs
when the threshold detector incorrectly declares the presence of a PN code
in the input. In accordance with the CDMA2000 standard, a false-alarm
rate of 5% was chosen to compare the performance and power consumption
of the different designs.
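The operating point used for these comparisons can be computed as in the following sketch. It assumes that detector outputs are available for noise-only inputs (`h0_outputs`) and for inputs containing the PN code (`h1_outputs`), and that the threshold is set from the empirical quantile of the noise-only outputs. Both array names and the toy values are hypothetical.

```python
import numpy as np

def pdet_at_fixed_pf(h0_outputs, h1_outputs, pf=0.05):
    """Set the threshold for a target false-alarm rate, then measure PDet."""
    threshold = np.quantile(h0_outputs, 1.0 - pf)          # exceeded with prob. ~pf under H0
    pdet = np.mean(np.asarray(h1_outputs) > threshold)     # fraction of detections under H1
    return threshold, pdet

# Toy data: 100 noise-only outputs and 4 outputs with the code present.
h0 = np.arange(100.0)
h1 = np.array([90.0, 95.0, 100.0, 120.0])
threshold, pdet = pdet_at_fixed_pf(h0, h1)
print(threshold, pdet)
```

Sweeping `pf` over a range of values and recording the resulting `pdet` traces out an empirical ROC.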
3.7.2 Architecture and circuit setup
We consider process variations and voltage overscaling as two different
sources of timing errors. Figure 3.8(a) lists the choice of parameters for
process variation and VOS models used here. The nominal supply voltage
for this 130 nm CMOS process from IBM is 1.2 V. Process variations are
simulated by modeling circuits designed for the 3σ slow process corner at
critical supply voltage, Vdd−crit =1.2 V, and under adaptive supply-voltage
(ASV) boost and adaptive body-bias (ABB) voltage. This process is
repeated 30 times in order to obtain 30 instances of the conventional and
SNC architectures. Each of the 30 instances of the conventional and the
SNC architectures was simulated using an HDL simulator that operates at
the gate level. Voltage overscaling simulations are performed at the
nominal process corner at various subcritical supply voltages that are listed
in the second row of Figure 3.8(a).
The delays for logic gates such as the full adder and the XOR were
characterized using Monte Carlo simulations for the IBM 130 nm CMOS
process. Figure 3.8(b) shows the normalized delay distributions resulting
from process variations at Vdd = 1.2 V. The transistor-level netlists of the
filter banks were simulated using HSPICE, and their power consumption at
different supply voltages and body-bias voltages was obtained using random
input vectors. The power consumption of the fusion blocks (conventional
M -operand adder and median) was obtained using Synopsys Design
Analyzer at a clock frequency of fsclk = 12.5 MHz.
3.7.3 SNC performance under VOS
Figure 3.9 shows that the SNC-based architecture offers better PDet at all
values for PF under VOS. The performance of the median is comparable to
the one-step Huber algorithm. We therefore suggest the median as a
practical alternative that is an approximation to the one-step estimate.
Each plot also shows the performance of an error-free (non-overscaled)
matched filter as a comparison.
Figure 3.10 shows the power consumption versus performance trade-off at
PF = 5% for two different polyphase structures. In (a), the polyphase
decomposition consists of M = 8 filter banks, and in (b), it consists of
M = 32 filter banks. Figure 3.10(a) for M = 8 shows that the PDet of SNC
(median) is three orders of magnitude better than that of the conventional
architecture and that, for approximately equal PDet, SNC (median) reduces
power consumption by 36%. The power savings
remain approximately unchanged with M = 32, as shown in Figure 3.10(b).
[Figure 3.9 plots: log-log ROC curves of PDet versus PF at KVOS = 0.75 for Error-free, Conventional, SNC (1-step), and SNC (Median); panels (a) M = 8 and (b) M = 32.]
Figure 3.9: ROCs of the PN-code detector at Kvos = 0.75 for (a) M = 8,
and (b) M = 32 show that the SNC-based architecture offers better PDet at
all values of PF .
[Figure 3.10 plots: bar charts of power (mW) for Conventional and SNC (Median) at the nominal process corner and several supply voltages Vdd, each bar annotated with its PDet; panels (a) M = 8 (36% power reduction marked) and (b) M = 32 (34% power reduction marked).]
Figure 3.10: For approximately equal PDet, the SNC-based architecture
reduces power consumption by 36% with M=8 in (a) and by 34% with
M=32 in (b).
3.7.4 SNC performance under process variations
Figure 3.11 shows the ROCs and histograms of PDet of the SNC-based
implementation under process variations. The supply voltages and
body-bias voltages were as listed in Figure 3.8(a). Figure 3.11(b) shows
that the SNC architecture improves the mean of PDet by approximately
three orders of magnitude for a fixed PF of 0.05. Additionally, the SNC
architecture improves the variation, σ/µ, of PDet by two orders of magnitude.
[Figure 3.11 plots, comparing Error-free, Conventional, SNC (1-step), and SNC (Median): (a) ROC (PDet versus PF), M = 8; (b) histogram of PDet (number of instances versus PDet), M = 8, annotated Conventional µ = 0.001, σ = 0.003, σ/µ = 3; SNC (1-step) µ = 0.78, σ = 0.015, σ/µ = 0.02; SNC (Median) µ = 0.80, σ = 0.008, σ/µ = 0.01; (c) ROC, M = 32; (d) histogram of PDet, M = 32, annotated Conventional µ = 0.128, σ = 0.034, σ/µ = 0.27; SNC (1-step) µ = 0.94, σ = 0.015, σ/µ = 0.016; SNC (Median) µ = 0.89, σ = 0.015, σ/µ = 0.016.]
Figure 3.11: Under process variations, the ROC plots in (a) for M = 8 and
(c) for M = 32 show that the SNC-based architecture offers better PDet at
all values of PF . The histograms of PDet for a fixed PF in (b) and (d) show
an improvement in the mean PDet by approximately three orders of
magnitude in (b) and by approximately one order of magnitude in (d).
Figure 3.12(a) compares the power consumption of the conventional and
the SNC architectures. Also shown here are the probabilities of detection at
each setting of Vdd. At the 3σg slow process corner (second bar), the power
consumption of the conventional architecture drops by 8%, but this causes
a reduction in PDet by three orders of magnitude. The application of ABB
and ASV (third bar) to reduce the gate delays and correct the timing errors
increases the power consumption of the conventional architecture by 33% at
(1.35 V, 0.2 V) while achieving a mean PDet = 0.83. The SNC architecture
(fourth bar) at (1.15 V, 0.2 V) achieves a comparable PDet = 0.80, but
consumes 31% less power than the conventional architecture. Similar
trends can be observed for the case when M = 32 in Figure 3.12(b).
[Figure 3.12 plots: bar charts of power (mW) for Conventional and SNC (Median) at the nominal and 3σg slow process corners for several Vdd settings, with Vb = 0.2 V marked; bars annotated in order with PDet = 0.98, 0.001, 0.83, 0.80 in (a) M = 8 (31% reduction marked) and PDet = 0.98, 0.128, 0.57, 0.61 in (b) M = 32 (39% reduction marked).]
Figure 3.12: The SNC-based architecture reduces power consumption by
31% for M = 8 in (a), and by 39% for M = 32 in (b) at a fixed PF = 0.05.
3.8 Conclusion
This chapter has identified a new avenue for power reduction.
Parallelization has often been used as a technique to improve throughput;
however, here we have additionally exploited it for gaining robustness. The
estimation-theoretic framework has proven to be a guide in identifying the
relevant results from robust statistics for this design problem. The PN-code
acquisition application described in this work offers up to 36% savings in
power consumption. The notable feature of the redesign here is that it
allowed us to use the same computational block as the traditional system.
The only change that was required was to swap the traditional fusion block
with a robust counterpart. Therefore, we envision this technique to be
easily adopted in many new parallel applications.
Generalizing this decomposition to include correlated and non-identical
estimates will benefit many additional applications, and will be addressed
in the next chapter.
CHAPTER 4
CLASS II: GENERAL PARALLEL
SYSTEMS
Class I systems imposed severe restrictions on the noise present in the
subcomputation results. Under these restrictions, the results of classical
robust estimation theory were directly applicable. However, adopting
estimation-theoretic design ideas in other applications requires that such
restrictions be relaxed. This chapter develops methods for robust estimation
with correlated observations that are applicable to general parallel systems.
4.1 Class Overview
The central concept in the estimation-theoretic framework is that modern
systems-on-chip (SoCs) look like a network of several subcomponents.
These subcomponents can be seen to produce estimates (or observations) of
the final result. The system designer assumes an observation model that
specifies how the subcomputation results are related to the final SoC
output. For example, in parallel implementations, the final output may be
a linear combination of the different subcomputation outputs.
Appropriately designed post-processing by means of a robust estimator is
employed to mitigate the effect of hardware errors that may occur in the
subcomponents. The assumed system model and models of the noise
sources define the particular robust estimation method used. We will also
refer to this estimation process as fusion, because of the analogy to
traditional sensor networks.
In the previous chapter, this framework was successfully applied in a PN
acquisition block of a wireless CDMA receiver [35]. This application used
voltage overscaling as an error-prone power-reduction technique. The
matched-filter operation in this block was parallelized through polyphase
decomposition. A fusion block computed a robust estimate of the overall
computation based on the outputs of these parallel subcomputations.
Section 4.2.1 provides an overview of robust estimation theory and methods.
The PN acquisition application was able to meet all the limitations of
classical robust estimation theory. These limitations are in the form of the
following assumptions regarding the observation noise [34]:
1. Independence: The noise processes are independent,
2. Homoscedasticity: Their variances are equal, and
3. Normality: Their nominal distribution is standard normal.
In the estimation-theoretic view of an SoC, the normality assumption is
often satisfied because the nominal noise is a consequence of the input or
measurement. Heteroscedastic extensions to robust estimation derived in
[38] address the problem of unequal variances of the random noise across
the different observations. This work extends results from [34] and derives
the optimal weights for accommodating unequal variances. As early as
1964, Huber [34] recognized the need to relax the independence assumption,
and, as of this writing, theoretical extensions to address this issue still seem
absent. Robust techniques that can accommodate correlated observations
will allow us to apply the estimation-theoretic framework to other new
applications.
This chapter is organized as follows. Section 4.2.1 overviews classical
robust estimation theory. Section 4.3 describes the main contribution of
this chapter. The methods presented here extend robust estimation to
correlated situations. Section 4.4 describes an application of this method in
an error-prone multiantenna communication system. Section 8.1 describes
potential extensions to this work, and Section 7.5 concludes this chapter.
4.2 Application: Multiantenna Communication
Receiver
Consider a multiuser wireless communication system. Here it is common to
use multiple antennas at the base station and allow a number of users to
communicate at the same time. By using multiple transmit antennas and
multiple receive antennas, such a system is able to exploit channel-fading
conditions to increase capacity. Figure 4.1 shows one such setup. The base
station is equipped with an r-antenna system that receives transmissions
from t users. The received signal corresponding to the i-th user is
contaminated by a combination of thermal noise and interference from the
transmissions of the other users. It is common to view the two sources of
noise jointly as a combined random vector that is correlated [31]. The
problem then is to estimate the symbols transmitted by a particular user
who is sharing this channel. A practical method of estimating the desired
symbol is first to pre-whiten the observations (to undo the effect of the
noise correlation) and then perform a least squares estimate. This method
is usually referred to as minimum mean squared error (MMSE) estimation [31].
If we allow aggressive power reduction techniques at the receiver, then the
noise will contain outliers caused by hardware errors. The traditional
approach of pre-whitening the observations will spread the outliers
throughout the observations, thereby rendering all of them useless. The
currently known robust estimation methods fail to fully address this
problem. The methods developed in this work are meant to address
estimation problems of this nature.
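The spreading effect of pre-whitening can be seen in a small numerical sketch. It assumes a hypothetical 3 × 3 covariance matrix with strong correlation and a Cholesky-based whitening transform; the specific values are illustrative only.

```python
import numpy as np

# Hypothetical strongly correlated noise covariance.
Sigma = np.array([[1.0, 0.9, 0.9],
                  [0.9, 1.0, 0.9],
                  [0.9, 0.9, 1.0]])
L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T

# A gross error that affects only the first observation.
outlier = np.array([30.0, 0.0, 0.0])

# Whitening applies L^{-1} to the observations, and hence to the outlier.
whitened_outlier = np.linalg.solve(L, outlier)
print(whitened_outlier)
```

Although only one observation is corrupted before whitening, every whitened observation carries a large error component afterwards.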
4.2.1 Robust estimation for error tolerance
Classical parameter estimation methods use probabilistic models of the
observations to arrive at an estimate. A least squares estimate is typically
used when the noise can be modeled as a Gaussian random process. Here,
the weighted mean serves as the optimal estimate [39]. In the absence of
hardware errors, noise in the estimation-theoretic model of an SoC is often
modeled well by a Gaussian distribution. For such systems, the
post-processing by a least-squares fit to the system model will work well.
However, the weighted mean is entirely nonrobust to deviations of the
actual observations from the model assumptions of normality. Therefore, if
it is difficult to accurately model the noise process, then there is a need for
robust techniques.
[Figure 4.1 diagram: users 1 through t transmit to a base station whose r receive antennas produce the observations y1, y2, ..., yr.]
Figure 4.1: The r receive antennas exploit spatial multiplexing of t
simultaneous transmissions. The observation vector {y1, · · · , yr} contains
noise that is usually modeled as a multivariate Gaussian random vector.
The theory of robust statistics [34] addresses the problem of estimating
an underlying parameter on the basis of observations that are contaminated
by noise, η, drawn from a mixture distribution. This mixture distribution is
used to model noise that is nominally Gaussian but occasionally drawn
from an unknown, large-variance distribution, as shown in (4.1).
\[ \eta \sim (1-\epsilon)\,\mathcal{N}(0, \sigma^2 I) + \epsilon H \tag{4.1} \]
Here, H denotes an unknown contaminant distribution with finite but large
variance. The noise samples that are drawn from this distribution are
termed outliers. The quantity N is the normal distribution with mean 0
and variance σ², and is often referred to as the nominal noise. The term I
denotes the identity matrix, indicating that the noise samples are assumed
to be uncorrelated. The term ε is the probability with which an outlier
occurs. Robust parameter estimation methods ensure that infrequent
outliers will not carry undue influence on the estimate. This theory,
developed by Huber [34], seeks to find the least informative distribution
from the class in (4.1). The robust estimate is then the
maximum-likelihood estimate for this distribution and has the effect of
altering the traditional least-squares cost function to a more slowly
increasing function in outlier regimes. A procedure known as M -estimation
[34] is used to find the estimate that minimizes this robust cost function.
Robust estimation methods provide a natural means of achieving error
tolerance in nanometer SoCs. The nature of hardware errors is
fundamentally different from that of input or system noise. In particular,
under voltage/frequency overscaling, violation of timing constraints can
cause hardware errors. These errors manifest as bit-flips in arbitrary
locations within a data word (depending on the specific architecture of the
functional units). For most commonly implemented architectures, such
errors tend to occur more frequently in the most significant bits, because
these architectures compute LSB-first. Errors in the most significant bits
are large in magnitude, so hardware errors will represent statistical
outliers. The reader is referred to [32] for a study of the probability of the
occurrence of hardware errors when aggressive power-reduction techniques
are adopted. The fusion block in the estimation-theoretic design [35]
derives from the theory of robust statistics and is able to withstand timing
violations.
4.3 System and Noise Model
This section describes a new technique for estimating a parameter from its
linear measurements that are corrupted by dependent noise with outliers.
The system model for this problem can be stated as
\[ y = Ax + \eta \tag{4.2} \]
where y are the observations, A is the system matrix that is assumed to be
known, x is the unknown parameter that needs to be estimated, and η is
additive noise.
To account for correlated observations, we can generalize the noise model
(4.1) in the following manner:
\[ \eta \sim (1-\epsilon)\,\mathcal{N}(0, \Sigma) + \epsilon H \tag{4.3} \]
Here the nominal noise process is no longer independent. It is drawn from a
multivariate Gaussian distribution with a known covariance matrix, Σ, and
mean 0. As defined earlier, the parameter ε is the probability of occurrence
of outliers drawn from an unknown distribution H.
The outliers are assumed to be generated by a mechanism that operates
independently on each element of the observation vector. That is, outliers
will replace entries of the nominal multivariate Gaussian vector on an
element-by-element independent basis.
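A sample from this noise model can be generated as in the sketch below. Since H is unknown, the sketch substitutes a zero-mean Gaussian with a large standard deviation (`outlier_std`) purely for illustration; that choice, the function name, and the parameter names are assumptions rather than part of the model.

```python
import numpy as np

def sample_contaminated_noise(Sigma, eps, outlier_std, rng):
    """Draw correlated nominal noise, then replace entries independently with outliers."""
    n = Sigma.shape[0]
    nominal = rng.multivariate_normal(np.zeros(n), Sigma)   # N(0, Sigma)
    is_outlier = rng.random(n) < eps                        # independent per element
    contaminant = rng.normal(0.0, outlier_std, size=n)      # stand-in for H
    return np.where(is_outlier, contaminant, nominal), is_outlier

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
eta, is_outlier = sample_contaminated_noise(Sigma, eps=0.1, outlier_std=50.0, rng=rng)
print(eta, is_outlier)
```

The replaced entries are drawn independently of the nominal vector, which is why the corresponding covariance entries no longer describe them.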
The discussion in this section is not limited to the scope of SoC design
problems and addresses the general robust estimation problem. The results
will later be applied to a specific SoC design problem in Section 4.4.
4.3.1 Related work on robust estimation with correlated
observations
While robust statistics has been studied extensively [34, 40], published
results on robust techniques in the presence of correlation among
observations are surprisingly few [41].
Recently, Wu [42] has reported on the asymptotic behavior of
M -estimation of linear models with dependent errors. This work also
provides an excellent overview of several previously known results under
different restrictive error models. This line of research investigates the
negative impact that dependence has on the M -estimate. However, in our
present context, we may have a priori knowledge about the correlation
structure among the observations. In particular, we seek to exploit this
information to improve our estimation accuracy. For this purpose, these
results on asymptotic behavior of estimators that ignore correlation
information are not applicable, other than to show the importance of
considering correlations when present.
Some earlier works do consider explicit knowledge of correlation, but
restrict the structure of the correlation to very specific types. Masarotto [43]
developed robust estimators for auto-regressive-moving-average parameters
of time-series data. Portnoy in [44] extended the traditional M -estimation
of Huber to include mild dependence generated by a first-order
moving-average process. The authors in [45] developed M -estimators for
wide-sense-stationary noise. General SoCs need not exhibit these noise
structures. Correlation can be generated by the application scenario or as a
consequence of the way computation is partitioned in an SoC. As a result,
we need to accommodate arbitrary correlation structures in order to gain
robustness to hardware errors. To the best of our knowledge, there do not
exist good extensions of Huber’s M -estimation that can adequately address
such situations.
Recently, researchers in the field of geodesy have studied this problem
and have developed some techniques. The authors in [46] adopt a
variance-inflation model for outliers. Under this model, outliers are seen as
inflating the variance of the corresponding observation. They use the
distribution of the residuals to flag potential outliers. Upon identifying the
outliers, they adjust the covariance matrix to reflect the corresponding
variance inflation. However, this method does not account for the fact that
the outlying observations are typically not correlated with the remainder of
the observations. That is, the corresponding entries of the original
covariance matrix are no longer valid if there are outliers present.
More recently the authors in [47] have proposed a method that directly
operates on the inverse of the covariance matrix. This method also depends
on successfully identifying outliers and adjusting entries of the inverse of
the covariance matrix. While it may have performed well for datasets
commonly encountered in geodetic applications, it does not work well with
general outlier models. In particular, as shown in Section 4.4, this method
performs poorly in outlier models caused by hardware errors. We suspect
that directly operating on the inverse covariance matrix does not
adequately capture the independence between outliers and nominal noise
and, therefore, results in poor performance.
4.3.2 Robust generalized least squares
In (4.2), if the noise vector, η, does not contain any outliers, then the
optimal estimate is the solution to the following problem [33]:
\[ \hat{x} = \arg\min_x \; r^T \Sigma^{-1} r \tag{4.4} \]
where the residuals, r, are given by
\[ r = y - Ax \tag{4.5} \]
The solution to this problem can be written as a linear function of the
observations as shown below [33]:
\[ \hat{x} = (A^T \Sigma^{-1} A)^{-1} A^T \Sigma^{-1} y \tag{4.6} \]
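The closed-form solution (4.6) can be evaluated without forming Σ⁻¹ explicitly, as in this minimal sketch. The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def gls_estimate(A, Sigma, y):
    """Generalized least squares: (A^T Sigma^-1 A)^-1 A^T Sigma^-1 y."""
    Si_A = np.linalg.solve(Sigma, A)    # Sigma^-1 A
    Si_y = np.linalg.solve(Sigma, y)    # Sigma^-1 y
    return np.linalg.solve(A.T @ Si_A, A.T @ Si_y)

# Scalar parameter observed four times with uncorrelated unit-variance noise.
A = np.ones((4, 1))
Sigma = np.eye(4)
y = np.array([1.0, 2.0, 3.0, 6.0])
x_hat = gls_estimate(A, Sigma, y)
print(x_hat)
```

With an identity covariance and a column of ones for A, the estimate reduces to the sample mean, which makes its sensitivity to a single large observation easy to see.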
The cost function in (4.4) is unduly influenced by outliers. It needs to be
made robust by appropriately down-weighting the influence of
large-magnitude outliers. We will develop the idea of pre-multiplying the
residual vector by a nonlinear weight matrix, as done in [34], for this
correlated situation. By so doing, the values of residuals greater in
magnitude than a threshold are diminished to control their influence. But
merely down-weighting outlying residuals will not be sufficient. The clipped
residuals do not conform to the correlation model assumptions specified by
Σ. A successful robust approach will combine elements of robustifying the
residual, as done by Huber, and robustifying the covariance matrix.
To test for outliers, we propose to compare the magnitude of each entry
of the residual vector against a clip threshold. This threshold may be
chosen either empirically or by fixing the false-alarm rate for a statistical
test as done in [48]. It is common to choose this threshold as 3σ, where σ is
the standard deviation of the residuals.
Initially, the residuals are tested to determine if any are outliers. After
this initial step, robustification proceeds in two steps. In the first step, the
influence of the outlying residuals is limited by means of the weights defined
in the diagonal matrix V . The diagonal entries of this matrix are given by
\[ V_{ii} = \begin{cases} 1, & \text{if } |r_i| \le k \\ \dfrac{\sqrt{k}}{\sqrt{|r_i|}}, & \text{else} \end{cases} \tag{4.7} \]
for an appropriately chosen threshold, k. In the second step, we
down-weight the cross-correlation terms corresponding to the outlying
residuals; that is, we multiply the rows and columns of the covariance
matrix, Σ, by a weight that is inversely proportional to the magnitude of
the corresponding outlying residual. While residuals that are truly outliers
will be uncorrelated with the remainder of the observations, this may not
be the case with residuals that are falsely declared to be outlying on the
basis of a statistical test. Therefore, we cannot simply zero-out the entries
corresponding to the residuals that were declared to be outliers.
[Figure 4.2 plots: three cost-function panels, (a), (b), and (c).]
Figure 4.2: The robustified cost function (a) has a minimum closer to the
underlying parameter (10) than (b) the Huber cost function that ignores
correlation information, and (c) the traditional nonrobust least-squares cost
function.
Algorithm 1 shows a listing of this two-step procedure.
Algorithm 1 Robust Estimation for Correlated Observations
  V_ii ← 1 {i = 1, · · · , n}
  Σ̃ ← Σ
  for i ← 1 to n do
    if |r_i| > k then
      V_ii ← √(k/|r_i|)
      Σ̃_ij ← Σ_ij · k/|r_i| {j = 1, · · · , n}
      Σ̃_ji ← Σ_ji · k/|r_i| {j = 1, · · · , n}
      Σ̃_ii ← Σ_ii
    end if
  end for
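Algorithm 1 can be transcribed into Python roughly as follows. This is a literal reading of the listing under two assumptions: entries of the robustified matrix that are not touched by the loop keep their original values, and when two outlying residuals share a cross term, the later row/column update overwrites the earlier one. The function names are hypothetical.

```python
import numpy as np

def robustify(r, Sigma, k):
    """Return the weight matrix V and the robustified covariance for residuals r."""
    r = np.asarray(r, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    n = r.size
    V = np.eye(n)
    Sigma_t = Sigma.copy()              # assumed initialization
    for i in range(n):
        a = abs(r[i])
        if a > k:
            w = k / a
            V[i, i] = np.sqrt(w)                 # limit influence of residual i
            Sigma_t[i, :] = Sigma[i, :] * w      # down-weight row i
            Sigma_t[:, i] = Sigma[:, i] * w      # down-weight column i
            Sigma_t[i, i] = Sigma[i, i]          # keep the variance itself
    return V, Sigma_t

def robust_cost(r, Sigma, k):
    """Evaluate (V r)^T Sigma_t^-1 (V r) for the given residuals."""
    V, Sigma_t = robustify(r, Sigma, k)
    vr = V @ np.asarray(r, dtype=float)
    return float(vr @ np.linalg.solve(Sigma_t, vr))

Sigma = np.array([[1.0, 0.99, 0.0],
                  [0.99, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
r = np.array([6.0, 0.1, 0.2])           # first residual exceeds k = 3
V, Sigma_t = robustify(r, Sigma, k=3.0)
print(V.diagonal(), Sigma_t, robust_cost(r, Sigma, k=3.0))
```

Here the first residual is twice the threshold, so its weight becomes √0.5 and its cross-correlation with the second residual is halved, while its variance is left unchanged.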
This results in the following minimization problem:
\[ \min_x \; (r^T V^T)\, \tilde{\Sigma}^{-1} (V r) \tag{4.8} \]
where Σ˜ is the robustified covariance matrix. Figure 4.2 shows an example
instance of this cost function. It is compared against the cost function
proposed by Huber for independent observations and the traditional
least-squares cost function. The observation vector, y, for this example is
listed in Table 4.1. Here, the first column, labeled “y,” contains the
observations that include outliers. The second column, labeled “yclean,”
shows the observations without outliers. The additive noise here is drawn
from a multivariate Gaussian distribution. The third column, titled “Gross
Error,” lists the outliers that were added on an element-by-element basis to
the entries of the second column to yield the first column.
Table 4.1: Example observation with outliers. The second, third, and
eighth observations contain gross outliers.
y          yclean     Gross Error
12.702     12.702     0
18.199     -7.5558    25.755
42.494     12.494     30
-7.3985    -7.3985    0
12.721     12.721     0
-7.4625    -7.4625    0
12.525     12.525     0
-24.436    -7.2965    -17.139
12.592     12.592     0
-9.2515    -9.2515    0
11.231     11.231     0
-9.6951    -9.6951    0
11.078     11.078     0
-9.2348    -9.2348    0
8.6804     8.6804     0
-10.509    -10.509    0
A note on other published methods: The technique developed here
bears a slight resemblance to the methods in [46, 47]. The important
difference is that, unlike their methods, our method robustifies the
covariance matrix and not its inverse. The quantity Σ−1 in (4.4) can be
viewed as a weight matrix that is being applied to the residuals. Each row
of this matrix contains a very specific set of interrelated weights. As an
example, consider the following matrix:
\[ \Sigma = \begin{bmatrix} 1.00 & 0.99 & 0.00 \\ 0.99 & 1.00 & 0.00 \\ 0.00 & 0.00 & 1.00 \end{bmatrix} \]
The weight matrix, Σ−1, is given by
\[ \Sigma^{-1} = \begin{bmatrix} 50.25 & -49.75 & 0.00 \\ -49.75 & 50.25 & 0.00 \\ 0.00 & 0.00 & 1.00 \end{bmatrix} \]
The weights $(\Sigma^{-1})_{11}$ and $(\Sigma^{-1})_{12}$ account for the high correlation between the first
two residuals by assigning a large weight to their difference. Any approach
that directly modifies the rows and columns of Σ−1 will disturb this
interrelationship. Therefore, outlying residuals must be accounted for by
appropriately adjusting the entries of the covariance matrix. This
robustified covariance matrix must then be inverted to produce the robust
weight matrix. This is the key intuition that motivates our technique. We
believe that this is also the reason why the methods suggested by [46, 47]
that attempt to directly robustify the weight matrix fail to produce good
results.
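As a small numerical check of this intuition, the following Python sketch inverts the highly correlated 2×2 block of the example Σ above and rewrites the resulting quadratic form in terms of the sum and the difference of the two residuals. The function names and the residual values used to exercise them are illustrative assumptions and are not part of the original method.

```python
def inverse_sym_2x2(var, cov):
    """Closed-form inverse of [[var, cov], [cov, var]].

    Returns the (diagonal, off-diagonal) entries of the inverse.
    """
    det = var * var - cov * cov
    return var / det, -cov / det


def quadratic_form(r1, r2, rho):
    """r^T Sigma^{-1} r for the unit-variance 2x2 block with correlation rho."""
    diag, off = inverse_sym_2x2(1.0, rho)
    return diag * r1 * r1 + 2.0 * off * r1 * r2 + diag * r2 * r2


def sum_difference_form(r1, r2, rho):
    """The same quadratic form, written as weights on (r1 + r2) and (r1 - r2)."""
    sum_weight = 1.0 / (2.0 * (1.0 + rho))
    diff_weight = 1.0 / (2.0 * (1.0 - rho))
    return sum_weight * (r1 + r2) ** 2 + diff_weight * (r1 - r2) ** 2
```

With a correlation of 0.99, the difference of the two residuals is weighted by 1/(2(1 − 0.99)) = 50, whereas their sum is weighted by only about 0.25, which is the interrelationship that a direct modification of the rows of Σ−1 would disturb.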
4.3.3 Computationally efficient methods
The robust cost function shown in Figure 4.2 is nonsmooth and nonconvex.
Finding its minimum usually calls for a global search over the parameter
space. In this subsection, we will develop approximate methods for solving
(4.8) that are practical to implement.
If we can obtain a good initial estimate for the unknown parameter, x,
we can robustify the covariance matrix just once and use it as an
approximation for finding a minimum around the initial estimate. That is,
we can solve the following minimization problem:
$$\min f(x) = r^T V^T \hat{\Sigma}^{-1} V r \qquad (4.9)$$
where Σˆ is the approximate robust covariance matrix that is obtained from
the initial estimate. Figure 4.3 shows a comparison of this approximate cost
function with the robust cost function for the example data set in Table
4.1. Notice that it is minimized at approximately the same location.
Figure 4.3: Robustified cost function (the robust cost, its minimum, and the
approximate cost). The initial estimate (e.g., sample median) is
used to robustify the covariance matrix. This matrix is used
as a fixed approximation to the actual robust covariance
matrix, Σ˜.
We propose to use Huber’s robust estimate [34] (ignoring the covariance
information) as an initial estimate. The one-step methods described in [34]
may provide a computationally efficient solution to this problem.
Additionally, the sample median, which is another robust order statistic,
may also be used as a further approximation of the Huber estimate for such
problems.
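The median-based initial estimate can be illustrated on the observations of Table 4.1. The sketch below is only a hedged example: it assumes that the observations follow the alternating-sign model A = [1, −1, 1, · · · , −1]T used later in Section 4.3.4 (the model behind Table 4.1 is not restated here), so each observation is sign-corrected before the median is taken. The function names are hypothetical.

```python
from statistics import median

# Observations y from Table 4.1 (the second, third, and eighth contain outliers).
Y = [12.702, 18.199, 42.494, -7.3985, 12.721, -7.4625, 12.525, -24.436,
     12.592, -9.2515, 11.231, -9.6951, 11.078, -9.2348, 8.6804, -10.509]

# ASSUMPTION: alternating-sign observation model A = [1, -1, 1, ..., -1]^T.
A = [1.0 if i % 2 == 0 else -1.0 for i in range(len(Y))]


def median_initial_estimate(y, a):
    """Sample median of the sign-corrected observations a_i * y_i.

    Only meaningful when every a_i is +1 or -1 (assumed here).
    """
    return median(ai * yi for ai, yi in zip(a, y))


def least_squares_estimate(y, a):
    """Ordinary least-squares estimate of the scalar x in y = a * x + noise."""
    return sum(ai * yi for ai, yi in zip(a, y)) / sum(ai * ai for ai in a)
```

Calling `median_initial_estimate(Y, A)` yields a starting point for robustifying the covariance matrix, while `least_squares_estimate(Y, A)` gives the non-robust reference for comparison.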
Viewing (4.9) as the following nonlinear least-squares problem allows
us to adopt many well-known iterative algorithms:
$$\min f(x) = \|g(x)\|^2 \qquad (4.10)$$
where
$$g(x) = \hat{\Sigma}^{-1/2} V r$$
In particular, we derive the following Gauss-Newton iteration [49] for this
problem:
$$x_{k+1} = x_k + \alpha_k \left(W^T \hat{\Sigma}^{-1} W\right)^{-1} \left(W^T \hat{\Sigma}^{-1}\right) (V r) \qquad (4.11)$$
where x0 is an initial estimate and αk is a step-size chosen by one of the
known methods [49] (e.g., minimization rule or the limited minimization
rule). The matrix W is given by
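$$W = V A + \tilde{V} r \qquad (4.12)$$
The matrix V˜ is a diagonal matrix consisting of the derivatives of the
diagonal of the matrix V , that is,
$$\tilde{V}_{ii} = \begin{cases} 0, & \text{if } |r_i| \le k \\ \dfrac{\sqrt{k}}{2\sqrt{|r_i|}}, & \text{else} \end{cases}$$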
where k is the previously defined threshold. The zero terms occur because
of the unity entries corresponding to nonoutliers in the diagonal of the
matrix V .
Under the assumption stated below, this practical iterative method can
indeed be shown to converge to a stationary point of (4.9). This
assumption is required to guarantee that every iteration takes a sufficiently
large step along a descent direction.
Assumption 1. The approximate covariance matrix, Σˆ, of the estimation
noise is positive definite.
Proposition 1 states that this iterative method converges to a
stationary point of (4.9).
Proposition 1. Let {xk} be a sequence generated by the iterative method
(4.11). Then, under Assumption 1, every limit point of {xk} is a stationary
point.
Proof. The general form of gradient methods is as follows [49]:
$$x_{k+1} = x_k - \alpha_k D_k \nabla f(x_k) \qquad (4.13)$$
where Dk is a positive definite symmetric matrix. The descent condition is
given by
$$\nabla f(x_k)^T D_k \nabla f(x_k) > 0$$
For the Gauss-Newton method,
$$D_k = \left[\nabla g(x_k) \nabla g(x_k)^T\right]^{-1} \qquad (4.14)$$
The inverse in (4.14) exists because ∇g(xk) in (4.10) has full rank.
$$\nabla g(x_k) = [V A + \tilde{V} r]^T \left(\Sigma^{-1/2}\right)^T \qquad (4.15)$$
$$D_k = \left[(V A + \tilde{V} r)^T \Sigma^{-1} (V A + \tilde{V} r)\right]^{-1} \qquad (4.16)$$
By the positive definiteness assumption on Σ and the fact that
(V A + V˜ r) ≠ 0, the descent condition is met. Therefore, the sequence
{dk} = −Dk∇f(xk) is gradient related to {xk}.
The gradient of the cost function in (4.10) is given by
$$\nabla f(x_k) = (V A + \tilde{V} r)^T \Sigma^{-1} V r \qquad (4.17)$$
The result follows from Proposition 1.2.1 of [49].
If the eigenvalues of Dk are bounded above and below [49], then a
constant step-size may be chosen for convergence.
With a good initial estimate, we may ignore the terms containing V˜ in
(4.12). This results in the following simplified iteration:
$$x_{k+1} = x_k + \alpha_k \left(A^T V^T \Sigma^{-1} V A\right)^{-1} \left(A^T V^T \Sigma^{-1} V\right) r \qquad (4.18)$$
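A minimal Python sketch of one step of the simplified iteration (4.18) for a scalar parameter x is given below. It is written under stated assumptions: the residual is taken to be r = y − Ax, the diagonal of V and the (robustified) covariance matrix are supplied by the caller, and the covariance matrix is symmetric. The helper names `solve` and `gauss_newton_step` are hypothetical, and the linear solver is a plain Gaussian elimination chosen only to keep the example self-contained.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def gauss_newton_step(x, y, a, v_diag, sigma, alpha=1.0):
    """One step of (4.18) for scalar x.

    ASSUMPTIONS: r = y - a * x, V = diag(v_diag), sigma is symmetric.
    """
    r = [yi - ai * x for yi, ai in zip(y, a)]
    va = [vi * ai for vi, ai in zip(v_diag, a)]   # V A
    vr = [vi * ri for vi, ri in zip(v_diag, r)]   # V r
    w = solve(sigma, va)                          # Sigma^{-1} V A
    numerator = sum(wi * vri for wi, vri in zip(w, vr))    # A^T V^T Sigma^{-1} V r
    denominator = sum(wi * vai for wi, vai in zip(w, va))  # A^T V^T Sigma^{-1} V A
    return x + alpha * numerator / denominator
```

With V equal to the identity and a unit step size, a single step reduces to the ordinary generalized least-squares estimate, which is a convenient sanity check.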
4.3.4 Simulation and Results
In order to validate the methods developed in this section, we simulated
test data consisting of 1000 runs. Each run contained 16 observations of a
location parameter that was chosen to be 10. The A matrix in the system
model (4.2) was chosen to be
A = [1,−1, 1, · · · ,−1]T
The observations were contaminated by noise drawn from a mixture model
that was correlated Gaussian with probability 0.8, and a large-variance
Gaussian distribution shown in Figure 4.4, with probability 0.2. The
nominal correlated noise was obtained by multiplying a N (0, 1) Gaussian
sequence by the square root of a known covariance matrix. This matrix is
listed in Table 4.2.
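The noise generation described above can be sketched as follows. This is an illustration under assumptions: the Cholesky factor is used as one valid matrix square root of the covariance matrix, the gross errors are added element by element on top of the nominal correlated noise (as in Table 4.1), and the standard deviation of the large-variance component (`outlier_std`) is a hypothetical value, since the text does not state it.

```python
import math
import random


def table_4_2_covariance(n_corr=9, n_total=16, rho=0.99):
    """Covariance of Table 4.2: observations 1-9 correlated, 10-16 uncorrelated."""
    return [[1.0 if i == j else (rho if i < n_corr and j < n_corr else 0.0)
             for j in range(n_total)] for i in range(n_total)]


def cholesky(sigma):
    """Lower-triangular L with L * L^T = sigma (sigma must be positive definite)."""
    n = len(sigma)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                lower[i][j] = (sigma[i][j] - s) / lower[j][j]
    return lower


def draw_noise(lower, rng, outlier_prob=0.2, outlier_std=30.0):
    """Nominal correlated Gaussian noise plus sporadic large-variance errors.

    outlier_std is a HYPOTHETICAL value chosen for illustration.
    """
    n = len(lower)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    nominal = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    gross = [rng.gauss(0.0, outlier_std) if rng.random() < outlier_prob else 0.0
             for _ in range(n)]
    return [a + g for a, g in zip(nominal, gross)]
```

A run of the simulation would then form each observation vector as A times the location parameter plus `draw_noise(cholesky(table_4_2_covariance()), random.Random(seed))`.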
Figure 4.4: Large-variance Gaussian errors (probability versus error
magnitude). Random variables drawn from a large-variance Gaussian
distribution represent outliers. With probability 0.2, an outlier is drawn
according to this model.
Σ
=
                           1.0
0
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
1.
00
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
0.
99
1.
00
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
0.
99
0.
99
1.
00
0.
99
0.
99
0.
99
0.
99
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
0.
99
0.
99
0.
99
1.
00
0.
99
0.
99
0.
99
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
0.
99
0.
99
0.
99
0.
99
1.
00
0.
99
0.
99
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
1.
00
0.
99
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
1.
00
0.
99
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
0.
99
1.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
1.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
1.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
1.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
1.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
1.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
1.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
0.
00
1.
00
                           
T
ab
le
4.
2:
T
h
e
co
rr
el
at
io
n
m
at
ri
x
w
as
ch
os
en
to
in
cl
u
d
e
so
m
e
h
ig
h
ly
co
rr
el
at
ed
ob
se
rv
at
io
n
s
(i
.e
.,
1–
9)
an
d
so
m
e
ob
se
rv
at
io
n
s
th
at
ar
e
u
n
co
rr
el
at
ed
(i
.e
.,
10
–1
6)
.
T
h
is
w
as
m
ea
n
t
to
st
u
d
y
th
e
eff
ec
ti
ve
n
es
s
of
ou
r
al
go
ri
th
m
in
b
ot
h
ex
tr
em
e
sc
en
ar
io
s.
51
Table 4.3: The variance of the estimate for the different methods
Method Variance
One Step with Huber Initial 0.081
One Step with Median Initial 0.096
Full Search 0.118
Traditional Huber 0.234
Sample Median 0.247
Yang 2.06
The variance achieved over these 1000 runs was compared across the
following methods:
1. Full Search: This brute-force method searches over the entire
parameter space for the estimate that minimizes the robust cost
function.
2. Iterative with Huber Initial: This method implements the iterative
method developed in Section 4.3.3.
3. One Step with Huber Initial: In this method, the iterative method is
truncated to just one step for the sake of computational efficiency.
4. One Step with Median Initial: In this final approximation, the Huber
estimate used in the iterative method is replaced with a median. The
results in [35] indicated that the median is a powerful robust statistic
that can be viewed as an approximation of the Huber estimate.
The iterative method that used the Huber initial estimate was found to
converge typically in fewer than two iterations. A histogram of the number
of iterations to convergence is shown in Figure 4.5. This fast convergence
motivated the one-step methods that are computationally efficient.
Table 4.3 gives a comparison of the estimate variance achieved by the
different methods. The cost function contains multiple local minima, but
the median-based and Huber-based initialization methods identify the
correct neighborhood of the minimum. Taking a Gauss-Newton iteration on
the cost function with this initial estimate produces a lower variance as
shown in Table 4.3.
Figure 4.5: Histogram of the number of iterations to convergence (number of
trials versus number of iterations). Convergence to within 1e-3 of the result
of the iterative method is usually in fewer than two steps. The trial counts
on top of the bars are included for readability.
Since the one-step method with median as an initial estimate is
computationally most efficient and yields good estimate variance, we choose
this algorithm for on-chip implementations.
4.4 Estimation-Theoretic Design
In this section, we will return to the wireless communication application
that was used as motivation for this work in Section 4.2. We assume that
the channel is known and that it is described by a matrix, H. The entries
of H contain the multiplicative coefficients for each transmit-receive pair,
and are assumed to be fixed and deterministic over a symbol duration. The
received signal can be mathematically represented as
$$y[m] = h_k x_k[m] + \sum_{i \neq k} h_i x_i[m] + w \qquad (4.19)$$
where y[m] is the received vector in time slot m, hk is the k-th column of
the channel matrix, xk[m] is the transmission of the k-th user in the
m-th time slot, and w is the additive thermal noise vector.
The total system cost function for this application is the mean squared
error in estimating the transmitted symbols.
The architectural space consists of N identical processors because of the
nature of commonly used multiantenna receivers. The architectural
constraint set, A, is the discrete set of voltages, V, at which the N identical
processing elements may be operated.
4.5 Solution Method
For a given time slot m, by combining thermal noise and interference into a
compound noise term, we have
y = hx+ z (4.20)
where z is complex colored noise with an invertible covariance matrix Σ, h
is a deterministic vector, and x is the transmitted symbol. In this model, z
and x are assumed to be uncorrelated. This model is discussed in
significant detail in [31] (Chapter 8). With this setup, the covariance
matrix, Σ, is given by
$$\Sigma = N_0 I_{n_r} + \sum_{i \neq k}^{n_t} P_i h_i h_i^{*} \qquad (4.21)$$
where (·)∗ denotes the complex conjugate, N0/2 is the noise variance, Inr is
the identity matrix with nr rows, and Pi is the power associated with the
i-th data stream.
Figure 4.6: Multiple antenna receiver provides r observations of the desired
symbol. The weights {w1, · · · , wr} are computed by accounting for the
correlation among the noise terms. (Block diagram: receive branches Rx 1
through r, multiplied by the weights w1 through wr, summed, and fed to a
symbol slicer.)
Similar to (4.6), we can derive the linear MMSE estimate, xˆMMSE, as
$$\hat{x}_{MMSE} = (h^T \Sigma^{-1} h)^{-1} h^T \Sigma^{-1} y \qquad (4.22)$$
This receiver structure is shown in Figure 4.6. Here the weights
{w1, · · · , wr} are computed from (4.22). The outputs of the PN correlator
are multiplied by these weights and the results are summed. This final
output is fed to a slicer that detects the transmitted symbol.
We will now address the problem of overcoming hardware errors in this
receiver. The r PN correlators can be seen as providing the observations
from which the desired transmitted symbol is estimated. We adopt
supply-voltage overscaling at the PN correlators as an error-prone
power-reduction technique. It is important to note that the techniques
presented in this chapter are generally applicable to timing errors caused by
other phenomena such as process variations. The degree of overscaling is
measured by the voltage-overscaling factor, k_VOS. It is the ratio of the
supply voltage to the critical voltage (i.e., the supply voltage needed to
meet the critical path delay). Voltage overscaling can result in hardware
errors caused by timing violations. Therefore, the noise term z will
additionally be contaminated by sporadic large-variance hardware errors.
We assume that hardware errors occur independently across the correlators
[35]. With this setup, we use the methods developed in Section 4.3 to
robustify the weights {w1, · · · , wr}.
4.5.1 Fusion block architecture
Figure 4.7 shows an architectural block diagram of a fusion block that
implements the algorithm developed in Section 4.3. An efficient median
that was used in [35] provides the initial estimate. This estimate is used to
compute the initial residuals. The residuals, {r1, · · · , rn}, are then tested
for outliers by comparing their absolute values against a threshold. The
results of this test, stored in the indicator variables, {i1, · · · , ir}, are used to
clip the residuals and also to robustify the covariance matrix. In the final
step, an update is computed by means of the least squares estimate block
that is traditionally used in this application.
4.5.2 Simulation setup
Our simulation system consisted of nt = 5 simultaneous transmissions and
nr = 16 receive antennas. The objective was to estimate the symbols
transmitted by User 1 in the presence of interference from the other users
and thermal noise. The thermal noise was modeled as Gaussian random
variables added at chip-rate with a variance of 0.03. The entries of the
channel matrix, H, are complex Gaussian random variables generated as
follows:
$$H_{ij} = \frac{1}{\sqrt{2}}\,(H_{re} + jH_{im}) \qquad (4.23)$$
where Hre and Him are independent Gaussian random variables with mean
zero and variance 1. The transmitted data symbols were modulated using
binary phase shift keying (BPSK) modulation. The information symbols
were then multiplied by a length-64 PN-code sequence that is unique to
each user. The received signal was assumed to be quantized to 8 bits.
Figure 4.7: In this fusion block, the median is used as a robust initial
estimate. An update to the median is computed by robustifying the
observations and robustifying the covariance matrix. (Block diagram: median,
residuals, outlier indicators, covariance matrix Σ, traditional LS estimate,
and the output Xrobust.)
Figure 4.8: With increasing voltage overscaling (top-left to bottom-right),
the timing errors become more frequent. The bit errors also begin to occur
in locations with lower significance. (Four histograms of probability of
error versus error magnitude, for k_VOS = 0.75, 0.67, 0.58, and 0.50.)
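The channel-entry generation in (4.23) can be sketched in a few lines of Python. The function name and the use of a seeded `random.Random` generator are assumptions made for reproducibility of the illustration; the dimensions follow the simulation setup (nr rows and nt columns, so that hk is the k-th column).

```python
import math
import random


def channel_matrix(n_r, n_t, rng):
    """n_r x n_t matrix with entries (H_re + j * H_im) / sqrt(2), as in (4.23).

    H_re and H_im are independent zero-mean, unit-variance Gaussians, so each
    entry has unit average power.
    """
    scale = 1.0 / math.sqrt(2.0)
    return [[scale * complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
             for _ in range(n_t)] for _ in range(n_r)]


def column(matrix, k):
    """k-th column (0-based) of the channel matrix, i.e., the vector h_k."""
    return [row[k] for row in matrix]
```

For the setup described above, `channel_matrix(16, 5, random.Random(0))` would produce one channel realization, and `column(H, 0)` would give the channel vector of User 1.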
Gate delays for various gates in the PN correlators were obtained from
HSPICE models for an IBM 130 nm CMOS process. Verilog HDL models
were used to obtain overscaling-related timing errors for random input
sequences. Figure 4.8 gives histograms of timing errors that occurred in the
PN correlators. In obtaining these, the clock frequency was overscaled as a
surrogate for voltage overscaling. Notice that with greater overscaling, the
large-variance hardware errors occur with higher probability. The actual
scaling-related hardware errors obtained here bear close resemblance to the
software models used in Section 4.3.
Figure 4.9: A comparison of the various robust techniques for correlated
observations. The technique developed here outperforms other known
methods by a factor of 18 in some voltage-overscaling regimes. (MSE versus
k_VOS for the curves MMSE, One-Step Median Initial, Huber, Yang, and
Non_Robust_Cov.)
A MATLAB simulation was used to implement the fusion algorithm.
4.6 Results: Symbol Estimation
Figure 4.9 compares the mean squared error in estimating the symbols of
User 1 using several techniques. The conventional MMSE estimate (labeled
“MMSE”) begins to deteriorate as soon as the supply voltage is scaled.
Merely robustifying the observations and not robustifying the covariance
matrix also yields poor MSE performance at all values of k_VOS, as shown
by the curve labeled “Non_Robust_Cov.” The performance of the Huber
algorithm that ignores the covariance information is labeled “Huber.” The
performance of the geodetic methods in [47] is shown by the curve labeled
“Yang.” This method exhibits some robustness at higher levels of voltage
overscaling, but the performance does not improve for larger values of k_VOS,
where the errors are fairly uncommon. In contrast, the algorithm developed
in Section 4.3 (labeled “One-Step Median Initial”) is able to perform well
over the entire range of k_VOS and provides an MSE improvement by a
factor of 18 in some k_VOS regimes.
4.7 Future Work
We have developed practical methods of estimating a parameter based on
linear observations that contain a mixture of correlated noise and sporadic
outliers. These methods are motivated by intuition. A theoretical
foundation for the methods developed here still remains to be developed.
Our future research will seek to find optimal influence functions for a given
covariance model.
We have developed four fusion algorithms that can be ordered according
to their computational complexity (and, therefore, power demand) and mean
squared-error performance. For a given application scenario, the particular
choice of fusion algorithm will depend on the technology node and
architectural choices. In the future, we will investigate these power
trade-offs through circuit-level simulations.
4.8 Conclusion
This work has addressed the reliability issue in aggressively scaled SoCs by
making valuable connections to robust parameter estimation. In so doing,
we have identified an important generalization of robust estimation that
has remained unaddressed. The methods presented here allow observations
to be correlated according to arbitrary models. Exploiting correlation
information improved the MSE by a factor of 18 in a multiantenna wireless
communication receiver that is subject to hardware errors.
Dependent observations can occur in many signal processing applications.
Commonly used measurement devices are often prone to failures that could
result in outliers in the observations. Sensor networks are one such
example. The results here have shown that such systems can benefit from
exploiting the correlation information and need not be limited by the
independence assumption of traditional robust estimation theory.
Consequently, the methods presented here will be applicable in many new
signal-processing systems.
CHAPTER 5
CLASS III: HETEROGENEOUS SYSTEMS
The previous chapters addressed parallel systems. While parallelization is a
commonly used throughput-enhancing algorithm transform, some SoCs
may not be easily decomposable. This chapter presents estimation-theoretic
design methods for heterogeneous systems. This class of systems consists of
subsystems of varying power/reliability behaviors. The overall system
performance as a function of the individual subsystem performance is
assumed to be known.
5.1 Class Overview
In this class of applications, the overall computation can be divided into
smaller subcomputations, each of which is processed by one or more
processing elements chosen from the set {PE1, · · · , PEm}. The PEs
produce outputs of varying degrees of error tolerance. Figure 5.1 shows a
system from this class. As an example, consider a wireless video
transmission system. The raw video input is fed into a video encoder, the
video encoder is linked to a channel encoder, and the channel encoder is
linked to a wireless transmitter. The subcomputation results may be
subject to different application cost functions:
$$\{C_1(\theta_1, \hat{\theta}_1), \cdots, C_m(\theta_m, \hat{\theta}_m)\}$$
The overall output is a known function of the results of the
subcomputations. The application cost function of the overall computation
is a known function, f(·), of the individual cost functions, i.e.,
$$C(\theta, \hat{\theta}) = f(\{C_1(\theta_1, \hat{\theta}_1), \cdots, C_m(\theta_m, \hat{\theta}_m)\})$$
Figure 5.1: Class III systems contain heterogeneous subsystems.
Computational errors in the different subsystems affect the overall system
performance by different amounts. (Diagram: subcomputations A, B, and C with
their outputs; the overall system is differently sensitive to errors in A, B,
and C, and B tolerates some errors in A.)
The performance of the overall system is sensitive to the performance of the
individual components by varying amounts, and hence, the performance of
some PEs can be worsened without significant degradation in the overall
system performance. That is,
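$$\frac{\partial f(\{C_1(\theta_1, \hat{\theta}_1), \cdots, C_m(\theta_m, \hat{\theta}_m)\})}{\partial C_i(\theta_i, \hat{\theta}_i)}$$
is small for one or more i.
The system power consumption, Psys, is given by
$$P_{sys} = \sum_{i=1}^{m} P_{PE_i}$$
where PPEi is the power consumed by the i-th PE.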
Figure 5.2: This wireless video communication system consists of a video
encoder (A), a Reed-Solomon channel encoder (B), and a wireless
transmitter (C). These subsystems exhibit different power/performance
trade-offs, and computational errors in these subsystems impact the
overall system performance by different amounts.
5.2 Application: Mobile Video Communication Systems
With constantly improving resolution of video captured by battery-limited
mobile devices, there is a need for highly efficient video source coders.
Modern video encoding standards such as the H.264/AVC have made
tremendous advances in compression, but only at the expense of greatly
increased computation. Joint source coding and power-management
techniques, such as those in [50], have been shown to optimize the
communication power consumption in such applications. By adopting
aggressive power-reduction techniques at the video encoder and
compensating for any resulting computational errors, mobile video systems
can gain additional power reduction. For example, [51] studies the impact
of voltage-overscaling errors on the coding efficiency of some
motion-estimation architectures. The estimation-theoretic framework allows
optimal design for mobile video systems that matches computational and
communication power consumption to application demands.
5.3 System and Noise Model
Our model of a mobile video communication system consists of a video
encoder that compresses the captured video, a channel coder that adds
redundancy to the compressed stream, and a wireless transmitter.
Figure 5.2 gives a high-level description of this system. The power
consumed by the video communication system is the sum of the power
consumed by each of these components. The performance and power
consumption of these components are interlinked so that adopting
power-reduction schemes in any of them affects the performance and power
consumption of the others. Most video-encoder power-reduction techniques
can hurt compression efficiency; for example, a smaller power budget may
force the designer to adopt an inferior motion-search technique. Voltage
overscaling is one mechanism that could cause hardware errors that may
also result in suboptimal motion-search results. Timing errors could
generally be caused by other phenomena such as process variations. The
methods presented in this chapter apply to timing errors without being
particular to any phenomenon; voltage overscaling is simply used as one
source of errors. Hardware errors can increase the number of bits and,
therefore, the communication power consumption. Modern reconfigurable
channel coders can change the amount of introduced redundancy to adjust
their power consumption; but with less redundancy there is a need to
transmit at a higher power level in order to maintain communication
performance [52]. The power consumed by the wireless transmitter is a
function of the radiated power at the output, which controls the bit-error
rate at the receiver. The power consumed by either the computation or the
communication may dominate the total system power consumption
depending on the application conditions (communication channel, range, or
the input video sequence). Therefore, in order to minimize total power
consumption, we need to optimize over the design parameters of the video
encoder in conjunction with configuration parameters of the reconfigurable
channel encoder and transmitter.
5.3.1 Voltage-overscaled video encoder power model
Motion estimation (ME) has been reported to constitute up to 40%–70% of
the total power consumption of modern video encoders [53]. Operating the
ME engine at subcritical voltages can offer quadratic gains in power savings,
since dynamic power scales with the square of the supply voltage.
However, this can cause timing errors in the computation of the sum of
absolute differences (SAD) metric because of increased gate delay.
Simulations using the IBM 130 nm process model were used to characterize
the increase in gate delay of full-adders as a function of the
voltage-overscaling factor, k_VOS, in [54]. Timing errors can cause the
encoder to choose suboptimal motion vectors, thereby lessening its
compression efficiency. This trade-off between power reduction and
compression efficiency can be exploited in scenarios with low or moderate
communication demand to lower the overall power consumption of the
mobile video communication system.
The authors in [51] presented a VOS error analysis for motion estimation.
However, that model assumed that the errors were always non-positive,
suggesting that they did not use 2’s-complement number representation.
When using 2’s-complement representation, timing errors in the sign bit
can cause the error to be either positive or negative. This behavior was
confirmed in gate-level simulations of a hardware implementation of motion
estimation [55]. The analytical error models in [51] also depend on a
multistage fast matching technique where a partial SAD is computed in
each stage. We adopt a simpler full-matching technique, since our goal is
primarily to describe the usefulness of the estimation-theoretic design
framework in reducing system power consumption. Therefore, we use the
model of computational timing errors that occurs in the ripple-carry adders
presented in [32].
The commonly used serial architecture of the SAD computation kernel
uses 16-bit registers (inspection of SAD values for various test video
sequences shows that this bit-width is sufficient to prevent overflows)
and operates on the inputs (the current frame
macroblock and the reference frame macroblock) in a row-wise fashion [55].
This architecture uses ripple-carry adders to accumulate the SAD values of
a macroblock. Therefore, even with an increase in gate delay, computations
that result in SADs that are smaller than a certain threshold will remain
immune to timing errors. Since the SAD outputs at the pixel locations are
small in magnitude, they are considered to be error free, but the
accumulated SAD values at the end of each row will be subject to VOS
errors. The simulation data from [54] can be used to determine the number
of full-adder computations that can be completed at a given k_VOS. This
number determines the threshold above which the SADs are vulnerable to
VOS errors. We assume that when operated at the rated supply voltage
(k_VOS = 1), the SAD computations are error free (i.e., all 16 bits are
computed correctly). At a given k_VOS, the number of full-adder
computations that complete, NFA, is given by
$$N_{FA} = \left\lfloor \frac{16}{T_{FA}} \right\rfloor$$
where TFA is the normalized gate delay at k_VOS [54]. For 16-bit SAD
computations at a given k_VOS, B = (16 − NFA − 1) MSBs are prone to
timing errors. When using 2’s-complement notation, B does not include the
error-prone sign bit. Let the correct output, y, be represented (in
2’s-complement) by
$$y = -b_0 + \sum_{i=1}^{N} b_i 2^{-i}$$
Let the (B + 1)-bit error-prone output, yˆ, be denoted by
$$\hat{y} = -\hat{b}_0 + \sum_{i=1}^{N} \hat{b}_i 2^{-i}$$
Then the error, γ, is given by
$$\gamma = y - \hat{y} = (-e_0 2^{B} + e_1 2^{B-1} + \cdots + e_B) \cdot 2^{-B}$$
where ei are chosen with equal likelihood from the set {−1, 0, 1} [32].
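The error model above can be sketched in Python as follows. This is a hedged illustration rather than the simulation code used for the results: the function names are hypothetical, the normalized gate delay passed in is an arbitrary example value (the measured delays come from [54] and are not reproduced here), and the case B < 0 is interpreted as "no error-prone bits".

```python
import math
import random

WORD_BITS = 16  # bit-width of the SAD registers


def full_adders_completed(t_fa):
    """N_FA = floor(16 / T_FA), with T_FA the normalized gate delay."""
    return math.floor(WORD_BITS / t_fa)


def error_prone_msbs(n_fa):
    """B = 16 - N_FA - 1 (the sign bit is counted separately)."""
    return WORD_BITS - n_fa - 1


def sample_error(b, rng):
    """Draw gamma = (-e0*2^B + e1*2^(B-1) + ... + eB) * 2^(-B).

    Each e_i is drawn uniformly from {-1, 0, 1}. ASSUMPTION: b < 0 means that
    no bit (not even the sign bit) is error prone, so the error is zero.
    """
    if b < 0:
        return 0.0
    e = [rng.choice((-1, 0, 1)) for _ in range(b + 1)]
    total = -e[0] * 2 ** b + sum(e[i] * 2 ** (b - i) for i in range(1, b + 1))
    return total * 2.0 ** (-b)
```

For example, a normalized delay of 1.25 (a made-up value) gives `full_adders_completed(1.25)` equal to 12 and therefore B = 3, after which `sample_error(3, random.Random(0))` draws one error realization.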
We performed rate-distortion optimization by fixing the quantization
parameter to maintain the distortion at a fixed level. We additionally
assume that the entropy encoder presents a fixed power overhead that is
independent of bit rate. This model was implemented in software as a
modification of the H.264/AVC JM reference software [56]. Figure 5.3 shows
the rate increase as a function of the voltage overscaling factor, k_VOS.
Figure 5.3: Encoder output rate grows as the motion-estimation supply
voltage is scaled. (Encoder rate in bits versus the number of FA computations
completed within one clock cycle.)
5.3.2 Reed-Solomon channel encoder power model
The 180 nm, 1.8 V CMOS reconfigurable Reed-Solomon (RS) encoder
presented in [52] offers a quantifiable means of balancing redundancy and
power consumption. The energy consumed per codeword by this encoder
was shown to be
$$E_{rsenc}(t) = 2t\,(2^m - 1 - 2t)\,(E_{gfmult} + E_{gfadd})$$
where $E_{gfmult} = 3.7 \times 10^{-5} \times m^3$ (mW/MHz) and $E_{gfadd} = 3.3 \times 10^{-5} \times m$
(mW/MHz) are the energy consumed by a multiplier and an adder over
Galois field $GF(2^m)$, and t is the number of correctable symbol errors.
These figures for energy consumption were appropriately scaled to obtain
the corresponding figures for the RS encoder designed using the 130 nm, 1.2
V CMOS technology. In our communication protocol, each RS codeword is
transmitted as a separate packet.
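The energy expression above is simple enough to evaluate directly. The sketch below does so for the 180 nm figures quoted in the text; the function names are hypothetical, and the scaling to the 130 nm, 1.2 V technology is deliberately left out because the scaling factors are not given here.

```python
def gf_mult_energy(m):
    """Energy of one GF(2^m) multiplication, in mW/MHz (180 nm figure)."""
    return 3.7e-5 * m ** 3


def gf_add_energy(m):
    """Energy of one GF(2^m) addition, in mW/MHz (180 nm figure)."""
    return 3.3e-5 * m


def rs_encoder_energy(t, m):
    """Energy per codeword, 2t(2^m - 1 - 2t)(E_gfmult + E_gfadd), in mW/MHz.

    t is the number of correctable symbol errors; the codeword has 2^m - 1
    symbols, of which 2^m - 1 - 2t carry data.
    """
    n = 2 ** m - 1
    return 2 * t * (n - 2 * t) * (gf_mult_energy(m) + gf_add_energy(m))
```

Sweeping t for a fixed m with this function shows how the encoder energy varies with the amount of redundancy, which is the quantity traded against transmit power in the design problem.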
The performance of the channel encoder is measured by the packet error
rate suffered by the receiver. For a packet size of $(2^m - 1)$ symbols and an
error correction capability of t symbols, the packet error rate, ppe, is given by [52]
$$p_{pe} = \sum_{j=t+1}^{2^m-1} \binom{2^m-1}{j}\, p_{se}^{\,j}\, (1 - p_{se})^{(2^m-1-j)}$$
where pse is the symbol probability of error
$$p_{se} = 1 - (1 - p_{be})^m$$
and pbe is the probability of bit error, which is given by
$$p_{be} = Q\left(\sqrt{E_b/N_0}\right)$$
where Q(·) is the Gaussian Q-function (the tail probability of the standard
normal distribution), and (Eb/N0) is the ratio of the
signal energy per bit to the interference spectral density.
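The chain from Eb/N0 to the packet error rate can be sketched as follows. The function names are hypothetical, Eb/N0 is assumed to be given as a linear ratio (not in dB), and the direct binomial sum is only intended for moderate symbol sizes such as m ≤ 8, where the binomial coefficients still fit in a floating-point number.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_prob(eb_n0):
    """p_be = Q(sqrt(Eb/N0)), with Eb/N0 as a linear ratio (assumed)."""
    return q_function(math.sqrt(eb_n0))


def symbol_error_prob(p_be, m):
    """p_se = 1 - (1 - p_be)^m for m-bit symbols."""
    return 1.0 - (1.0 - p_be) ** m


def packet_error_rate(p_se, m, t):
    """Probability that more than t of the 2^m - 1 symbols are in error."""
    n = 2 ** m - 1
    return sum(math.comb(n, j) * p_se ** j * (1.0 - p_se) ** (n - j)
               for j in range(t + 1, n + 1))
```

A call such as `packet_error_rate(symbol_error_prob(bit_error_prob(eb_n0), 8), 8, t)` then gives the packet error rate for a given operating point and error-correction capability t.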
5.3.3 Transmission power model
The task of computing the distortion at the receiver can be simplified by
adopting a protocol that uses a feedback channel to request retransmission
if a packet is found to be in error. The power-amplifier design presented in
[52] provides a model for the power consumption of a wireless transmitter.
The energy per bit consumed by the power amplifier, Epa, was shown to be
$$E_{pa}(P_t) = \frac{P_t}{f_{bit}\,\eta(P_t)}$$
where Pt is the radiated power, η(Pt) is the power-added efficiency (PAE) of
the power amplifier, and fbit is the bit rate.
The performance of the transmitter can be measured by the
signal-to-noise ratio seen by the receiver, as given by
$$SNR = \frac{P_t G_t G_r}{L_p F k T_0 B}$$
where Gt and Gr are the transmit and receive antenna gains, respectively,
Lp is the path loss, F is the noise factor, k is Boltzmann’s constant, T0 is
the ambient temperature, and B is the system bandwidth. As the SNR
decreases, the receiver requests more retransmissions, and this increases the
communication overhead.
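Both transmitter expressions translate directly into code. In the sketch below the function and parameter names are hypothetical, all quantities are assumed to be linear (not dB) and in SI units, and the efficiency curve η(Pt) is passed in as a caller-supplied function because the measured curve from [52] is not reproduced here.

```python
BOLTZMANN = 1.380649e-23  # Boltzmann's constant in J/K


def pa_energy_per_bit(p_t, f_bit, efficiency):
    """E_pa(P_t) = P_t / (f_bit * eta(P_t)).

    p_t: radiated power in W; f_bit: bit rate in bit/s;
    efficiency: callable returning the power-added efficiency at p_t.
    """
    return p_t / (f_bit * efficiency(p_t))


def received_snr(p_t, g_t, g_r, path_loss, noise_factor, temperature_k,
                 bandwidth_hz):
    """SNR = P_t * G_t * G_r / (L_p * F * k * T_0 * B), all as linear ratios."""
    noise_power = noise_factor * BOLTZMANN * temperature_k * bandwidth_hz
    return p_t * g_t * g_r / (path_loss * noise_power)
```

For instance, `pa_energy_per_bit(0.1, 1e6, lambda p: 0.5)` evaluates the energy per bit for a constant 50% efficiency, which is an illustrative placeholder rather than a measured characteristic.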
5.4 Estimation-Theoretic Design
The system-level application cost function, C(θ, θˆ), is the distortion
between the encoded frame and the original frame.
The cost functions for the various subsystems are as follows:
• Video Encoder: C1(θ, θˆ) = bit rate.
• RS Channel Encoder: C2(θ, θˆ) = Packet error rate.
• Transmitter: C3(θ, θˆ) = −(Signal-to-noise ratio).
The elements of the architectural constraint set, A, each represent a different
choice for the supply voltage for motion estimation and the number of
correctable symbols, t, of the channel coder. Each choice in this set can be
configured using a continuously variable transmit power, Pt. The
architectural constraint set can be specified as
A = V × T
where V is the set of choices for the supply voltage of the motion
estimation, and T = {1, · · · , number of symbols in a packet}. The transmit
power, Pt, is a continuous parameter of the architectural constraint set,
λ = Pt.
The problem of optimally designing the mobile video communication
system can be stated as follows and corresponds to Problem 2.1
(minimizing power subject to a cost constraint) of our framework:
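$$
\begin{aligned}
[k_{VOS},\, t,\, P_t] = \arg\min\ & \{ P_{venc}(k_{VOS}) + P_{chenc}(t, f_{bit}) + P_{xmit}(P_t, L) \} \\
\text{subject to}\quad & D \le D_{Target} \\
& \mathcal{A} = \mathcal{V} \times \mathcal{T}
\end{aligned}
\tag{5.1}
$$
where kVOS is the voltage-overscaling factor at the video encoder, t is the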
number of correctable symbol errors at the reconfigurable channel encoder,
fbit is the bit rate, Pt is the radiated power at the transmitter, and L is the
range of communication. The objective function consists of the
video-encoder power consumption, Pvenc, the power consumed by the
channel encoder, Pchenc, and the power consumed by the transmitter, Pxmit.
In our model, the distortion constraint, DTarget, is automatically satisfied by
fixing the quantization parameter of the encoder at a satisfactory value and
adopting a retransmission scheme for communication.
5.5 Solution Method
A gradient-descent algorithm was used to optimize over transmit power, Pt.
Since the simulation results for the gate delay were obtained at a discrete
set of values for kVOS, we searched over this parameter to find the
optimum (see footnote 2). The algorithm also searched over the number of correctable
symbols, t, which is also a discrete parameter.
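The structure of this search can be sketched as an outer enumeration over the discrete parameters with an inner gradient descent over Pt. The code below is an illustration only: `toy_total_power` is a made-up stand-in for the power models of this chapter, and the step size, iteration count, and finite-difference gradient are assumptions rather than the settings used in this work.

```python
from itertools import product


def toy_total_power(k_vos, t, p_t):
    """Hypothetical objective; every constant here is invented for illustration."""
    video = 40.0 * k_vos ** 2                   # encoder power falls with overscaling
    channel = 0.05 * t                          # stronger codes cost more power
    rate_penalty = 1.0 + 2.0 * (1.0 - k_vos)    # errors inflate the bit rate
    retransmit = (4.0 / (t + 1)) / p_t          # low Pt means more retransmissions
    return video + channel + rate_penalty * (p_t + retransmit)


def minimize_pt(total_power, k_vos, t, pt0=1.0, step=0.05, iters=500,
                h=1e-4, pt_min=1e-3):
    """Gradient descent over Pt for fixed (k_vos, t), using a central difference."""
    p_t = pt0
    for _ in range(iters):
        grad = (total_power(k_vos, t, p_t + h)
                - total_power(k_vos, t, p_t - h)) / (2.0 * h)
        p_t = max(pt_min, p_t - step * grad)    # keep Pt strictly positive
    return p_t


def optimize(total_power, k_choices, t_choices):
    """Enumerate the discrete choices; optimize Pt inside each one."""
    best = None
    for k_vos, t in product(k_choices, t_choices):
        p_t = minimize_pt(total_power, k_vos, t)
        power = total_power(k_vos, t, p_t)
        if best is None or power < best[0]:
            best = (power, k_vos, t, p_t)
    return best
```

For example, `optimize(toy_total_power, (0.7, 0.85, 1.0), (1, 2, 4))` returns the lowest-power tuple `(power, k_vos, t, p_t)` under the toy model. The enumeration is cheap because both discrete sets are small.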
5.6 Results: Power Savings
Figure 5.4 gives a comparison of the optimized system (left bar plots)
relative to a system that does not allow voltage overscaling at the video
encoder, but optimally chooses only the transmit power and RS encoder
redundancy (right bar plots). Over short communication links, the system
power is dominated by the video encoder, and optimal choice of kVOS yields
about 35%–50% power savings. Over longer communication links, where
the communication power dominates, system power savings are about 18%.
5.7 Conclusion
The systems in Class III contain heterogeneous subsystems. Such systems
naturally occur in complex SoCs. This work describes a mobile video
communication system that was optimized using our estimation-theoretic
framework to reduce power consumption by up to 35%. Identifying other
potential systems that share this structure for applications in
communication systems, audio processing systems, and other media
processors is a topic of our on-going research.
Footnote 2: Many practical systems only allow operation at a discrete set of supply voltages.
[Figure 5.4 graphic omitted. Four bar-chart panels, each plotted against Distance (m) at 1, 50, and 100 m, with legend entries Optimal and Fixed: (a) Video Encoding, y axis Video Enc. Power (mW); (b) Channel Encoding, y axis RS Enc. Power (mW); (c) Transmission, y axis Transmit Power (mW); (d) Total System Power, y axis Total Power (mW).]
Figure 5.4: The power consumed by the video encoder, the channel encoder,
and the wireless transmitter, and the total system power consumption, are compared
in panels (a), (b), (c), and (d), respectively. Relative power allocation for
the different subsystems varies depending on the range of communication.
CHAPTER 6
CLASS IV: REDUNDANCY-AIDED SYSTEMS
Systems that have specified requirements for built-in redundancy represent
an important class and are common in space and other mission-critical
applications. This chapter presents an estimation-theoretic design for this
class by focusing on algorithmic noise tolerance (ANT) systems. The
techniques presented in this chapter extend existing work by globally
optimizing the entire system by taking into account the application
performance and the ANT architectural parameters.
6.1 Class Overview
The fourth class of systems includes explicit redundant computation to
detect and discard erroneous computations. Traditional N -modular
redundancy and ANT systems belong to this class of systems [24]. NMR
systems replicate computation and use a majority voter to discard
erroneous computation. ANT systems use a parallel lower-complexity
redundancy block that is used if the main block is found to be in error.
Figure 6.1 shows a main block aided by an estimator that computes an
approximate, but less error-prone result on the basis of the same inputs.
For NMR systems, the 0-1 cost function may be applied to the different
modules and the overall system. In ANT systems (see Figure 2.1(b)), the
main block and the estimator block do not share the same power/reliability
trade-off, but are evaluated by the same application cost function, C(θ, θ̂),
because they are different estimates of the same value. The system-level
cost function for ANT systems is composed as follows:
$$C(\theta, \hat{\theta}) = (1 - P_{e,main})\, C(\theta_{main}, \hat{\theta}_{main} \mid E^{c}_{main}) + P_{e,main}\, C(\theta_{est}, \hat{\theta}_{est} \mid E_{main})$$
where Emain denotes the event that the decision block declares an error in
the main block and Pe,main is the probability that this event occurs.
[Figure 6.1 graphic omitted: block diagram not recoverable from the extracted text.]
Figure 6.1: Class IV systems contain built-in redundancy. The decision
block computes the final output based on outputs of the main block and
the redundancy block. The redundancy block may be designed to suffer
fewer computational errors by compromising fidelity (e.g., precision).
The system power consumption, Psys, is given by
Psys = Pmain + Pestimator + Pdecision
where Pmain, Pestimator, and Pdecision are the powers consumed by the main
block, the estimator block, and the decision block, respectively.
6.2 Application: Word-Length Optimized ANT Systems
Recently developed algorithmic noise-tolerance (ANT) error-control
schemes [24] employ a lower-complexity redundancy block that operates in
parallel with an error-prone main block. These ANT systems aggressively
scale the supply voltage of the main DSP block, while allowing it to make
errors. The output of a parallel lower-complexity estimator block is used if
the main block is found to be in error. The ANT systems have also been
shown to be effective when errors are caused by nanoscale artifacts.
These schemes obtain significant energy savings when compared with
traditional N -modular redundancy systems, because the redundant
computation is designed to be of lower complexity.
6.3 System and Noise Model
The reduced precision redundancy (RPR) ANT [32] system pairs a main
DSP block with a lower-precision estimator block. While voltage
overscaling or process variations may cause timing errors in the main DSP
block, the lower-precision block is designed to be immune to such errors,
thereby offering system power savings. The power reduction gained by
decreasing word length typically comes at the cost of lowered performance
in terms of signal-to-quantization-noise ratio (SQNR). The main block and
the estimator block do not share the same power/reliability trade-off, but
are evaluated by the same application cost function, C(θ, θ̂), because they
are different implementations of the same computation, allowing us to
directly compare their outputs.
6.4 Estimation-Theoretic Design
The system-level application cost function, C(θ, θ̂), is the SQNR. This is
determined by the cost functions of the main block and the estimator block
as shown below:
$$C(\theta, \hat{\theta}) = (1 - P_{e,main})\, C(\theta_{main}, \hat{\theta}_{main} \mid E^{c}_{main}) + P_{e,main}\, C(\theta_{est}, \hat{\theta}_{est} \mid E_{main})$$
where Emain denotes the event that the decision block has declared an error
in the main block, Pe,main is the probability that this event occurs, θmain is
the desired output of the main block, θ̂main is the actual main block output,
and θest and θ̂est are the corresponding desired and actual outputs of the
estimator block, respectively.
The architectural constraint set, A, is given by
$$\mathcal{A} = \mathcal{B}_1 \times \mathcal{B}_2 \times \mathcal{V}_{DD,main} \times \mathcal{V}_{DD,RPR}$$
where B1 and B2 are sets of choices of the word lengths of the main and
estimator blocks, respectively, and VDD,main and VDD,RPR are their
respective supply-voltage choices.
6.4.1 Estimation-theoretic problem statement
Using the estimation-theoretic framework, we can reduce power
consumption by jointly optimizing over the word lengths of both the main
block and the estimator, while meeting application SQNR performance. Let
the word length of the main block and estimator block be B1 and B2,
respectively. This corresponds to a power consumption of Pmain and PRPR,
respectively. This problem may be formalized as follows:
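$$
\begin{aligned}
[B_1, B_2, V_{DD,main}, V_{DD,RPR}] = \arg\min\ & \{ P_{main}(B_1, V_{DD,main}) + P_{RPR}(B_2, V_{DD,RPR}) \} \\
\text{subject to}\quad & SQNR_{avg} \le P_{TE}(B_1, V_{DD,main})\, SQNR_{RPR}(B_2) \\
& \qquad\qquad + \{1 - P_{TE}(B_1, V_{DD,main})\}\, SQNR_{main}(B_1) \\
& V_{DD,main} \ge V_{DD,RPR} \\
& B_2 \le B_1 \\
& \mathcal{A} = \mathcal{B}_1 \times \mathcal{B}_2 \times \mathcal{V}_{DD,main} \times \mathcal{V}_{DD,RPR}
\end{aligned}
\tag{6.1}
$$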
where VDD,main and VDD,RPR are the supply voltages of the main block and
the estimator block, respectively.
This corresponds to Problem 2.1 of our estimation-theoretic framework.
6.5 Solution Method
The above optimization problem can be solved using a brute-force search.
This is acceptable because it is a relatively small search that needs to be
performed just once prior to hardware design.
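A brute-force search of this kind might look like the sketch below. It is not the implementation used in this work: the timing-error and power models are invented placeholders, and the SQNR model is the common 6.02B + 1.76 dB rule for a uniform quantizer with a full-scale sinusoidal input, used here as an assumption in place of the simulated values.

```python
from itertools import product


def sqnr_db(bits):
    """Assumed SQNR model: 6.02*B + 1.76 dB for a B-bit uniform quantizer."""
    return 6.02 * bits + 1.76


def p_timing_error(bits, vdd):
    """Hypothetical timing-error probability: wider words need a higher voltage."""
    critical = 0.6 + 0.015 * bits
    return min(1.0, max(0.0, (critical - vdd) / 0.2))


def block_power(bits, vdd):
    """Hypothetical power model in arbitrary units: proportional to B * Vdd^2."""
    return bits * vdd ** 2


def brute_force(sqnr_target, b_main, b_est, v_main, v_est):
    """Return (power, B1, B2, Vmain, Vrpr) with minimum power, or None."""
    best = None
    for b1, b2, v1, v2 in product(b_main, b_est, v_main, v_est):
        if b2 > b1 or v1 < v2:          # constraints B2 <= B1, Vmain >= Vrpr
            continue
        p_te = p_timing_error(b1, v1)
        average = p_te * sqnr_db(b2) + (1.0 - p_te) * sqnr_db(b1)
        if average < sqnr_target:       # average SQNR must meet the target
            continue
        power = block_power(b1, v1) + block_power(b2, v2)
        if best is None or power < best[0]:
            best = (power, b1, b2, v1, v2)
    return best
```

With word lengths from 4 to 16 bits and three voltage choices per block, the search visits 13 × 13 × 3 × 3 = 1521 candidates, which supports the point that exhaustive enumeration is affordable at design time.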
Table 6.1: Optimal word-length choices for an ANT system resulting from
the estimation-theoretic design.
SQNR (dB)   Main VDD (V)   Main WL   Estimator VDD (V)   Estimator WL   Power (microwatts)
30          0.75           5         0.75                4              62.50
50          0.75           9         0.75                4              90.33
70          0.75           13        0.75                4              118.17
90          0.80           16        0.75                4              156.14
95          0.90           16        0.75                6              205.13
6.6 Results: Optimized Word Lengths
Simulation in SPICE and VERILOG characterized the delay for full adders
ranging from 4 to 16 bits wide at different supply-voltage settings using the
IBM 130 nm CMOS transistor model [57]. If an application’s SQNR
requirement is known, a design based on this optimization avoids
overprovisioning of word length and supply voltage. Table 6.1 lists the
optimal choice for the word lengths of the main and estimator blocks for
different application SQNR requirements. The minimum system power
consumption for each SQNR requirement is also listed.
6.7 Conclusion
The ANT systems that belong to Class IV are natural candidates for the
optimization formalization described here. This optimization allows the
designer to tailor an ANT system exactly to match the performance needs
of the application without overprovisioning voltage or hardware word
lengths. This work can be extended for use in general-purpose N -modular
redundancy systems to yield new fusion schemes that can potentially
outperform the commonly used majority-voting scheme.
CHAPTER 7
SCALABLE STOCHASTIC PROCESSOR
The previous chapters have addressed the design of systems-on-chip for
specific applications. General-purpose embedded processors are now
becoming a popular choice for many applications because they can be easily
reprogrammed to accommodate changing standards and protocols.
Addressing the power constraints in these systems is, therefore, an
important problem. Modern embedded processors typically contain
multiple functional units for each arithmetic operation and, sometimes,
multiple replicated cores. This redundancy presents new opportunities for
power/reliability trade-offs.
This chapter draws from results of the previous chapters to propose a
new dynamic microarchitecture that exploits this redundancy and adapts to
changing application demands. This work was done in collaboration with
John Sartori, who performed the CAD characterizations of the error rates
of the different functional-unit architectures. My contribution was in the
design proposals and the development of the video application. This
chapter is an expansion of our recent publication in [58].
7.1 Introduction
The emergent ubiquitous computing paradigm promises new applications in
environmental monitoring, automation, and health care. For these new
applications to be practical, the computing platform needs to offer high
performance while operating within a very limited power budget (often
micro- or even nanowatts). While technology scaling driven by Moore’s law
has offered continued reduction in power consumption and size, recent
projections from the International Technology Roadmap for
Semiconductors (ITRS) suggest that this scaling trend alone will not be
sufficient to meet the demands of these future applications [59].
Dynamic voltage/frequency scaling (DVFS) has proven to be an effective
power-reduction technique. For example, the authors in [60] and [9]
dynamically adjust supply voltage and clock frequency across
voltage/frequency islands while taking into account process variations, soft
error rate, and power consumption. The DVFS techniques are necessarily
limited, however, by the particular path-delay characteristics of the
underlying architecture. Aggressive scaling can cause violations of timing
constraints and result in timing errors. In the present-day design flow,
architectural choices for the various functional units (FUs) are made not
with intent to allow voltage- or frequency-scaling but to minimize power
and area for operation at the nominal voltage/frequency. In traditional
high-performance designs, all timing paths are tuned to match the length of
the critical path. An implication of this design style is that when one
timing path fails, a large number of other timing paths fail, since all path
lengths are bunched around the critical path length. In fact, the critical
operation point hypothesis (COP) recently posited by J. Patel [61] claims
the following regarding general-purpose processors:
In large CMOS circuits there exists a Critical Operating
Frequency Fc and Critical Voltage Vc for a fixed ambient
temperature T , such that
• Any frequency above Fc causes massive errors
• Any voltage below Vc causes massive errors
• [At any] frequency below Fc or voltage above Vc, no
process related errors occur
In practice, Fc and Vc are not single points, but are confined to
an extremely narrow range for a given ambient temperature Tc[.]
This suggests that computational platforms need to be designed from the
ground up to allow scaling.
An alternative design strategy to highly optimized design is to create an
architecture that fails gracefully rather than catastrophically [62, 63, 64].
With such a design, timing path lengths are spread out, rather than being
clustered near the critical path length. This design style has its drawbacks.
Since timing paths can have very disparate lengths, operating such designs
above the critical point, which is required for correctness, often incurs
substantial power and performance overhead [62, 64, 65]. These overheads
are due to the fact that for reliable operation at a common frequency,
gracefully failing designs must be operated at a higher voltage and thus
consume more power. Therefore, these designs may not be appropriate in
situations where correctness of the output is of utmost importance and
there is some slack available in the power budget.
Researchers in [66, 67] have used CAD-based methods to redistribute
timing slack among the various paths to design architectures that can have
tailored reliability characteristics. These methods allow the designer to
adjust power/reliability trade-offs of a given architecture for a computation.
This chapter seeks to exploit the diversity in power/reliability behaviors of
different architectures for a computation dynamically to adapt to changing
application needs. Therefore, the methods in [66, 67] can be used
independently of the techniques presented in this chapter.
While many applications can never tolerate any timing errors and
therefore demand that scaling be limited to super-critical regimes, there
exists a class of applications that can sometimes tolerate a small number of
timing errors in computations. The work in [13] proposes a novel
co-processor design that exploits noise as an agent for low-power
computation. This work targets applications that involve some randomized
computation. The authors in [68] propose another system architecture to
exploit inherent error tolerance of some applications. Their system
architecture consists of a small number of highly reliable cores that work
with a larger number of unreliable cores, and the different parts of the
application are mapped to these cores on the basis of their reliability
requirements. While these approaches have been shown to overcome
hardware errors, they impose a nontrivial overhead when error rates are low
or zero. For instance, if there are execution phases when the system
demands the highest performance and can afford to provision design
guard-bands, then it is desirable that the architecture scales to these
demands.
A scalable architecture that can adapt to dynamic application demands
by operating in multiple reliability modes — ranging from the traditional
highly optimized design to the gracefully degrading scaling-friendly designs
— will have a definite power advantage. Multimedia applications and
wireless communication applications often exhibit this characteristic of
time-varying reliability requirement. The inputs in these applications are
typically contaminated by system/input noise, and, therefore, timing errors
simply present a new source of noise. In this chapter, we choose a video
encoding application as an example and describe modifications that result
in architectures that degrade gracefully.
7.2 Scalable Architectures
Scalability can be achieved by replacing or supplementing traditional
functional units (FUs) with gracefully degrading units. Timing-error-prone
functional units may be incorporated into present-day systems at three
broad levels:
1. Design 1: Fixed. In this design, the baseline architecture is
modified by replacing one or more functional units with an
alternatively designed functional unit that is more conducive to
voltage/frequency scaling. As shown in Figure 7.1(a), the execution
unit consists of voltage scaling-friendly blocks (dotted lines) and is
able to lower computation accuracy gradually with supply voltage.
This design extends the range of voltage/frequency overscaling and is
suitable for applications that never demand maximum reliability, but
impose a very limited power budget.
This design point represents the least change to existing instruction
set architecture (ISA) and programming models. But the ability to
scale comes at the cost of compromising optimal power/performance
trade-off when such scalability is not desired.
2. Design 2: Functional-unit selectable. In this design, the
baseline processor is equipped with two different functional unit
architectures, FU-A and FU-B. The application may choose to switch
between the two functional units such that the overscaling range is
extended. The execution unit in Figure 7.1(b) contains two types of
logic blocks: traditional performance optimized version (solid lines)
and an alternative design that is friendly to voltage scaling (dotted
lines). This design is suitable for applications that have time-varying
power/performance demands.
For such designs we envision a modified ISA that allows the
application layers to choose particular functional units. Reliability
requirements of applications can be annotated in software, and these
annotations can be used to select the appropriate functional unit for a
program or program phase. The current reliability target can be used
to control the select lines of an MUX that routes an instruction
through the most power-efficient module. Since tuning a module for a
specific error rate requires voltage scaling, module switching incurs
overhead time for voltage scaling when the module must achieve
different reliability targets within the same program. Overhead time
is proportional to the voltage differential between reliability states.
3. Design 3: Core selectable. This design consists of a multi-core
system where each core possesses a different architecture for the
functional units. Figure 7.1(c) shows such a dual-core design, along
with a task-to-core scheduler that is responsible for assigning tasks
according to their reliability requirement. Unlike other multi-core
designs such as [68], the scalable cores of this design can dynamically
be made error free by adjusting the supply voltage or clock frequency.
This design is suitable for applications that can be decomposed into
subcomputations that have different power/performance demands.
This design may include Design 2 if each core has selectable FU
architectures.
The heterogeneity of these multi-core systems comes from the
different levels of reliability. These cores may or may not share a
common ISA. Such systems will employ a task-to-core scheduler that
is aware of the power/reliability trade-offs of the cores.
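The task-to-core scheduling idea of Design 3 can be pictured with a very small sketch. Everything in it is hypothetical: the task names, the `max_error_rate` annotation, and the rule that only zero-tolerance tasks go to the highly optimized core are illustrative assumptions, not a description of an actual scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    max_error_rate: float  # software annotation: tolerable timing-error rate


def assign_core(task):
    """Route error-intolerant tasks to the highly optimized core."""
    if task.max_error_rate == 0.0:
        return "highly_optimized"
    return "scaling_friendly"


def schedule(tasks):
    """Group task names by the core they are assigned to."""
    plan = {"highly_optimized": [], "scaling_friendly": []}
    for task in tasks:
        plan[assign_core(task)].append(task.name)
    return plan
```

A practical scheduler would also weigh the power cost of each core at the requested reliability level and the voltage-switching overhead noted for Design 2; this sketch only shows where the reliability annotation enters the decision.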
For a class of embedded applications that are data dominated, it is
common for the data path to significantly contribute to the total power
dissipation. For an audio decoding benchmark, in the Philips TM3270
media processor [69], the execute module consumes around 0.255 mW/MHz
out of a total processor power consumption of 0.935 mW/MHz,
corresponding to a 27% contribution. Therefore, we restrict our
power-reduction design techniques to the data path alone and demonstrate
power savings in such applications.
[Figure 7.1 graphic omitted. Three block diagrams: (a) Design D1, (b) Design D2, and (c) Design D3. Each core is drawn with Fetch & Decode, Instruction Buffers, FPU & IU, and Memory Interface blocks and a Vdd supply; panel (c) shows a Task-to-Core Scheduler feeding a Scaling-Friendly Core and a Highly Optimized Core.]
Figure 7.1: The scalable architecture introduces alternative functional units
at three levels. In (a), all the functional units of the core are replaced by
scaling friendly versions; (b) shows two different functional unit
architectures that can be selectively used; and (c) shows a
reliability-defined heterogeneous multi-core system.
7.3 Functional-Unit Architectures
The scalable architectures presented here exploit the diversity of
power/reliability trade-offs presented by different functionally equivalent
architectures. In this study, we show power-efficiency benefits for functional
units that have been designed for a particular reliability state. To
characterize their behavior, functional units are synthesized in the IBM9SF
90 nm CMOS technology with the Synopsys DesignCompiler [70], and
layout is performed in the Cadence SoC Encounter [71]. To measure power
and error rate across a range of voltages, we use voltage-specific Synopsys
Liberty (.lib) files prepared with Cadence SignalStorm [72]. To obtain an
accurate characterization of module behavior, we perform gate-level
simulations using an input set of 180,000 random input samples.
For results that assume Razor correction, the power consumption in
Razor-based designs is adjusted to account for the power overhead of Razor
flip-flops during normal operation, the buffering overhead required to
satisfy Razor correctness constraints, and the error recovery overhead
introduced by Razor when it detects and corrects an error.
7.3.1 Reliability and efficiency in functionally equivalent units
To compare functionally equivalent modules with different architectural
designs, we consider the Kogge-Stone adder (KSA), which is highly
optimized and fails catastrophically with voltage overscaling, and the ripple
carry adder (RCA), which fails gracefully but is much slower than the KSA.
Figure 7.2 shows how the error rate of different adder architectures varies
as voltage is scaled down.
The KSA is highly optimized and can be scaled to a much lower voltage
(0.9 V compared to 1.2 V for the RCA) before producing errors. However,
once errors occur, the adder fails catastrophically. On the other hand, the
error rate of the RCA increases gradually as voltage is scaled down.
However, the onset of erroneous behavior is much earlier than in the KSA,
so that a conservative voltage must be chosen to guarantee fidelity of the
output. Because of these failure characteristics, the functionally equivalent
modules have very different power/reliability characteristics. Figure 7.2
compares the power consumption of the adders at different error rates.
For reliable operation (0% error rate), the KSA consumes 25% less power
than the RCA. This is because, for the same frequency, voltage on the KSA
can be scaled down to save power, while scaling down voltage on the RCA
would cause timing errors. However, power/reliability trade-offs are not
possible for the KSA, since reducing voltage past the critical point causes
massive failure, so power consumption is the same for all error rates.
While the RCA is less efficient when operating completely reliably, its
gradual failure characteristic allows reliability to be traded for power
savings, making RCA favorable for noisy environments. For all non-zero
error rates, the RCA consumes less power than the highly optimized KSA.
[Figure 7.2 graphic omitted. Two panels, each with RCA and KSA curves: (a) Error Rate vs. Voltage, with Vdd from 0.70 to 1.20 on the x axis, Error Rate from 0.0 to 1.0 on the y axis, and markers for the lowest voltage with no errors for the RCA and for the KSA; (b) Power vs. Error Rate (no correction), with Error Rate from 0 to 0.16 on the x axis and Power from 2.00E-04 to 6.00E-04 on the y axis.]
Figure 7.2: (a) The RCA allows power/reliability trade-offs so that power is
reduced as the error rate is allowed to increase. (b) The KSA, on the other
hand, consumes less power for reliable operation, but does not allow
power/reliability trade-offs.
7.3.2 Support for error tolerant applications
Reliable computation in the face of noise requires error recovery. As
demonstrated in Figure 7.3, power can be reduced by incorporating Razor
into a design to detect and correct errors at the circuit level. The total
power consumption (including correction overhead) is plotted at different
pre-correction error rates. As the supply voltage is scaled, the total power
consumed decreases until an error rate of around 1% is encountered. The
cost of error recovery scales linearly with the error rate. When the error
rate versus voltage curve increases very
gradually, there is a large range over which voltage can be scaled before
reaching a higher error rate, resulting in power savings that exceed the cost
of error recovery and a decrease in the power versus error rate curve. On
the other hand, when the error rate versus voltage curve increases very
steeply, as it does for RCA when it jumps between two plateaus, there is a
large increase in error rate and very little power savings is obtained by
voltage scaling. Therefore, the power versus error rate curve can increase,
decrease, or remain level between successive error rates (e.g., at 2%).
In general, as the error rate increases, the recovery overhead of Razor
begins to dominate the power savings achieved through voltage scaling,
which limits the range over which power/reliability trade-offs can be made.
This trend may suggest that gate-level techniques that seek to correct every
instance of hardware error may be inadequate in some application regimes.
In contrast, system-level approaches that do not correct every error instance
and allow some errors to be masked will have a definite advantage.
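The interplay between voltage-scaling savings and recovery overhead can be illustrated with a simple model. The sketch below assumes that recovery adds power in proportion to the pre-correction error rate, written as a multiplicative factor on the uncorrected power; the factor `recovery_cost` and any operating points passed in are hypothetical and are not measured Razor data.

```python
def total_power_with_recovery(raw_power, error_rate, recovery_cost):
    """Power including recovery overhead, assumed linear in the error rate."""
    return raw_power * (1.0 + recovery_cost * error_rate)


def best_operating_point(points, recovery_cost):
    """Pick the (vdd, raw_power, error_rate) tuple with the lowest total power."""
    return min(
        points,
        key=lambda p: total_power_with_recovery(p[1], p[2], recovery_cost),
    )
```

With a large `recovery_cost`, the minimum moves toward higher voltages and lower error rates; with `recovery_cost` set to zero, the lowest-voltage point always wins. That shift is the sense in which recovery overhead limits the usable range of power/reliability trade-offs.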
7.3.3 Dynamic module trade-offs
Since different functional-unit designs have different efficiencies at different
error rates, an architecture that allows instructions to be routed to the
optimal module for a given error rate can achieve benefits over a static
module selection, which will be suboptimal in scenarios when both reliable
and unreliable operations are required. Figure 7.4 shows that power
consumption is minimized at each error rate when dynamic module
selection is performed on the basis of the target error rate.
[Figure 7.3 graphic omitted. Power vs. pre-correction Error Rate for the RCA and the KSA with Razor recovery; Error Rate from 0 to 0.04 on the x axis and Power from 7.00E-04 to 1.90E-03 on the y axis.]
Figure 7.3: Razor error recovery can provide some power savings for
gracefully failing designs (RCA) after the point of first error. However,
these benefits are limited, since only a small number of errors can be
gainfully tolerated before recovery overhead outweighs voltage-scaling
power reduction. Note that the quantity on the x axis is the error rate prior
to recovery.
When stand-alone application-level error tolerance is used,
power/reliability trade-offs are available over the entire range of the power
versus error-rate curve (Figure 7.2). Thus, when a higher error rate is
tolerable for an application, the application can achieve additional power
reduction on a gracefully failing architecture.
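Dynamic module selection amounts to taking, at each target error rate, the module whose power curve is lowest among its feasible operating points. The sketch below shows that selection rule; the function names are hypothetical, and each curve is assumed to be a list of (error rate, power) samples supplied by the caller.

```python
def min_power(curve, target_error_rate):
    """Lowest power among operating points that meet the target error rate.

    curve is a list of (error_rate, power) pairs; returns None if none qualify.
    """
    feasible = [power for err, power in curve if err <= target_error_rate]
    return min(feasible) if feasible else None


def select_module(curves, target_error_rate):
    """Return the name of the module with the lowest feasible power."""
    options = {}
    for name, curve in curves.items():
        power = min_power(curve, target_error_rate)
        if power is not None:
            options[name] = power
    return min(options, key=options.get)
```

Given curves shaped like those described in this section, such a rule picks the KSA when no errors can be tolerated and the RCA once a non-zero error rate is acceptable. Note that `select_module` raises `ValueError` if no module has a feasible point.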
[Figure 7.4 graphic omitted. Two panels, each with RCA, KSA, and Dynamic curves: (a) Power vs. Error Rate (no correction), with Error Rate from 0 to 0.16 on the x axis and Power from 2.00E-04 to 6.00E-04 on the y axis; (b) Power vs. Error Rate (with correction), with Error Rate from 0 to 0.04 on the x axis and Power from 7.00E-04 to 1.90E-03 on the y axis.]
Figure 7.4: Module selection based on target error rate allows for
power-efficient operation in both reliable and unreliable operational phases.
In (a), the power consumption vs. error-rate profiles for the two modules
are very different when no error correction is applied. In (b), an
error-correction mechanism such as Razor is applied, and the resulting
power consumption of the two modules is shown as a function of the error
rate prior to correction.
7.4 Stochastic Applications
The scalable architecture developed in this work targets aggressive power
reduction for a class of stochastic applications, i.e., applications that can
operate with a priori known reliability requirements. The reliability
requirement may change with time and execution phases within the
application. Multimedia applications are typical examples of stochastic
applications. Here, the input (such as a feed from a camera or a
microphone) is already contaminated by measurement noise. Since these
applications are increasingly implemented on fixed-point mobile platforms,
quantization poses another source of noise. Furthermore, the output in
these applications needs only to meet the fidelity discernible by human
sensory acuity. The scalable architectures described in this work can offer
significant power/throughput gains to these applications, if we treat
computational errors as a new source of noise.
As a particular example, we describe the advantages of a scalable
architecture to the popular H.264 video encoding application. The high
compression efficiency of this new video encoding standard has enabled
exciting applications in wireless video communication and is increasingly
implemented on battery-constrained mobile devices. An important
subsystem of the video encoder, the motion estimation engine, is often
reported as contributing about 40% to 50% of the total encoder power
consumption on ASIC implementations [53]. The main computational
kernel of the motion estimation engine is the sum of absolute difference
(SAD) that computes |A−B| for two inputs A and B. We seek to gain
power savings through voltage overscaling, while allowing any resulting
timing errors. A computational error in the motion estimation engine
simply results in poorer encoding efficiency, that is, a larger bit rate. Such
errors could result in non-zero motion vectors even if the current and
reference frames are identical (i.e., there was no motion in the video
sequence). This, in turn, will adversely impact the power overhead of
wireless communication. Consequently, during periods of relative inactivity
in the input video sequence or favorable wireless channel conditions, the
motion estimation block may contain some slack that can be exploited for
power reduction. Since models describing wireless communication overhead
are beyond the scope of this chapter, we will use the bit rate as a measure
of performance. By controlling the occurrence of timing errors in motion
estimation, a scalable processor can trade off bit rate for power reduction.
But for this approach to work, the bit rate must worsen gradually with
scaling supply voltage or clock frequency.
[Figure 7.5 plot: power per adder versus bit rate (kb/s) for the Dynamic,
Ripple Carry, and KSA configurations.]
Figure 7.5: The RCA is able to significantly lower the power consumed (per
adder) without compromising bit rate of the output. But the KSA is able
to offer about 20% lower power consumption when no bit rate degradation
can be allowed.
In order to study the impact of voltage overscaling on the compression
efficiency of the H.264 video encoder, we used a PC implementation of the
JM reference software [56]. The experimental setup described in Section 7.3
was used to obtain the probabilities of bit error for a 16-bit word length.
This probability model was used to inject errors into the motion-estimation
block of the JM reference software. Similar probabilistic models for bit
errors due to voltage overscaling have also been developed by other
researchers [32].
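The error-injection step can be illustrated with a minimal Python sketch.
It assumes that the bit-error model is a vector of independent per-bit flip
probabilities for a 16-bit word and that an error may be injected after
each accumulation of the SAD. Both are simplifying assumptions made here
for illustration, and the names inject_bit_errors and sad are hypothetical;
they do not correspond to functions in the JM reference software.

```python
import random

WORD_BITS = 16                     # word length used by the error model
WORD_MASK = (1 << WORD_BITS) - 1


def inject_bit_errors(value, bit_error_prob, rng):
    """Flip bit b of a 16-bit word with probability bit_error_prob[b].

    Assumption: bit flips are independent across bit positions.
    """
    for bit, p in enumerate(bit_error_prob):
        if rng.random() < p:
            value ^= 1 << bit
    return value & WORD_MASK


def sad(block_a, block_b, bit_error_prob=None, rng=None):
    """Sum of absolute differences of two equally sized pixel blocks.

    If bit_error_prob is given, the running sum is corrupted after every
    accumulation, which mimics timing errors in an overscaled adder.
    """
    rng = rng or random.Random()
    total = 0
    for a, b in zip(block_a, block_b):
        total = (total + abs(a - b)) & WORD_MASK
        if bit_error_prob is not None:
            total = inject_bit_errors(total, bit_error_prob, rng)
    return total
```

With all probabilities set to zero the function reduces to the error-free
SAD, so the same routine can be used for the reference and the overscaled
runs.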
We used three frames of a QCIF-format video source in 4:2:0 YUV
format as our input. If there is demand for the lowest achievable bit rate,
the application chooses the KSA and is able to consume about 20% less
power (when compared with the lowest power option for the RCA that
offers this best bit rate). If the application is able to tolerate some
worsening of the bit rate, then it switches to the RCA. A small increase in
bit rate of about 12 kb/s (i.e., a 1.2% loss) is able to reduce the power
consumption by about 60%. The scalable architecture that is able to switch
between the FU architectures at runtime is able to maintain optimal power
consumption at all levels of bit rate demand, as shown in Figure 7.5.
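The selection rule described above can be sketched as a lookup over
characterized operating points: among all points whose bit rate is within
the application's budget, pick the one with the lowest power. The
OperatingPoint type, the function name, and the numbers in EXAMPLE_POINTS
are hypothetical; the values are placeholders chosen only to echo the 20%
and 60% figures quoted in the text and are not read from Figure 7.5.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    adder: str            # functional-unit architecture, e.g. "KSA" or "RCA"
    bitrate_kbps: float   # encoder bit rate observed at this point
    power: float          # power per adder, normalized (placeholder units)


def select_operating_point(
    points: Sequence[OperatingPoint], max_bitrate_kbps: float
) -> Optional[OperatingPoint]:
    """Return the lowest-power point that meets the bit-rate budget."""
    feasible = [p for p in points if p.bitrate_kbps <= max_bitrate_kbps]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.power)


# Placeholder characterization data (illustrative only, not measured).
EXAMPLE_POINTS = [
    OperatingPoint("RCA", 1000.0, 1.00),  # RCA at the best bit rate
    OperatingPoint("KSA", 1000.0, 0.80),  # KSA: about 20% lower power
    OperatingPoint("RCA", 1012.0, 0.32),  # overscaled RCA: about 60% lower
]
```

With a budget of 1000 kb/s the sketch returns the KSA point; relaxing the
budget to 1012 kb/s makes it switch to the overscaled RCA point.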
As an example implementation, consider Design 2. The scalable
processor will receive input from the wireless subsystem (responsible for
packetizing and communicating encoded video data) regarding the quality
of the communication channel. Under adverse channel-quality conditions,
the processor will use the KSA adders by issuing the correspondingly
annotated instructions. Under more favorable channel conditions, it will
issue the instructions annotated to use the RCA adders. The processor will
then proceed to lower the supply voltage according to channel information
received from the wireless subsystem. By constantly adapting the issued
instructions to changing wireless channel quality, this scalable architecture
maintains the lowest-possible power consumption.
7.5 Conclusion
Many applications can tolerate a small number of computational errors at
least during some execution phases. Exploiting this inherent error tolerance
for power gains has been a topic of recent research interest. In order for
new stochastic architectures to be practically viable, they need to adapt to
time-varying application demands; in particular, they should not
compromise performance when the application demands are high and
supply-voltage slack may be warranted. The scalable stochastic
architecture presented in this work exposes multiple alternative designs of
functional units to the application and, thereby, allows scaling over a wide
range. At the same time, it is able to provide the application maximum
performance when demanded. This scalability offers the proposed system
20% to 60% power savings in the motion estimation block of a mobile video
communication application.
CHAPTER 8
CONCLUSIONS
The goal of this research was to develop systematic design methods for
low-power systems that exactly match performance and reliability to
time-varying application demands. Toward this goal, this research
developed a new general framework for designing systems that optimally
trade power and reliability, while meeting performance constraints. The
strength of this work is in describing real application scenarios where this
framework yielded power and performance improvements.
The essential idea for this research was the realization that hardware
errors caused by reliability issues may be viewed as analogous to system or
input noise. Many applications in signal processing, wireless
communication, and multimedia are typically evaluated using statistical
metrics. Therefore, there may be some available slack at the application
level that can be exploited for power gains. The estimation-theoretic
framework presented in Chapter 2 is a design optimization formalization
that captures engineering challenges, such as cost of hardware errors and
availability of architectures, and provides valuable links to powerful
methods from estimation theory.
This dissertation identifies three classes of applications in which the
estimation-theoretic framework can offer improvements in power and
robustness. These classes were meant to describe the diversity of
applications that can benefit from the framework and are not meant to
exhaust all possible applications. Other new and mixed classes can
certainly benefit from this estimation-theoretic view.
The applications in Class I contain parallel and identical subsystems.
Parallelization has often been used in practical implementations as a means
of improving throughput, but here it has additionally been exploited for
gaining robustness. The multiple simultaneous subcomputation results
serve as observations analogous to those in classical estimation theory
contexts, and the final result is viewed as an estimate that is based on
them. In one application from this class, the code acquisition block of
wireless CDMA systems, an estimation-theoretic redesign resulted in
system power reduction of 30% to 40% under voltage overscaling, and
about 800× improvements in robustness to process variations.
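The idea of treating parallel subcomputation results as observations can
be illustrated with a small Python sketch. It uses the sample median as a
simple stand-in for a robust estimator; this choice, the function name
fuse, and the sample values are assumptions made for illustration and are
not the specific estimators or data used in this work.

```python
from statistics import mean, median


def fuse(observations, robust=True):
    """Combine parallel sub-results into one estimate.

    robust=True uses the median, which ignores a minority of grossly
    wrong observations; robust=False uses the plain average.
    """
    return median(observations) if robust else mean(observations)


# Four sub-results agree; the fifth has a flipped high-order bit.
sub_results = [100, 101, 99, 100, 100 + (1 << 15)]
robust_estimate = fuse(sub_results)               # stays near 100
naive_estimate = fuse(sub_results, robust=False)  # pulled far from 100
```

A single large-magnitude error, typical of a timing error in a
most-significant bit, leaves the median essentially unchanged but shifts
the average by thousands.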
The results of traditional robust estimation theory that motivated the
designs of Class I systems impose the restriction that the nominal noise
samples across the observations be independent. For general robust
parameter estimation problems with dependent observations, Chapter 4
developed methods that are computationally efficient. The multiantenna
communication receiver application gains up to 18× improvements in mean
squared error when voltage overscaling is adopted as a power-reduction
technique. Addressing this particular context of robust system design
uncovered an important robust estimation problem that has remained
unaddressed and that has broad applicability in general distributed
signal-processing contexts with error-prone sensing mechanisms. The
currently available theoretical extensions of robust statistics tend to restrict
the nature of the correlation to very specific models. The methods
developed in this work have been motivated by intuition and outperform
other published methods; but more research is required to determine
whether this method is theoretically optimal.
Not all applications will readily lend themselves to parallelization. Many
SoCs typically consist of a series of connected subsystems. The power
consumption and reliability profiles of these subsystems could be very
different from one another. Chapter 5 exploits this diversity in power,
reliability, and performance behaviors characteristic of this class of
applications to adapt to time-varying application conditions. In a wireless
video communication system, the choice of the different system parameters,
such as supply voltage and channel encoder rate, can vary significantly
depending on operating conditions, such as wireless channel conditions.
Optimal choice of these parameters, resulting from adopting the
estimation-theoretic framework, reduced the power consumption by 30%.
Signals captured from the physical world (e.g., video, audio, or speech)
typically exhibit a large variability in their properties and, therefore, this
class may come to include many future applications that are closely
integrated with their environment.
Redundancy-aided systems such as NMR continue to be employed in real
implementations. Such systems comprise the third class presented here.
Chapter 6 demonstrates the usefulness of the framework in an ANT system
that avoids overprovisioning of resources by being tailored to specific
application requirements. These designs help to reduce the overhead cost of
recomputation, while retaining its robustness benefits.
In addition to addressing SoC designs, this research has also investigated
the feasibility of redesigning general-purpose processors for use in stochastic
computation contexts. The scalable stochastic processor proposed in
Chapter 7 is able to dynamically switch between functional units to remain
power optimal when the application performance demands change. In a
video application benchmark, this design was able to lower power
consumption by 60% for a loss of about 1.2% in coding efficiency. This
evidence supports the argument for extending the application-aware design
ideas employed in SoCs to programming general-purpose computing
platforms. This work has focused only on the microarchitectural aspects of
the datapath, and the results may be representative only of data-dominated
applications. As such, the overhead of the control unit and other layers,
such as the run-time system, must be minimal. Future research into
efficient design of the other aspects of the processor is required to
translate the savings obtained here to system-level gains.
The success of this research is attributable to the highly interdisciplinary
approach that was employed. All projects undertaken here took a
system-level estimation-theoretic view, but quantifying the power,
reliability, and performance required close collaborations with researchers in
other specializations. Many disparate factors contribute to the
power/reliability problem, and the solution may call for such
interdisciplinary research efforts.
8.1 Future Work
The most exciting future applications of information technology are in
systems that closely integrate a number of processing elements with their
physical environment. I envision pervasive computing systems as containing
sensors of various modalities that enable integration with the physical
environment. A primary challenge in these systems will be to develop
techniques for processing multi-modal data gathered by these sensors. The
on-line estimation of power availability and performance/reliability
demands will require research in real-time signal processing. One future
research direction is to develop methods to acquire information regarding
dynamic power, reliability, and performance demands and to configure
algorithms and communication protocols accordingly. I propose to develop
detection and estimation algorithms that can be dynamically configured to
achieve the necessary trade-offs.
Once configured, the software layer needs to be mapped onto the
computing platform. The computing platform will present a variety of
choices in terms of available and reconfigurable processing-element
architectures. Recent research efforts, such as in [73], have shown that
microarchitectural design choices can have drastically different power and
performance implications in subthreshold processors. Similar observations
were made in this work on stochastic processors that operated in
super-threshold regimes. This diversity in power/performance behavior
presents a rich opportunity for system-level trade-offs. No one set of choices
can be optimal over the entire range of application demands typical of
highly dynamic pervasive systems of the future. I propose to treat the
problem of selecting appropriate functional-unit architectures on the basis
of real-time application demands as a resource allocation problem and to
develop efficient methods.
The work in Chapter 4 has developed methods for extending traditional
robust estimation theory to include correlated noise models. The methods
in Chapter 4 lack a theoretical foundation, and are derived from intuition.
Future research will develop optimal methods that can estimate a
parameter in the presence of correlated noise in a robust manner. This line
of research will have applications in the broader context of distributed
signal-processing systems in which the sensing mechanisms are subject to
many modes of failure.
APPENDIX A
PROOF OF THE EQUIPARTITION
THEOREM
Lemma 1. The variance of the output of the $j$-th filter bank of a $k$-phase
matched filter is given by
\[
  \frac{E^2}{E_j}\,\sigma_x^2
\]
where
$E$ is the energy of the matched filter,
$\sum_{j=1}^{k} \sum_{i=1}^{n_j} h_{ij}^2$,
$E_j$ is the energy of the $j$-th filter bank,
$\sum_{i=1}^{n_j} h_{ij}^2$, and
$\sigma_x^2$ is the variance of the iid input noise.
Let $h_{ij}$ be the $i$-th coefficient of the $j$-th filter bank and $n_j$ be
the number of taps in the $j$-th filter bank. The output of each bank needs
to be scaled by a factor of
\[
  \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} h_{ij}^2}{\sum_{i=1}^{n_j} h_{ij}^2}
\]
to produce an estimate of the overall computation.
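Let $\hat{\sigma}_j^2$ denote the variance of the unscaled output of the
$j$-th filter bank and $\sigma_j^2$ denote the variance of the $j$-th scaled
estimate.
\[
\begin{aligned}
  \sigma_j^2
  &= \left( \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} h_{ij}^2}
                 {\sum_{i=1}^{n_j} h_{ij}^2} \right)^{2} \hat{\sigma}_j^2 \\
  &= \left( \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} h_{ij}^2}
                 {\sum_{i=1}^{n_j} h_{ij}^2} \right)^{2}
     \left( \sum_{i=1}^{n_j} h_{ij}^2 \right) \sigma_x^2
     \quad \text{(matched filter with coefficient vector $h$)} \\
  &= \frac{\left( \sum_{j=1}^{k} \sum_{i=1}^{n_j} h_{ij}^2 \right)^{2}}
          {\sum_{i=1}^{n_j} h_{ij}^2}\, \sigma_x^2 \\
  &= \frac{E^2}{E_j}\, \sigma_x^2
\end{aligned}
\]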
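As a numerical illustration of Lemma 1, the following Python sketch
computes the variance of the scaled output of one filter bank directly from
the scaling factor and the unscaled variance. The function names and the
example coefficients are hypothetical; the only property used is that iid
noise of variance $\sigma_x^2$ passed through a bank with coefficients
$h_{ij}$ has output variance $E_j \sigma_x^2$.

```python
def bank_energies(banks):
    """Energy E_j of each filter bank (sum of squared coefficients)."""
    return [sum(h * h for h in bank) for bank in banks]


def scaled_output_variance(banks, j, noise_var):
    """Variance of the j-th bank's output after scaling by E / E_j."""
    energies = bank_energies(banks)
    total_energy = sum(energies)              # E
    scale = total_energy / energies[j]        # E / E_j
    unscaled_var = energies[j] * noise_var    # E_j * sigma_x^2
    return scale ** 2 * unscaled_var          # equals (E^2 / E_j) * sigma_x^2


# Example with +/-1 coefficients and unequal bank lengths (4, 2, 2 taps).
example_banks = [[1, -1, 1, 1], [1, 1], [-1, 1]]
```

For the example banks, E = 8, so with noise variance 0.5 the first bank
gives 8.0 and the second gives 16.0, matching $E^2 \sigma_x^2 / E_j$: the
shorter bank produces the noisier estimate.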
The heteroscedastic extensions of robust statistics presented in [38]
become relevant when the lengths of the different filter banks are different,
and the resulting variances of the different estimates are unequal. In [38],
the set of noisy observations, $Y_1, \cdots, Y_N$, are divided into subsets
with equal variance. The $j$-th subset consists of measurements
$Y_{j1}, \cdots, Y_{jN_j}$. For a class of influence functions, the
maximum-likelihood estimation is generalized by the following weighted
M-estimation problem that is used to solve for the estimate, $T$:
\[
  \sum_{j=1}^{k} w_j \sum_{i=1}^{N_j} \psi(Y_{ji} - T) = 0
\]
where $k$ is the number of equal-variance subsets, $N_j$ is the number of
observations in the $j$-th subset, and $w_j$ is the weight assigned to the
$j$-th subset.
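A minimal Python sketch of solving the weighted M-estimation equation is
given below. It assumes a monotone influence function of the form
sgn(x)|x|^alpha, so that the left-hand side is non-increasing in T and the
root can be bracketed between the smallest and largest observation and
found by bisection. The function names, the bisection solver, and the
tolerance are choices made here for illustration and are not taken from
[38].

```python
import math


def psi_alpha(x, alpha):
    """Influence function sgn(x) * |x|**alpha (returns 0 at x == 0)."""
    if x == 0:
        return 0.0
    return math.copysign(abs(x) ** alpha, x)


def weighted_m_estimate(subsets, weights, alpha=1.0, tol=1e-9):
    """Solve sum_j w_j * sum_i psi(Y_ji - T) = 0 for T by bisection."""

    def score(t):
        return sum(
            w * sum(psi_alpha(y - t, alpha) for y in ys)
            for w, ys in zip(weights, subsets)
        )

    lo = min(min(ys) for ys in subsets)   # score(lo) >= 0
    hi = max(max(ys) for ys in subsets)   # score(hi) <= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With alpha = 1 and unit weights the estimate reduces to the sample mean;
smaller values of alpha reduce the pull of outlying observations.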
Fact 1. Influence functions, $\psi$, chosen from a class of functions that
are positively homogeneous with respect to a multiplier, enjoy the following
property [38]:
\[
  \psi(x) = \psi_\alpha(x) = \operatorname{sgn}(x)\,|x|^{\alpha},
  \quad \alpha \ge 0
\]
The following results hold for a positively homogeneous influence function:
\[
\begin{aligned}
  \psi_\alpha(v_j (Y_i - T)) &= v_j^{\alpha}\, \psi(Y_i - T) \\
  v_j\, \psi'_\alpha(v_j (Y_i - T)) &= v_j^{\alpha}\, \psi'(Y_i - T)
\end{aligned}
\]
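Both identities follow directly from the definition of $\psi_\alpha$. As a
short check, assume $v_j > 0$, write $x = Y_i - T$, and, for the derivative,
assume $x \neq 0$ and $\alpha > 0$ (these restrictions are stated here for
the check and are not spelled out above):

```latex
\[
\begin{aligned}
  \psi_\alpha(v_j x)
    &= \operatorname{sgn}(v_j x)\,|v_j x|^{\alpha}
     = v_j^{\alpha}\,\operatorname{sgn}(x)\,|x|^{\alpha}
     = v_j^{\alpha}\,\psi_\alpha(x), \\
  \psi'_\alpha(x) &= \alpha\,|x|^{\alpha-1}, \\
  v_j\,\psi'_\alpha(v_j x)
    &= v_j\,\alpha\,v_j^{\alpha-1}\,|x|^{\alpha-1}
     = v_j^{\alpha}\,\psi'_\alpha(x).
\end{aligned}
\]
```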
Lemma 2. Let $\hat{\mu}$ be the robust estimate of the mean, $\mu$. Then,
under certain technical assumptions, if we restrict the influence function
to be chosen from the class of positively homogeneous functions,
$\sqrt{N}(\hat{\mu} - \mu)$ converges to a zero-mean normal distribution
with the following minimum variance, $V$ [38]:
\[
  V = \sum_{j=1}^{k} \theta_j\,
      \frac{\left[ \int \psi^2(x)\, dF^*(x/\sigma_j) \right]}
           {\left[ \int \psi'(x)\, dF^*(x/\sigma_j) \right]^{2}}
  \tag{A.1}
\]
where $F^*$ is the distribution of $(Y_i - \mu)/\sigma_i$. The $Y_i$ are the
observations with a common mean, $\mu$, and unequal scale, $\sigma_i$, and
$N$ is the number of measurements. See [38].
Theorem 2. A robust k-phase matched filter implementation produces an
estimate with minimum variance if the filter banks are of equal energy.
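Proof. The optimization problem of interest can be stated thus:
\[
  \min V = \sum_{j=1}^{k} \theta_j\,
      \frac{\left[ \int \psi^2(x)\, dF^*(x/\sigma_j) \right]}
           {\left[ \int \psi'(x)\, dF^*(x/\sigma_j) \right]^{2}}
  \quad \text{subject to} \quad
  \sum_{j=1}^{k} E_j = E
\]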
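Consider the $j$-th term of $V$.
\[
\begin{aligned}
  \theta_j\,
  \frac{\left[ \int \psi^2(x)\, dF^*(x/\sigma_j) \right]}
       {\left[ \int \psi'(x)\, dF^*(x/\sigma_j) \right]^{2}}
  &= \theta_j\,
     \frac{\left[ \int \psi^2(\sigma_j u)\, dF^*(u) \right]}
          {\left[ \int \psi'(\sigma_j u)\, dF^*(u) \right]^{2}}
     && \text{(let $x/\sigma_j = u$)} \\
  &= \theta_j\,
     \frac{\left[ \int \sigma_j^{2\alpha}\, \psi^2(u)\, dF^*(u) \right]}
          {\left[ \int \sigma_j^{\alpha-1}\, \psi'(u)\, dF^*(u) \right]^{2}}
     && \text{(Fact 1)} \\
  &= \theta_j\, \sigma_j^{2}\,
     \frac{\left[ \int \psi^2(u)\, dF^*(u) \right]}
          {\left[ \int \psi'(u)\, dF^*(u) \right]^{2}} \\
  &= \theta_j \left\{ \frac{E^2}{E_j}\, \sigma_x^2 \right\}
     \frac{\left[ \int \psi^2(u)\, dF^*(u) \right]}
          {\left[ \int \psi'(u)\, dF^*(u) \right]^{2}}
     && \text{(Lemma 1)}
\end{aligned}
\]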
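Consequently, the optimization problem can be written as
\[
  \min V = \sum_{j=1}^{k} \theta_j\,
      \frac{\left[ \int \psi^2(u)\, dF^*(u) \right]}
           {\left[ \int \psi'(u)\, dF^*(u) \right]^{2}}
      \left\{ \frac{E^2}{E_j}\, \sigma_x^2 \right\}
  \quad \text{subject to} \quad
  \sum_{j=1}^{k} E_j = E
\]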
This problem has a convex objective function and a convex constraint. The
Lagrange multiplier, λ, can be used to reformulate the above problem as an
unconstrained problem.
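\[
  \min J(\lambda) = \sum_{j=1}^{k} \theta_j\,
      \frac{\left[ \int \psi^2(u)\, dF^*(u) \right]}
           {\left[ \int \psi'(u)\, dF^*(u) \right]^{2}}
      \left\{ \frac{E^2}{E_j}\, \sigma_x^2 \right\}
      + \lambda \left( \sum_{j=1}^{k} E_j - E \right)
\]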
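For a fixed $\lambda$, the Lagrangian cost is minimized if
$\partial J(\lambda)/\partial E_j = 0$, for all $j = 1, \cdots, k$.
Therefore,
\[
  \frac{\partial}{\partial E_j}
  \left\{ \theta_j\,
      \frac{\left[ \int \psi^2(u)\, dF^*(u) \right]}
           {\left[ \int \psi'(u)\, dF^*(u) \right]^{2}}
      \left\{ \frac{E^2}{E_j}\, \sigma_x^2 \right\} \right\}
  = -\lambda \quad (\text{for } j = 1, \cdots, k)
  \tag{A.2}
\]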
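Hence, $\forall\, (j, k) \in \{1, \cdots, k\}$,
\[
  \frac{\left[ \int \psi^2(u)\, dF^*(u) \right]}
       {\left[ \int \psi'(u)\, dF^*(u) \right]^{2}}\,
  E^2\, \theta_j\, \sigma_x^2\, \frac{-1}{E_j^2}
  =
  \frac{\left[ \int \psi^2(u)\, dF^*(u) \right]}
       {\left[ \int \psi'(u)\, dF^*(u) \right]^{2}}\,
  E^2\, \theta_k\, \sigma_x^2\, \frac{-1}{E_k^2}
  \tag{A.3}
\]
Since the expression in (A.3) is monotonic in $E_j$, it is necessary that
$E_j = E_k,\ \forall\, j, k \in \{1, \cdots, k\}$.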
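The conclusion of the proof can be checked numerically with a short Python
sketch that evaluates the objective for several ways of splitting a fixed
total energy among the banks. The sketch assumes equal values of
$\theta_j$ across banks, which is an assumption made here for the check,
and it folds the ratio of integrals and $\sigma_x^2$ into constant factors.
The function names and the example splits are hypothetical.

```python
def variance_objective(energies, thetas, noise_var=1.0, psi_ratio=1.0):
    """Objective sum_j theta_j * psi_ratio * (E^2 / E_j) * noise_var.

    psi_ratio stands for the ratio of integrals in (A.1), which does not
    depend on the partition once the substitution u = x / sigma_j is made.
    """
    total = sum(energies)  # E, fixed by the constraint
    return sum(
        theta * psi_ratio * (total ** 2 / e) * noise_var
        for theta, e in zip(thetas, energies)
    )


def equal_partition(total_energy, k):
    """Split the total energy equally among k filter banks."""
    return [total_energy / k] * k


# Compare the equal split against a few unequal splits of E = 64, k = 4.
thetas = [0.25] * 4
equal_cost = variance_objective(equal_partition(64.0, 4), thetas)
unequal_costs = [
    variance_objective(split, thetas)
    for split in ([10.0, 20.0, 30.0, 4.0],
                  [15.0, 17.0, 16.0, 16.0],
                  [1.0, 1.0, 1.0, 61.0])
]
```

Every unequal split in the example yields a larger objective than the
equal split, consistent with the equipartition result under the stated
assumption.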
Corollary 2. The robust k-phase PN-code correlator produces an estimate
with minimum variance if the filter banks are of equal length.
Proof. The coefficients of the PN-code acquisition filter are either +1 or
−1. Therefore,
\[
  E_j = \sum_{i=1}^{n_j} h_{ij}^2 = n_j
\]
From Theorem 2, the PN-code correlator produces an estimate with
minimum variance if $n_j = n_k,\ \forall\, j, k \in \{1, \cdots, k\}$.
REFERENCES
[1] R. Dennard, F. Gaensslen, H. Yu, V. Rideout, E. Bassous, and
A. LeBlanc, “Design of non-implanted MOSFETs with very small
physical dimensions,” IEEE J. Solid-State Circuits, vol. SC-9, no. 5,
pp. 256–258, October 1974.
[2] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated
Circuits, A Design Perspective, 2nd ed. Englewood Cliffs, NJ:
Prentice Hall, 2002.
[3] C. Constantinescu, “Trends and challenges in VLSI circuit reliability,”
IEEE Micro, vol. 23, no. 4, pp. 14–19, July 2003.
[4] N. Patil, J. Deng, A. Lin, H.-S. P. Wong, and S. Mitra, “Design
methods for misaligned and mispositioned carbon-nanotube immune
circuits,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.,
vol. 27, no. 10, pp. 1725–1736, 2008.
[5] D. Ernst et al., “Razor: circuit-level correction of timing errors for
low-power operation,” in Proceedings ACM/IEEE International
Symposium on Microarchitecture (MICRO), vol. 24, no. 6,
November-December 2004, pp. 10–20.
[6] S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. Kim, “Robust system
design with built-in soft error resilience,” IEEE Computer, vol. 38,
no. 2, pp. 43–52, February 2005.
[7] R. Hegde and N. R. Shanbhag, “Soft digital signal processing,” IEEE
Trans. VLSI Syst., vol. 9, no. 6, pp. 813–823, 2001.
[8] D. Anderson and G. Metze, “Design of totally self-checking circuits for
m-out-of-n codes,” IEEE Trans. Comput., vol. C-22, no. 3, pp.
263–269, 1973.
[9] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation
of dynamic voltage scaling algorithms,” in Proceedings of IEEE
International Symposium on Low Power Electronics and Design
(ISLPED’98), August 1998, pp. 76–81.
[10] A. Andrei, P. Eles, Z. Peng, M. T. Schmitz, and B. M. A. Hashimi,
“Energy optimization of multiprocessor systems on chip by voltage
selection,” IEEE Trans. VLSI Syst., vol. 15, no. 3, pp. 262–275, March
2007.
[11] A. Schmid and Y. Leblebici, “Robust circuit and system design
methodologies for nanometer-scale devices and single-electron
transistors,” IEEE Trans. VLSI Syst., vol. 12, no. 11, pp. 1156–1166,
November 2004.
[12] N. Bahar, R. Mundy, J. Patterson, and A. Zaslavsky, “Designing logic
circuits for probabilistic computation in the presence of noise,” in
Proceedings of the 42nd Annual Conference on Design and
Automation (DAC’05), Anaheim, CA, USA, June 2005, pp. 485–490.
[13] L. Chakrapani, P. Korkmaz, B. Akgul, and K. Palem, “Probabilistic
system-on-a-chip architectures,” ACM Transactions on Design
Automation of Electronic Systems (TODAES), vol. 12, no. 3, pp. 1–28,
August 2007.
[14] T. Austin and V. Bertacco, “Deployment of better than worst-case
design: Solutions and needs,” in Proceedings of the IEEE International
Conference on Computer Design (ICCD’05), Washington DC, USA,
2005, pp. 550–558.
[15] Z. Purser, K. Sundaramoorthy, and E. Rotenberg, “A study of
slipstream processors,” in Proceedings of the 33rd Annual ACM/IEEE
International Symposium on Microarchitecture, 2000, pp. 269–280.
[16] S. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas, “EVAL: Utilizing
processors with variation-induced timing errors,” in Proceedings of
the 41st International Symposium on Microarchitecture (MICRO),
November 2008, pp. 423–434.
[17] S. Mukherjee, Architecture Design for Soft Errors. Burlington, MA:
Morgan Kaufmann, 2008.
[18] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for
matrix operations,” IEEE Trans. Comput., vol. C-33, no. 6, pp.
518–528, June 1984.
[19] K. Pattabiraman, “Automated derivation of application-aware error
and attack detectors,” Ph.D. dissertation, University of Illinois at
Urbana-Champaign, Urbana, IL, 2009.
[20] E. Kim, R. Abdallah, and N. R. Shanbhag, “Soft NMR: exploiting
statistics for energy-efficiency,” in Proceedings of the 11th International
Symposium on System-on-Chip, 2009, pp. 52–55.
[21] J. Sartori, J. Sloan, and R. Kumar, “Fluid NMR – performing
power/reliability tradeoffs for applications with error tolerance,” in
USENIX Workshop on Power-aware Computing and Systems, October
2009.
[22] W. K. Jenkins, “Design of error checkers for self-checking residue
number arithmetic,” IEEE Trans. Comput., vol. C-32, no. 4, pp.
388–396, April 1983.
[23] T. Austin, “DIVA: a reliable substrate for deep submicron
microarchitecture design,” in Proceedings of the 32nd ACM/IEEE
International Symposium on Microarchitecture (MICRO’99), Haifa,
Israel, 1999, pp. 196–207.
[24] R. Hegde and N. R. Shanbhag, “Soft digital signal processing,” IEEE
Trans. VLSI Syst., vol. 9, no. 6, pp. 813–823, 2001.
[25] J. W. Choi, B. Shim, A. C. Singer, and N. I. Cho, “Low-power filtering
via minimum power soft error cancellation,” IEEE Trans. Signal
Process., vol. 55, no. 10, pp. 5084–5096, Oct. 2007.
[26] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “A survey
on sensor networks,” IEEE Commun. Mag., vol. 40, no. 8, pp.
102–114, 2002.
[27] D. Bertozzi and L. Benini, “Xpipes: a network-on-chip architecture for
gigascale systems-on-chip,” IEEE Circuits Syst. Mag., vol. 4, no. 2, pp.
18–31, 2004.
[28] S. Narayanan, G. V. Varatkar, D. L. Jones, and N. R. Shanbhag,
“Computation as estimation: Estimation-theoretic IC design improves
robustness and reduces power consumption,” in Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP’08), Las Vegas, USA, Apr. 2008, pp. 1421–1424.
[29] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and
Implementation. New York, NY: Wiley-Interscience, 1999.
[30] S.-J. Oh and M. Suk, “Parallel algorithms for geometric searching
problems,” in Proceedings of ACM/IEEE Conference on
Supercomputing, Reno, USA, 1989, pp. 344–350.
[31] D. Tse and P. Viswanath, Fundamentals of Wireless Communication.
New York, NY: Cambridge University Press, 2005.
[32] B. Shim, S. Sridhara, and N. Shanbhag, “Reliable low-power digital
signal processing via reduced precision redundancy,” IEEE Trans.
VLSI Syst., vol. 12, no. 5, pp. 497–510, May 2004.
[33] H. V. Poor, An Introduction to Signal Detection and Estimation. New
York, NY: Springer, 1994.
[34] P. Huber, Robust Statistics. New York, NY: John Wiley & Sons, 1981.
[35] G. V. Varatkar, S. Narayanan, N. Shanbhag, and D. L. Jones,
“Stochastic networked computation,” IEEE Trans. VLSI Syst.,
vol. PP, no. 99, October 2009.
[36] D. Senderowicz et al., “A 23mW 256-tap 8MSample/s QPSK matched
filter for DS-CDMA cellular telephony using recycling integrator
correlators,” in Proceedings of IEEE International Solid-State Circuits
Conference, ISSCC, February 2000, pp. 354–355.
[37] C. L. Lee and C. W. Jen, “A bit-sliced median filter design based on
majority gate,” IEE Proceedings G Circuits, Devices, and Systems, vol.
139, no. 1, pp. 63–71, 1992.
[38] N. Cressie, “M -estimation in the presence of unequal scale,” Statistica
Neerlandica, vol. 34, pp. 19–32, 1980.
[39] N. R. Draper and H. Smith, Applied Regression Analysis. New
York, NY: Wiley Series in Probability and Statistics, 1998.
[40] S. Kassam and H. Poor, “Robust techniques for signal processing: A
survey,” Proceedings of the IEEE, vol. 73, no. 3, pp. 433–482, March
1985.
[41] S. Morgenthaler, “A survey of robust statistics,” Statistical Methods
and Applications, vol. 15, no. 3, pp. 271–293, 2007.
[42] W. B. Wu, “M -estimation of linear models with dependent errors,”
The Annals of Statistics, vol. 35, no. 2, pp. 495–521, 2007.
[43] G. Masarotto, “Robust and consistent estimates of
autoregressive-moving average parameters,” Biometrika, vol. 74, no. 4,
pp. 791–797, December 1987.
[44] S. Portnoy, “Robust estimation in dependent situations,” Ann. Statist.,
vol. 5, no. 1, pp. 22–43, 1977.
[45] C. Field and D. Wiens, “One-step M -estimators in the linear model,
with dependent errors,” Canadian Journal of Statistics, vol. 22, no. 2,
pp. 219–231, 1994.
[46] X. Peiliang, “On robust estimation with correlated observations,” Bull.
Géod., vol. 63, no. 3, pp. 237–252, 1989.
[47] Y. Yang, L. Song, and T. Xu, “Robust estimator for correlated
observations based on bifactor equivalent weights,” Journal of
Geodesy, vol. 76, no. 6–7, pp. 353–358, 2002.
[48] S. Hekimoglu and M. Berber, “Effectiveness of robust methods in
heterogeneous linear models,” Journal of Geodesy, vol. 76, no. 11-12,
pp. 706–713, March 2003.
[49] D. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA: Athena
Scientific, 1999.
[50] Y. Eisenberg et al., “Joint source coding and transmission power
management for energy efficient wireless video communications,” IEEE
Trans. Circuits Syst. Video Technol., vol. 12, no. 6, pp. 411–424, June
2002.
[51] I. S. Chong and A. Ortega, “Dynamic voltage scaling algorithms for
power constrained motion estimation,” in Proc. IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP’07),
vol. 2, April 2007, pp. 101–104.
[52] S. Appadwedula, M. Goel, N. R. Shanbhag, D. L. Jones, and
K. Ramchandran, “Total system energy minimization for wireless
image transmission,” The Journal of VLSI Signal Processing, vol. 27,
no. 1-2, pp. 99–117, February 2001.
[53] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen,
“Fast algorithm and architecture design of low-power integer motion
estimation for H.264/AVC,” IEEE Trans. Circuits Syst. Video
Technol., vol. 17, no. 5, pp. 568–577, May 2007.
[54] G. V. Varatkar, S. Narayanan, N. R. Shanbhag, and D. L. Jones,
“Trends in energy-efficiency and robustness using stochastic sensor
network-on-a-chip,” in Proceedings of the 18th ACM Great Lakes
symposium on VLSI (GLSVLSI’08), Orlando, Florida, USA, 2008, pp.
351–354.
[55] G. V. Varatkar and N. R. Shanbhag, “Error-resilient motion estimation
architecture,” IEEE Trans. VLSI Syst., vol. 16, no. 10, pp. 1399–1412,
October 2008.
[56] Joint Video Team, “JM reference software JM10.2,”
http://iphome.hhi.de/suehring/tml/, August 2007.
[57] IBM Process Design Manual, IBM Corporation, Armonk, NY, May
2004.
[58] S. Narayanan, J. Sartori, R. Kumar, and D. Jones, “Scalable stochastic
processors,” in Proceedings of Design, Automation and Test in Europe,
DATE’10, Dresden, Germany, to be published.
[59] “ITRS 2008 update,” International Technology Roadmap for
Semiconductors, Tech. Rep., 2008. [Online]. Available:
http://www.itrs.net
[60] S. Herbert and D. Marculescu, “Variation-aware dynamic
voltage/frequency scaling,” in Proceedings of the 15th IEEE
International Symposium on High-Performance Computer Architecture
(HPCA’09), February 2009, pp. 301–312.
[61] J. Patel, “CMOS process variations: A critical operation point
hypothesis,” April 2008, unpublished. [Online]. Available:
http://www.stanford.edu/class/ee380/Abstracts/080402-jhpatel.pdf
[62] R. Kumar, “Stochastic processors,” presented at the NSF Workshop
on Science of Power Management, March 2009. [Online]. Available:
http://passat.crhc.illinois.edu/rakeshk/nsf workshop final.pdf
[63] S. Narayanan, G. Lyle, R. Kumar, and D. Jones, “Testing the critical
operating point (COP) hypothesis using FPGA emulation of timing
errors in over-scaled soft-processors,” in SELSE 5 Workshop - Silicon
Errors in Logic - System Effects, March 2009.
[64] S. Ghosh et al., “A novel low overhead fault tolerant Kogge-Stone
adder using adaptive clocking,” in Proceedings of the Conference on
Design, Automation and Test in Europe, DATE, 2008, pp. 366–371.
[65] C. T. Kong, “Study of voltage and process variations impact on the
path delays of arithmetic units,” M.S. thesis, University of Illinois at
Urbana-Champaign, Urbana, IL, 2008.
[66] A. Kahng, S. Kang, R. Kumar, and J. Sartori, “Slack redistribution for
graceful degradation under voltage overscaling,” in Proceedings of the
15th IEEE/SIGDA Asia and South Pacific Design and Automation
conference (ASPDAC), Taipei, Taiwan, 2010, pp. 825–831.
[67] A. Kahng, S. Kang, R. Kumar, and J. Sartori, “Designing processors
from the ground up to allow voltage/reliability tradeoffs,” in
Proceedings of the 16th IEEE International Symposium on
High-Performance Computer Architecture (HPCA-2010), Bangalore,
India, 2010, pp. 1–11.
[68] J. Bau et al., “Error resilient system architecture ERSA for
probabilistic applications,” in IEEE Workshop on Silicon Errors in
Logic - System Effects, SELSE, 2007.
[69] J.-W. van de Waerdt et al., “The TM3270 media-processor,” in
Proceedings of the 38th IEEE/ACM International Symposium on
Microarchitecture, Barcelona, Spain, 2005, pp. 331–342.
[70] Synopsys Design Compiler User’s Manual, Synopsys, Inc., 2009.
[Online]. Available: http://www.synopsys.com
[71] Cadence SoC Encounter User’s Manual, Cadence, Inc., 2009. [Online].
Available: http://www.cadence.com
[72] Cadence Signal Storm User’s Manual, Cadence, Inc., 2009. [Online].
Available: http://www.cadence.com
[73] B. Zhai et al., “Energy-efficient subthreshold processor design,” IEEE
Trans. VLSI Syst., vol. 17, no. 8, pp. 1127–1137, August 2009.
