High-Level Performance Estimation Framework for FPGA-based Soft Processors by Powell, Adam
Imperial College London
Department of Electronics and Electrical Engineering
High-Level Performance Estimation








Submitted in part fulfilment of the requirements for the degree of
Doctor of Philosophy in Electronics Engineering of Imperial College London
Declaration
I herewith certify that all material in this dissertation which is not my own work has
been properly acknowledged.
The copyright of this thesis rests with the author and is made available under a Creative
Commons Attribution Non-Commercial No Derivatives licence. Researchers are free
to copy, distribute or transmit the thesis on the condition that they attribute it, that
they do not use it for commercial purposes and that they do not alter, transform or
build upon it. For any reuse or redistribution, researchers must make clear to others the




During the design of complex systems, designers need to know how their algorithm
or hardware is going to perform early in the design process. Tools exist to predict
performance metrics based on low-level parameters which are difficult to extract and are
dependent on specific implementation or architecture details which are only available
in the later stages of design. There needs to exist a model that is able to predict
performance metrics based on the algorithm being executed and the architecture it is
being executed on while using easily extractable parameters available early in design.
This thesis introduces a framework for designers that assists them in the early stages
of design. By having early estimations of performance based on the underlying hardware,
greater savings can be achieved when compared to other methods which can only occur
late in the design stage. Soft processors are used in the construction of the predictive
models as they are flexible and allow for complex models to be created that explore the
relationship between algorithm and hardware parameters.
First, an accurate model for performance estimation is developed that uses both
algorithm and architecture parameters. The method for extracting meaningful parameters
of algorithms without the need for implementation is described and forms an important
basis for this work. In predicting FPGA core power and off-chip device power, the
model performs well with mean errors under 2%, while the error is slightly higher when
predicting execution time.
Next, a framework is proposed that uses this accurate model to analyze the performance
of the algorithm in question to give the designer useful guidance not present in existing
state-of-the-art approaches. The framework allows the user to see the interaction between
the algorithm and the underlying hardware. This allows for early design space exploration
that can produce more efficient hardware. Sensitivity analysis is performed in order to
assess the performance of the proposed framework under noisy input parameters that
model user uncertainty. Further, the properties of the modeling technique are used to
provide the user with a measure of prediction confidence. Finally, the framework’s ability
to predict the effect of single event upsets in the arithmetic hardware is examined. This
is done by creating additional predictive models to examine the execution time cost in





1.1 Aims of this work
1.2 Published work
1.3 Statement of original contribution
1.4 Outline
2 Literature Review
2.1 Performance Modeling
2.1.1 Predicting Power Consumption
Transistor-Level Methods
Gate-Level Methods
Microarchitecture-Level Methods
Functional Level Power Analysis Methods
Instruction Level Power Analysis Methods
Specification-Level Methods
2.1.2 Predicting Execution Time
2.2 High-Level Parameter Extraction
2.3 Design Space Exploration
2.4 Predicting Effects of Single Event Upsets
2.5 Conclusion
3 Background
3.1 Classification and Regression Trees
3.1.1 Building a Classification Tree
From Classification to Regression
3.2 Advantages
3.3 Disadvantages
3.4 Regression Forests
4 Extraction of High-Level Parameters
4.1 Algorithms Used
4.1.1 Domain Selection
4.1.2 Algorithm Overview
JPEG
JPEG 2000
JPEG XR
Vector Quantization
Quad-Tree Fractal Compression
WebP
4.1.3 Algorithm Decomposition
Discrete Cosine Transform
Huffman Coding
Discrete Wavelet Transform
Embedded Block Coding with Optimal Truncation
Photo Core Transform
JPEG XR Block Coding
Vector Quantization
Quad-Tree Fractal Compression
WebP
4.2 Parameter Selection
4.2.1 Algorithm Parameters
Type I Parameters
Type II Parameters
4.2.2 Soft-Processor Architecture Parameters
4.3 Conclusions
5 Construction of a High-Level Performance Model
5.1 Data Acquisition and Experiment Setup
5.1.1 Soft Processor Configuration
5.1.2 Power Measurement
5.1.3 Execution Time Measurement
Single-core
Dual-core
5.1.4 Algorithm Block Execution
5.1.5 Training Data Generation: Single-core
5.1.6 Training Data Generation: Dual-core
5.2 Model Validation for Single-Core Prediction Model
5.2.1 Validation Method
5.2.2 FPGA Core Power
5.2.3 Off-chip Device Power
5.2.4 Execution Time
5.3 Model Validation for Dual-Core Prediction Model
5.3.1 Validation method
5.3.2 FPGA Core Power
5.3.3 Off-Chip Device Power
5.3.4 Execution Time
5.4 Sensitivity Analysis
5.4.1 Method of Analysis
5.4.2 FPGA Core Power
5.4.3 Off-Chip Device Power
5.4.4 Execution Time
5.5 Model Benchmarking
5.5.1 Algorithm Optimizations and the DCT
5.6 Conclusion
6 A Framework for Design Guidance using a High-Level Model
6.1 Architecture Design Space Exploration
6.1.1 Cache Size Optimization
6.1.2 Logic Element Optimization
6.2 Providing Prediction Confidence
6.2.1 Measures for Prediction Confidence
Node Statistics
Outlier Measure
6.2.2 Prediction Confidence for Execution Time
6.2.3 Prediction Confidence for Core Power
6.2.4 Prediction Confidence for Device Power
6.3 Estimating Effects of Single Event Upsets
6.3.1 Background
6.3.2 Effects on Multipliers
Providing Design Guidance
Validation
6.4 Conclusions
7 Conclusions and Future Work
7.1 Summary of conclusions
7.2 Directions for future work
Table of Figures
1.1 Framework overview
2.1 Levels of abstraction and their trends
3.1 Sample classification tree
4.1 Images used for testing compression times
4.2 Test image used in data collection
4.3 Example JPEG decomposition with relevant algorithm parameters
4.4 6-level DWT dyadic decomposition
4.5 Example of quad-tree partitioning
5.1 Core power measurement delta distribution
5.2 Off-chip device power measurement delta distribution
5.3 Gprof time measurement distribution
5.4 Performance counter time measurement distribution
5.5 Single-core: Core power measurement distribution
5.6 Single-core: Device power measurement distribution
5.7 Single-core: Execution time measurement distribution
5.8 Error histograms of test sets for single-core power predictions for absolute error (a) and relative error (b)
5.9 Empirical CDFs of test sets for single-core power predictions for absolute error (a) and relative error (b)
5.10 Error histograms of test sets for single-core device power predictions for absolute error (a) and relative error (b)
5.11 Empirical CDFs of test sets for single-core device power predictions for absolute error (a) and relative error (b)
5.12 Error histograms of test sets for single-core execution time predictions for absolute error (a) and relative error (b)
5.13 Empirical CDFs of test sets for single-core execution time predictions for absolute error (a) and relative error (b)
5.14 Dual core: Core power measurement distribution
5.15 Dual core: Device power measurement distribution
5.16 Dual core: Execution time measurement distribution
5.17 Error histograms of test sets for dual-core power predictions for absolute error (a) and relative error (b)
5.18 Empirical CDFs of test sets for dual-core power predictions for absolute error (a) and relative error (b)
5.19 Error histograms of test sets for dual-core device power predictions for absolute error (a) and relative error (b)
5.20 Empirical CDFs of test sets for dual-core device power predictions for absolute error (a) and relative error (b)
5.21 Error histograms of test sets for dual-core execution time predictions for absolute error (a) and relative error (b)
5.22 Empirical CDFs of test sets for dual-core execution time predictions for absolute error (a) and relative error (b)
5.23 Example parameter estimation for Block B arithmetic operations
5.24 Adjacent values for Block B arithmetic operations
5.25 Range of values for JXR coding sensitivity analysis
5.26 Core power: Change in predictions for misestimated parameters
5.27 Core power: Relative change in predictions for misestimated parameters
5.28 Device power: Change in predictions for misestimated parameters
5.29 Device power: Relative change in predictions for misestimated parameters
5.30 Execution time: Change in predictions for misestimated parameters
5.31 Execution time: Relative change in predictions for misestimated parameters
5.32 Single-core DCTs: Execution time Spearman’s rho value distribution
5.33 Single-core DCTs: Core power Spearman’s rho value distribution
5.34 Single-core DCTs: Device power Spearman’s rho value distribution
6.1 Optimal architectures: Total power versus execution time by M9K block usage
6.2 Predicted optimums: Core power versus M9K block usage
6.3 Predicted optimums: Device power versus M9K block usage
6.4 Predicted optimums: Total power versus M9K block usage
6.5 Predicted optimums: Execution time versus M9K block usage
6.6 Optimal architectures: Total power versus execution time by LE usage
6.7 Optimal architectures: Core power versus LE usage
6.8 Optimal architectures: Device power versus LE usage
6.9 Optimal architectures: Total power versus LE usage
6.10 Optimal architectures: Execution time versus LE usage
6.11 Execution time: Training set outlier measures histogram
6.12 Execution time: Absolute error versus outlier measure
6.13 Execution time: Mean, median, and IQR versus outlier measure
6.14 Execution time: Absolute error versus node probability
6.15 Core power: Training set outlier measures
6.16 Core power: Absolute error versus outlier measure
6.17 Core power: Mean, median, and IQR versus outlier measure
6.18 Core power: Absolute error versus node probability
6.19 Device power: Training set outlier measures
6.20 Device power: Absolute error versus outlier measure
6.21 Device power: Mean, median, and IQR versus outlier measure
6.22 Device power: Absolute error versus node probability
6.23 SEU execution time cost estimation of a single architecture
6.24 SEU cost estimate of optimal architectures
6.25 Normalized SEU cost estimate of optimal architectures
Table of Tables
2.1 Power model abstraction levels with input requirements
4.1 Compression times for different algorithms over three images
4.2 Standard compression blocks used for modeling
4.3 Type I algorithm parameters
4.4 Type II algorithm parameters
4.5 NIOS 2 parameter summary
5.1 Execution time: Membership of algorithms in “tail”
5.2 Single core: Core power per algorithm fold performance
5.3 Single core: Core power ensemble variable importance
5.4 Single core: Device power per algorithm fold performance
5.5 Single core: Off-chip device power ensemble variable importance
5.6 Single core: Execution time per algorithm fold performance
5.7 Single core: Execution time ensemble variable importance
5.8 Dual core: Core power per algorithm fold performance
5.9 Dual core: Core power ensemble variable importance
5.10 Dual core: Device power per algorithm fold performance
5.11 Dual core: Off-chip device power ensemble variable importance
5.12 Dual core: Execution time per algorithm fold performance
5.13 Dual core: Execution time ensemble variable importance
5.14 Ensemble variable importance: Single core models
5.15 Core power sensitivity: Percentiles for absolute and relative changes
5.16 Device power sensitivity: Percentiles for absolute and relative changes
5.17 Execution time sensitivity: Percentiles for absolute and relative changes
5.18 DCT algorithms
6.1 Test vector distribution statistics
6.2 Core power: Test vector distribution statistics
6.3 Execution time per multiplier performance
6.4 Execution time cost error performance
Acknowledgements
Foremost I’d like to thank my wonderful fiancée Alexis for her endless help, support, and
patience throughout this entire process. Without her I wouldn’t have been able to do
this and I can’t thank her enough. My family has also given me their support throughout
my PhD as well as my decision to move to the UK. They’ve helped me a great deal and
made the transition much easier. I’d also like to thank Alexis’ family–Don, Jen, and
Gareth–for everything they’ve done for me over the years during my PhD.
Secondly I’d like to thank all of my amazingly brilliant colleagues from Imperial
College London whose insight and knowledge helped me through both the academic and
non-academic parts of my PhD. And special thanks to David Jones for everything he’s
done, from helping early on to the proofreading on my thesis.
Finally I’d like to thank my supervisors Peter Y.K. Cheung and Christos Savvas-
Bouganis for their support and guidance.
1 Introduction
Embedded applications are becoming increasingly important as consumer demands are
pushing electronic devices to become more ubiquitous. At the same time, applications are
becoming more computationally demanding, which increases power consumption and
therefore reduces battery life. Consumers, however, want more power-efficient hardware
with an end result of longer battery life or faster computations. One way to increase
the efficiency of hardware is through the use of modeling techniques. By representing
aspects of the system using a model, behavior of the system or the application running
on it can be determined prior to implementation and ultimately changed to increase
efficiency. The model described in this work allows designers to examine how their
processor architecture and application will perform without the need to fully implement
their design. By knowing how their processor architecture and application will perform,
designers can produce more efficient hardware earlier in the design process. These gains
in efficiency can be increased by moving the modeling process earlier in the design
process where there is greater scope for changes in the design. According to Rabaey [76],
early optimizations can result in over 70% energy savings; this reduces quickly for each
stage that is later in design.
Soft processors allow for quick design and use of microprocessors for embedded
systems. Though possible on a number of technologies, this work focuses on soft
processors implemented on Field Programmable Gate Arrays (FPGAs). The use of
soft processors allows for customizable processors that can be changed according to the
requirements of the application, domain, or budget. Additionally, soft processors allow
for the relationship between algorithms and the architectures they run on to be explored.
This would not be possible, at least not to the degree here, with conventional hard
processors.
Due to the large number of parameters available to customize soft processors, exploration
of this architecture space is a lengthy process. For only a basic set of parameters
(cache sizes and multiplier types), the generation of all possible architectures can take 25
days for the Altera NIOS 2; each architecture combination takes roughly six minutes to
generate on a quad Core 2 machine which was used for this work. Including other useful
components, such as a Memory Management Unit (MMU) or a Memory Protection Unit
(MPU), can increase generation time by as much as three to six orders of magnitude.
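The figures quoted above can be checked with a short calculation. The sketch below works backwards from the two stated numbers (25 days in total, roughly six minutes per architecture); the resulting count of about 6,000 architectures is inferred here and is not a figure stated in the text, and the helper name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def exhaustive_generation_days(num_architectures: int, minutes_each: float) -> float:
    """Days needed to generate every architecture one after another."""
    return num_architectures * minutes_each / MINUTES_PER_DAY


# Inferred, not quoted: 25 days at about 6 minutes each implies this many combinations.
implied_architectures = round(25 * MINUTES_PER_DAY / 6)

# Round trip: generating that many architectures takes the quoted 25 days.
days = exhaustive_generation_days(implied_architectures, 6)
print(implied_architectures, days)
```

The same helper makes it easy to see how quickly the total grows once further components multiply the number of combinations.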
At this point, the desired application must be executed on each combination of
architecture parameters in order to find more efficient architectures. This is a time-
consuming process which must be done for each application that needs to be implemented.
Ultimately it becomes a problem of selecting the correct architecture for an application
in a very large parameter space; the “correct” architecture for a system can be the most
resource efficient architecture, the fastest, or the one that consumes the least amount of
power. If the parameter space is sampled and only a manageable number of combinations
are tested, there is a high probability that the solution will be a less efficient one. If system
designers could have access to a tool that provides more efficient architectures early in
the design process with only a small amount of time dedicated to the exploration, large
gains could be had in efficiency as well as a decrease in the total cost.
1.1 Aims of this work
By having a profile of power consumption and execution time as a function of hardware
architecture parameters and high-level parameters that capture the main computational
and memory access characteristics of the algorithm, early design space exploration
can be performed and the user can be provided with design guidance. The informal
question driving this work was “How early can you actually predict power consumption
and execution time accurately?” If these algorithm parameters are high level enough,
they can be extracted or estimated by the designer without the need for an actual
implementation. However, to do this the algorithm space must be constrained in order
to be able to extract these high-level parameters. Ultimately this work allows for earlier
estimates than previous work at the expense of a constrained application space.
Previous work on performance modeling varies greatly as one moves from low levels
of abstraction to higher ones. The lowest levels of abstraction–such as those at the
transistor or gate level–require knowledge of the circuit layout of the hardware in question
as well as considerable computational resources, as the solutions typically involve solving
systems of partial differential equations. A higher level of abstraction–such as the
register-transfer level–requires detailed knowledge of individual signal switching activity
due to the application being executed. At the highest levels of abstraction–such as the
instruction level–detailed analysis of the source code is performed to predict the power
consumption of individual instructions as well as sets of instructions. These methods are
flexible in terms of specific applications but are very dependent on the characterization
of the microprocessor as well as the specific configuration of the whole system. Other
methods abstract to the specification level, using parameters such as cache size and
other architecture parameters to predict power. However, these methods are limited in
terms of numbers of architecture parameters used or the range of application parameters
used. These approaches and drawbacks will be discussed later in the literature review.
This work seeks to, ultimately, provide a tool for system designers that allows for
earlier performance estimates compared to previous work as well as design guidance
at the specification level using descriptive parameters of the applications. By using
this tool, designers can forgo the lengthy generation and data collection process to get
performance estimates of their hardware and application. Note that “performance” is
used here to mean both power consumption and execution time.
In order to provide early estimates, there must be some information about the
application that will be executed. As mentioned above, if these algorithm parameters
are high level enough, they can be easily extracted by the designer early in the design
process. The basic use of the tool is to provide performance estimates using both
architecture and these high-level algorithm parameters as inputs; the output would be
the power consumption and execution time of each architecture combination which gives
the designer an early estimate of these metrics. For the designer, this means time and
money have been saved in addition to the efficiency gains that accompany this early
modeling. To do this, the performance of the system must be characterized. Machine
learning techniques are used here to capture the effects of the soft-processor architecture
parameters and the application parameters in order to predict performance.
Here, a high-level parameter is defined as a parameter of the algorithm or architecture
that is easily obtained, usually before implementation of the system itself. Note that
while these are easy to obtain, the effect of these parameters is significantly more difficult
to discover. Architecture parameters include concepts such as cache size, cache page
replacement algorithm, pipeline depth, or multiplier type. Algorithm parameters include
numbers of arithmetic or memory operations, ratio of multiply operations to additions,
and the presence of division operations or floating-point operations.
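To make the definition concrete, the following sketch shows one way such high-level parameters could be gathered into a single model input. It is an illustration only: the class names, field names, and example values are hypothetical and are chosen to mirror the examples listed above, not taken from the parameter tables used later in this work.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ArchitectureParams:
    # Hypothetical fields mirroring the architecture examples in the text.
    cache_size_bytes: int
    cache_replacement: str
    pipeline_depth: int
    multiplier_type: str


@dataclass(frozen=True)
class AlgorithmParams:
    # Hypothetical fields mirroring the algorithm examples in the text.
    arithmetic_ops: int
    memory_ops: int
    multiply_ops: int
    add_ops: int
    has_division: bool
    has_floating_point: bool


def model_inputs(arch: ArchitectureParams, algo: AlgorithmParams) -> dict:
    """Merge both parameter groups into one flat set of model inputs."""
    features = {**asdict(arch), **asdict(algo)}
    # Derived parameter: ratio of multiply operations to additions.
    features["multiply_to_add_ratio"] = (
        algo.multiply_ops / algo.add_ops if algo.add_ops else float("inf")
    )
    return features


# Illustrative values only; none of these numbers come from measurements.
arch = ArchitectureParams(4096, "lru", 5, "embedded")
algo = AlgorithmParams(1200, 800, 300, 900, False, False)
inputs = model_inputs(arch, algo)
print(inputs)
```

Every field can be written down from a specification of the processor and a description of the algorithm, which is the property that makes the parameters "high level" in the sense used here.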
The tool is then extended to provide additional functionality to the designer. In
addition to typical concerns such as power consumption and execution time, the tool
allows designers to see how architecture parameters affect other metrics as well. First,
the designer is able to see the effect of architecture parameters on resource usage and
optimize for this concern. This allows for efficient use of resources, which means resources
that would otherwise be wasted on the application in question can be used on other
aspects of the system. Second, the designer is able to see how the choice of architecture
affects a growing concern in electronics: single event upsets (SEUs). An SEU is a change
of state in a memory cell that is caused by ionizing radiation and can cause incorrect
behavior in hardware. SEUs are a growing concern due, in part, to shrinking process sizes
and operating voltages (discussed in detail later). It is therefore increasingly important
that designers are aware of such phenomena and how SEUs can affect the behavior of
their system.
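As a simple illustration of the definition above, an SEU can be thought of as a single bit flipping in a stored word. The sketch below is a software analogy only, with a hypothetical helper name and made-up values; it does not model the FPGA hardware or the fault behavior examined later in this work.

```python
def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Return `word` with one bit inverted, as a single upset would leave it."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the word")
    return (word ^ (1 << bit)) & ((1 << width) - 1)


stored = 3                   # value originally held in the cell
upset = flip_bit(stored, 4)  # bit 4 flips, so the cell now reads 19

# Any arithmetic that consumes the corrupted operand is silently wrong.
product_expected = stored * 7
product_observed = upset * 7
print(upset, product_expected, product_observed)
```

Flipping the same bit again restores the original value, which is why a detected upset can in principle be corrected once its location is known.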
For this work, the scope is limited to only the prediction of power and performance
for image compression methods running on FPGA-based soft processors. The use of
high-level algorithm parameters constrains the domain to algorithms and applications
which share the same execution characteristics. Therefore the ability of this technique to
be generalized to other application domains is limited. However, architecture parameters–
such as cache size–are shared by all processors. To this end, the flexibility of soft
processors can be used to generate power and performance models that can be applied
more generally to soft processors as well as hard processors. The use of soft processors
allows for the relationship between architecture and application to be explored. How
this work can be generalized is discussed in the evaluation of the predictive performance
of the model.
An overview of such a system is shown in Figure 1.1.
Ultimately this work will result in a tool for designers that allows them to obtain
predictions of the performance of their algorithms as a function of the high-level parameters of those algorithms. Further,
the designer will be able to see how the architecture parameters affect these metrics,
allowing for early design space exploration. To find the most efficient architecture, the































Figure 1.1: Framework overview
and money to do. Using this tool allows the designer to forgo this process, allowing for fast,
cheap design space exploration within the confines of the tool. As the tool is based
on a pre-generated model, the designer will not require the computing resources or
measurement tools and capabilities in order to obtain power consumption and execution
time.
This work assumes nothing about the user’s system requirements or constraints. That
is, more efficient architectures can be found for power versus execution time but these
architectures can vary greatly in the number and type of resources used. Therefore this
tool will allow users to examine how the power and performance of the system changes
with respect to design choices and the resource implications of those design choices.
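One common way to compare architectures on power versus execution time without fixing the user's constraints is to keep only the non-dominated candidates (a Pareto front). The sketch below assumes that approach purely for illustration; it is not claimed to be the selection method used in this work, and the names and numbers are made up.

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    name: str
    power_mw: float
    time_ms: float


def pareto_front(estimates):
    """Keep estimates that no other estimate beats on both power and time."""
    front = []
    for e in estimates:
        dominated = any(
            o.power_mw <= e.power_mw
            and o.time_ms <= e.time_ms
            and (o.power_mw < e.power_mw or o.time_ms < e.time_ms)
            for o in estimates
        )
        if not dominated:
            front.append(e)
    return front


# Illustrative predictions only; these are not measured or modeled values.
candidates = [
    Estimate("A", 120.0, 40.0),
    Estimate("B", 100.0, 55.0),
    Estimate("C", 130.0, 60.0),  # worse than A on both metrics
    Estimate("D", 90.0, 80.0),
]
front_names = [e.name for e in pareto_front(candidates)]
print(front_names)
```

The remaining candidates still differ in the resources they consume, so the final choice is left to the designer.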
This work involves three parts: parameter selection, performance evaluation, and the
application of the model to a framework for design space exploration.
First, parameter selection is the process of extracting the previously mentioned high-
level parameters from any given application. These parameters also include architecture
parameters of the soft processor.
Performance evaluation is the construction and validation of a model using these
parameters. A model is generated which is based on a machine learning technique called
a regression tree. The model is then validated and analyzed for its correctness.
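As an illustration of the technique named above, the following is a minimal sketch of a regression tree, assuming squared-error splits and mean-valued leaves. It is not the model built in this thesis: the feature names (cache size, hardware multiplier) and the data values are synthetic and chosen only to make the example runnable.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature index, threshold) that most reduces SSE, or None."""
    best, best_cost = None, sse(targets)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            if not left or not right:
                continue
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best, best_cost = (f, t), cost
    return best

def build_tree(rows, targets, depth=0, max_depth=3):
    """Recursively split the data; a leaf stores the mean of its targets."""
    split = best_split(rows, targets) if depth < max_depth else None
    if split is None:
        return {"leaf": mean(targets)}
    f, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf value."""
    while "leaf" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

# Synthetic, illustrative data only:
# each row is (cache size in KB, hardware multiplier present 0/1),
# each target is an execution time in arbitrary units.
rows = [(2, 0), (2, 1), (8, 0), (8, 1)]
times = [10.0, 6.0, 8.0, 4.0]
tree = build_tree(rows, times)
```

Calling `predict(tree, (4, 1))` returns the value of the leaf whose region contains the unseen configuration, which is how such a tree produces an estimate for an architecture that was never measured.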
Lastly, a framework is constructed that uses this model to perform design space
exploration and provide useful design guidance.
1.2 Published work
This work is supported by two peer-reviewed publications. The first [72] is published
in the proceedings of the 2012 Field Programmable Logic conference, entitled “Early
performance estimation of image compression methods on soft processors.” This paper
is the first publication of this work and discusses high-level parameter extraction of
algorithms in addition to evaluation of the model performance.
The second [73] is published in the Journal of Systems Architecture, entitled “High-Level
Power and Performance Estimation of FPGA-based Soft Processors and its Application
to Design Space Exploration.” This work expands on the previous work by
improving the parameter extraction and increasing the number of algorithms used, which
led to an increase in model accuracy. Also, the concepts of prediction confidence and
design space exploration were examined.
1.3 Statement of original contribution
The original contributions of this work, as supported by the published work, are
• extraction of effective high-level parameters of algorithms for a single domain;
• construction of a performance model using these parameters for single- and dual-
core soft-processor systems;
• providing sensitivity analysis of this model based on human error in parameter
estimation;
• performing design space exploration based on resource usage and its impact on
performance;
• providing prediction confidence for individual predictions based on the relationship
between input vector and training data; and
• estimating effects of single event upsets on arithmetic hardware.
1.4 Outline
This thesis is organized as follows.
Chapter 2 reviews and analyzes previous work done in the fields relevant to this thesis.
Chapter 3 covers relevant technical background information.
Chapter 4 discusses the method developed for high-level parameter extraction.
Chapter 5 discusses model construction and model validation.
Chapter 6 discusses the use of the model constructed in the previous chapter to form
a framework to be used by designers to assist in the design process.
Chapter 7 concludes the thesis and discusses directions for future work.
2 Literature Review
This literature review focuses on previous work done with respect to the contributions
set forth by this thesis. As this thesis covers a relatively diverse set of topics, this review
will be divided such that each section will relate to the various contributions of the thesis
itself. By doing this, it can be seen how the thesis contributes to the different topics.
First, previous methods of modeling both power consumption and execution time
will be discussed. This section discusses the various methods of modeling these metrics
and how they relate to the design process. Low-level methods of modeling will be
discussed first with the discussion progressing to higher levels of abstraction. Power
consumption and execution time will be separated here as the modeling methods can
differ significantly. However, power consumption and execution time are both highly
correlated to the application being executed so this aspect will be discussed in context
with the previous work.
After the existing methods of prediction are discussed, how these methods extract and
use parameters of algorithms will be discussed.
Second, it will be discussed how modeling methods have been used for design space
exploration (DSE) purposes. Then expanding on this, previous methods for predicting




2.1.1 Predicting Power Consumption
The power consumption of devices has always been of interest to engineers with the
modeling of this power consumption being equally important. There are many levels of
abstraction by which power consumption can be modeled, ranging from the transistor
level and going all the way to the architecture specification level.
The different levels of modeling impose different demands on the user of these tools. For
instance, low-level methods require detailed knowledge of the underlying implementation
of hardware. In many cases this is not available for commercial off-the-shelf (COTS)
hardware. Further, the type of analysis that is performed is equally diverse among the
different levels of abstraction. This work is concerned with high-level methods that can
predict the power consumption of large systems; attempting to predict whole system
power using low-level methods would not be done in practice and would be virtually
impossible in practical terms. Low-level methods–transistor, gate, and architecture–are
included here, albeit briefly, for completeness.
Figure 2.1 shows the different levels of abstraction and their relationship to accuracy,
speed of analysis, stage of design, and typical system sizes. FLPA and ILPA refer to
Functional Level Power Analysis and Instruction Level Power Analysis which are both
high-level methods that will be discussed later. Higher levels of abstraction mean the
analysis is faster to perform on larger systems at an earlier stage of design at the cost of
accuracy. Another aspect is the possible energy savings that can be gained by having
predictive models at each of these abstraction levels; once transistor and gate-level
descriptions of a circuit are obtained, it is often too late to fix any high power problems
the circuit might have. To this end, early estimates are much more useful to
designers [37].
The concept of interest here is the stage of design at which the analysis can be
performed; it will be shown later that the level of accuracy does not decrease much even
at the highest level of abstraction. Table 2.1 shows the different levels of abstraction
and their required inputs.
Figure 2.1: Levels of abstraction and their trends
Table 2.1: Power model abstraction levels with input requirements
Input requirements: transistor layout, gate layout, signal activity, component activity, source code, architecture parameters, algorithm parameters
Abstraction level
Transistor X X X X X X X
Gate X X X X X X
µ-architecture X X X X X
FLPA X X X
ILPA X X X
Specification X X
Transistor-Level Methods
Late in the design phase of VLSI circuits, the circuit layout and nets are available and
transistor-level power analysis can be performed. However, power analysis in this way is
done in the continuous time domain using linearized differential equations [42] which
results in a computationally demanding method of analysis which makes it unrealistic to
calculate for any non-trivial number of transistors.
Popular tools for transistor-level power analysis are PowerMill [40] and SPICE [63]
which can be used to estimate the power consumption of a single transistor or small
circuit defined by its transistor layout. Further, they can also be used to estimate the
power of larger circuits such as flip-flops or amplifiers.
The main advantage of transistor-level techniques is their accuracy, providing detailed
information on voltages and currents throughout the circuit's period of operation, albeit at a
very late stage in design. Further, the requirements of transistor-level models preclude
them from performing early design stage power estimations as well as preventing these
methods from practically being able to estimate the power consumption of different
applications. This would require each individual application to be defined in terms of
voltages and currents for each net in the design which is unrealistic.
Gate-Level Methods
Gate-level methods abstract out of the transistor level by providing power models based
on the type of gates present in a design. Characterizing the power consumption of
particular gates–usually done in CMOS–allows for a system-level estimate of power
consumption for small designs.
Like transistor-level methods, gate-level methods provide high levels of accuracy
compared to other methods due to the precise, closed-form models that are available
at such a low level of abstraction. Popular methods for gate-level estimation have
been developed by Chou [18, 19] and improved upon by Saxena [78], Gupta [36], and
Soeleman [90]. A more thorough survey can be found in Najm [64].
As gate-level descriptions of systems are only a small level of abstraction from transistor-
level descriptions, gate-level methods are equally unsuitable for both early stage power
estimation as well as application-level power estimation. Further, such a low level of
abstraction means that once the design has been specified to the gate level, it is often
too late to provide a meaningful solution to high power consumption problems.
Microarchitecture-Level Methods
The next level of abstraction from the gate-level is the microarchitecture level. This
level is where larger blocks of combinational and sequential logic are defined such as
registers, register files, memory controllers, and ALUs. Further, this level of abstraction
represents the first level of abstraction that is no longer dependent on a specific layout
of transistors.
Also included in these works are register-transfer level (RTL) methods. The register-
transfer level is above the gate-level description of a circuit and describes the flow of
signals from registers and the digital logic operating on these signals between registers.
Descriptions of circuits on the register-transfer level are written in a hardware description
language (HDL) such as VHDL or Verilog. In some surveys these two levels are distinct
but the two are similar enough in the context of this review that they will be considered
at the same level of abstraction.
The methods here rely on creating models based on the statistics of the input signal
characteristics. These statistics include the transition probability, the autocorrelation
coefficient, and the distribution parameters of the input signal. These statistics will affect
each component separately depending on the underlying technology and implementation.
One of the earliest works on the architecture level was done by Powell [74] which uses
the concept of the Power Factor Approximation (PFA) to estimate the power dissipation
of memory and ALUs. The PFA of a component depends on the usage level of the device
and the model of power consumption is generated using this as an input.
Other methods use a variety of input signal characteristics in order to generate power
models. Clarke et al. [20] constructs power models based on assumptions about the
distribution of the input signals as well as the word length of these signals. Others use
the autocorrelation coefficient and transition probabilities [46, 49], transition density and
probabilities [105], as well as spatial correlation [37]. In fact, there has been much work
done in this field but all run along this same vein of exploiting input signal characteristics.
Some popular tools for performing microarchitecture-level power estimation are Sim-
plePower [103], Wattch [11], and SoftWatt [38]. SimplePower uses a transition-based
approach which calculates the switching capacitance of each transition for each com-
ponent using a look-up table. A big issue here is the generation of this set of look-up
tables as it needs to be done for each architecture. Additionally, the size of these look-up
tables increases exponentially with the number of inputs.
Conversely, Wattch and SoftWatt use a fixed-activity model and instead count the
number of accesses of a specific component and the switching capacitance associated
with that component. This results in a much faster estimation. These tools are all
cycle-accurate simulators and require a compiled version of the application to be tested.
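The fixed-activity idea described above can be sketched as follows. This is not the implementation used by Wattch or SoftWatt; it assumes the common CMOS approximation that each access dissipates an energy of C·Vdd² (some formulations include a factor of one half), and the component names and values are hypothetical.

```python
def fixed_activity_energy(access_counts, switched_capacitance, vdd):
    """Estimate dynamic energy from per-component access counts.

    access_counts:        component name -> number of accesses in the run
    switched_capacitance: component name -> effective capacitance per access (pF)
    vdd:                  supply voltage (V)
    Returns energy in pJ under the assumed C * Vdd^2 per-access model.
    """
    return sum(
        access_counts[name] * switched_capacitance[name] * vdd ** 2
        for name in access_counts
    )

# Hypothetical components and values, for illustration only.
counts = {"alu": 100, "icache": 200}
capacitance_pf = {"alu": 2.0, "icache": 5.0}
energy_pj = fixed_activity_energy(counts, capacitance_pf, vdd=1.0)
```

Because only counters are updated during simulation, no per-transition table look-up is needed, which is consistent with the faster estimation noted above.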
Microarchitecture-level models can provide high levels of accuracy but are still slow
to perform power estimation, especially cycle-accurate simulators. They also require
low-level circuit details or an HDL description of the target architecture. Additionally,
they are unable to provide simple power consumption estimates for specific applications
due to the low-level nature of the model inputs as they require detailed knowledge of
the signals generated by an application to various components of the design.
Functional Level Power Analysis Methods
Functional-level power analysis (FLPA) further abstracts from the register-transfer level by
considering groups of components as functional units. First introduced by Laurent et
al. [51], FLPA uses these functional units as the basis of its analysis.
For example, Schneider [79] decomposes the system into the fetch, processing, clock,
L1 cache, co-processors, EDMA/QDMA, and internal memory units. Much of the early
work on FLPA was done for Digital Signal Processors (DSPs). These methods use
parameters such as the parallelism and processing rate. Zipf [107] generated an FLPA
model for soft processors and decomposes the system into a smaller number of units:
data cache, instruction cache, and integer units.
Functions for power or energy consumption are then derived which depend on the
interactions between the different functional units of the system. For instance, between
the external memory and instruction cache there exists the cache miss rate as well as
the instruction read rate. Similarly, interactions between the external memory and the
data cache depend on the data access rate. Other parameters include the dispatch/fetch
rates and the degree of parallelism of the application. These parameters differ slightly
between works but maintain the same theme and requirements of analysis. Schneider [79]
considers a combined cache miss rate, while Ibrahim [42] considers the cache miss rate
for read and writes separately.
In many situations, the required parameters (cache miss rates, dispatch rates, etc.)
will not be available to designers as they require complicated analysis of the source code
or special profiling tools, which may not be available. Further, these parameters are very
dependent on the application being executed and on the properties of the architecture
as well. A proposed change in cache size requires a complete re-characterization of the
system in order to make revised estimates. FLPA methods rely on the specific architecture
parameters and system configuration, though not as much as instruction-level methods
(discussed next).
Instruction Level Power Analysis Methods
Instruction-level power analysis (ILPA) marks the first of the high-level methods for
power estimation. Introduced by Tiwari [96], ILPA attempts to estimate the power
consumption of an application by first characterizing the energy consumption of each
instruction in the instruction set. By summing the energy of all the executed instructions,
an estimate of the total energy consumed can be obtained. In addition to the base energy
cost of an instruction, there are additional inter-instruction costs which occur due to
sequential execution of two instructions. These can be calculated via the Control Flow
Graph (CFG) [94], reference instructions [65], or Colored Petri Nets (CPN) [13, 57].
The cost of a given instruction can change due to inter-instruction effects such as a
different circuit state prior to instruction execution, pipeline stalls, and cache misses.
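The summation described above can be sketched as follows, assuming the per-instruction base costs and pairwise inter-instruction overheads have already been characterized for the target processor. The instruction names, cost values, and the `other_effects` term (standing in for stalls and cache misses) are hypothetical and in arbitrary energy units.

```python
def ilpa_energy(trace, base_cost, pair_overhead, other_effects=0.0):
    """Estimate the energy of an executed instruction trace.

    trace:         list of opcodes in execution order
    base_cost:     opcode -> base energy cost of that instruction
    pair_overhead: (previous opcode, current opcode) -> extra cost
                   caused by executing the two in sequence
    other_effects: lump-sum cost for effects such as stalls and misses
    """
    total = sum(base_cost[op] for op in trace)
    for prev, curr in zip(trace, trace[1:]):
        total += pair_overhead.get((prev, curr), 0.0)
    return total + other_effects

# Hypothetical characterization data, for illustration only.
base = {"add": 1.0, "mul": 3.0}
overhead = {("add", "mul"): 0.5, ("mul", "add"): 0.25}
estimate = ilpa_energy(["add", "mul", "add"], base, overhead)
```

The tables `base` and `overhead` are tied to one processor and system configuration, which is the limitation of ILPA discussed in the remainder of this section.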
Tiwari et al. [96] developed per-instruction energy costs for two specific processors,
as indeed this must be done for any processor used in ILPA methods. Brandolese [6]
attempted to make this slightly more general by extending the idea of inter-instruction
costs to encompass architectural aspects. Ou [69] performed ILPA on the MicroBlaze
soft processor, giving the reason that previous power estimation methods could not be
properly applied to soft processors due to their unique construction.
ILPA has been further abstracted to the function level by Qu [75]. JouleTrack [87] is
a tool that has been developed as an optimization framework built on instruction-level
power estimation.
There has also been work done in using hybrid models that incorporate aspects
from both FLPA and ILPA. These methods characterize the energy consumption of an
instruction as well as the activity levels of individual components. One of the first was
Mehta [58] which considered fine-grained decomposition of the datapath and control
components, such as shifters, multipliers, and interconnect. Fei [25] takes a similar
route for extensible processors, requiring dynamic resource usage statistics to determine
additional power consumption due to custom instructions and hardware. Zipf [107]
combines ILPA and FLPA methods by separating the effect of instructions into functional
units of the processor; in this case, the functional units are the instruction cache, the
data cache, and the integer unit. Within these functional units, the effect of cache misses
on the power consumption is examined and used in the modeling process.
Ultimately, the issue with ILPA methods is the dependence of the characterization on
the underlying architecture. There is little way to know how a particular algorithm is
affected by architecture parameters, such as instruction cache size or multiplier type.
Therefore ILPA methods are able to work with any application but are limited to the
particular system configuration the model was built on.
Specification-Level Methods
The highest level of abstraction is the specification level. The specification level includes
methods that rely on high-level parameters in order to produce estimations. Specifically,
these parameters will not require implementation of either the processor or algorithm.
These will be parameters such as cache size, types of arithmetic operations, or clock
frequency.
Senn et al. [81], further abstracting from their FLPA work, modeled the power
consumption of the NIOS 2 based on the cache miss rate from the FLPA as well as
frequency and temperature, which are two high-level parameters. This modeling was
done using simple linear regression and only on a single application: the Dhrystone
benchmark. Therefore the work does not show the effect that the application has on the
performance metrics.
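For reference, simple linear regression of the kind mentioned above can be sketched as an ordinary least-squares fit of one response against one predictor. This is not the model of Senn et al. [81], which used several inputs; the single predictor (clock frequency) and the data values below are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data, for illustration only:
# clock frequency in MHz against measured power in mW.
frequency_mhz = [50.0, 100.0, 150.0]
power_mw = [110.0, 210.0, 310.0]
slope, intercept = fit_line(frequency_mhz, power_mw)
```

A model fitted on a single application in this way carries no term for the application itself, so it cannot distinguish between different algorithms.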
Abstracted further than ILPA methods, Li [54] performs linear modeling on operating
system (OS) level routines. The only input to the model is the number of Instructions
Per Cycle (IPC). The authors construct models on both the OS and OS routine levels
relying on IPC that have varying levels of accuracy. Though the actual value for IPC for
a given application requires implementation, it is possible to estimate this value based
on the apparent parallelism of this application.
Cambre et al. [15] evaluated energy usage of the NIOS 2 based solely on instruction
and cache sizes. Using a number of different applications, the authors present a simplistic
linear model for the prediction of energy for an algorithm based on its temporal complexity
(i.e. execution time versus data size). The work shows the relationship of algorithm
complexity to energy usage which abstracts away from the hardware level. In doing
this, architecture parameters are lost and it becomes impossible to use the model for
applications such as early design space exploration.
Cambre et al. [14] also performed a similar study to their cache size experiment by
examining the effect of arithmetic hardware on the energy usage of various algorithms
but did not attempt to construct a predictive model. Instead, they made generalizations
about the most efficient choice of arithmetic hardware; the conclusion was that, simply,
choosing hardware implementations of arithmetic operations was more energy efficient
than having software-based operations. However, little consideration was given to the
concept of resource allocation.
Along the same vein, Givargis et al. [34] performed an experiment attempting to
optimize for power rather than energy using cache sizes in addition to CPU-cache bus
parameters. Similar to the work mentioned in the previous paragraph, little consideration
is given to the problem of resource allocation.
While the models so far have used only a small number of parameters, Azizi et al. [2]
use a larger number of parameters. These include the branch target buffer (BTB) size,
cache sizes, fetch/decode/register renaming latencies, ALU latency, DRAM latency, and
so on. However, the work uses only a small subset of applications taken from the SPEC
CPU benchmarks and does not include any application parameters in the modeling
process; this confines the model to only a small subset of applications.
Similarly, Lee et al. [52] consider a large number of architecture parameters. These
parameters include cache sizes, pipeline depth, register counts, memory latency, control
latency, and ALU latency. In addition to these, the work considers a number of
application-specific parameters such as cache misses, instructions per second, branch
statistics, and stall statistics. However, these parameters are taken from a single execution
on a “baseline” configuration of architecture parameters. By doing this, some aspects
of the specific application can be captured in the modeling process while not having to
explicitly generate all possible architectures. Obtaining the application-specific
parameters requires an implementation of both the processor and the algorithm, although
the approach provides much more flexibility in the early design stage than most other approaches.
2.1.2 Predicting Execution Time
The techniques for predicting power consumption that were at a low level of abstraction
have requirements that make it difficult to represent a high-level idea, such as an
algorithm, at that level. Predictive models for power consumption are only interested in
how various components contribute to the overall power consumption of a system based
on low-level parameters.
Predicting execution time is more difficult as the factors that contribute to execution
time have high-level, complicated interactions. All power consumption models are
based on power consumption of an individual transistor while execution time is based
on the relationship between individual component latencies, cache sizes, memory speeds,
pipeline depth, and so on. Therefore these methods must be sufficiently high level to
allow for the distinction between different applications. Execution time predictions come
at the ILPA/FLPA level of abstraction and higher; defining an application in terms of
low-level parameters such as transition probability is difficult and impractical.
Existing methods predict execution time in three different ways. The first is predicting
the energy of the execution which combines execution time with the power prediction
discussed in the previous section. The second is predicting the speed of execution
reported in instructions per second. The last is predicting explicitly the execution time
or the number of cycles required for execution.
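The three quantities above are related by simple arithmetic, sketched below. The relations themselves are standard (time is cycles divided by clock frequency, energy is average power multiplied by time, and speed is instructions divided by time); the function names and numeric values are hypothetical.

```python
def execution_time_s(cycles, clock_hz):
    """Execution time from a cycle count and a clock frequency."""
    return cycles / clock_hz

def energy_j(average_power_w, time_s):
    """Energy of an execution as average power multiplied by time."""
    return average_power_w * time_s

def instructions_per_second(instruction_count, time_s):
    """Execution speed expressed as instructions per second."""
    return instruction_count / time_s

# Hypothetical values, for illustration only.
t = execution_time_s(cycles=1_000_000, clock_hz=100e6)  # 0.01 s
e = energy_j(average_power_w=0.5, time_s=t)
ips = instructions_per_second(instruction_count=500_000, time_s=t)
```

A method that reports only energy therefore hides execution time unless the power estimate is also available.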
Virtually all ILPA-based methods [3, 13, 29, 33, 57, 58, 65, 87, 96] predict the average
energy consumption of each instruction which is the product of power and execution
time. The sum of the costs for every instruction in an application will give an indication
as to the energy efficiency of this application but little else. This method also requires
characterization of a complete system (processor, off-chip RAM, etc.) and is therefore
difficult to generalize to different system configurations or even different parameters
of the processor itself. The characterization process is laborious as each instruction
must be characterized as well as the interactions between instructions. Further, simply
knowing the energy consumption of an instruction does not give an indication as to
underlying reasons for why different combinations of instructions or configurations
consume different amounts of energy; this makes it impossible then for ILPA methods
to assist in architecture choices as the energy costs for an instruction set are exclusive
to a particular system configuration. A simpler method of instruction-level estimation
is done using Artificial Neural Networks (ANN) by Oyamada [70] which uses only the
counts of various instructions; this suffers from the same aforementioned problems with
the additional problem of not being able to include the inter-instruction costs.
Wang et al. [101] take ILPA slightly further and move to instrumenting the source
code to give a clock cycle count of the execution of an application. This is done by
static analysis of the source code which is then appropriately mapped to a user-definable
instruction set for a superscalar processor. Superscalar scheduling is then done via
this static analysis. The cache and branch prediction performance are both simulated
via respective simulators. Similar work was done by Wu [102] in bounding worst-case
execution times for single- and dual-core processors. Ultimately, these methods suffer
from many of the same drawbacks as other ILPA methods: they require extensive
characterization of the target processors as well as detailed information on the
implementation of the underlying architecture. These ILPA methods have the distinct
advantage of being able to predict the performance of any application to be executed
once the expensive initial characterization of the architecture is complete.
Moving away from instruction level methods, Mohsen [61, 62] implements a component-
level model for execution time and power prediction, reminiscent of FLPA methods.
The method works by characterizing a number of different RTL components such as
adders and multipliers; additionally, different implementations of each component are
characterized and the power consumption, delay, and area of each implementation is
stored. These metrics are modeled using simple linear regression using inputs such as
I/O bus sizes and–for more complicated components–number of states in the Finite
State Machine (FSM). However this method suffers from the same issues as that of the
ILPA-based methods, requiring the source code of the application to be implemented
as well as characterization of both components that will and will not be used in the
final design. Further, cache and off-chip memory issues are not considered in this work.
Memory access issues are touched on in later, similar work by Kempf [47].
Still working at the component level, Fei [25] generates a hybrid model for extensible
processors that uses both components and instructions. Energy estimates are obtained by
simulating the instructions and analyzing component activity via dynamic resource usage
analysis. Characteristics of the execution, such as branching statistics, cache misses, ALU
usage, and instruction mixes are taken into account. As this is for an extensible processor,
side effects due to custom instructions and hardware are also considered. Senn [80] takes
a simpler approach and models energy consumption for certain DSP processors as a
function of a small number of algorithm parameters and architecture parameters. The
algorithm parameters are acquired from static code analysis and consist of parallelism
rate, processing rate, cache miss rate, off-chip memory access rate, and the activity
rate between the memory controller and the Direct Memory Access (DMA) controller.
Architecture parameters consist of clock frequency, DSP memory mode, data mapping
mode, the DMA data width, and a power management parameter which represents units
in sleep mode. The work is extended by the same authors [81] to predicting the energy of
a single algorithm as simply a function of its data cache miss rate. While the algorithm
parameters require the source code implementation, the architecture parameters are
done at the specification level. Further, defining an application in terms of its cache
miss rate is a cumbersome approach requiring detailed knowledge of the data locality as
well as cache behavior which may not be accessible to end users.
Another class of methods involves the generalization of results from few cases to many
cases. For instance, Lee [52] predicts the performance (in instructions per second) and
power consumption of a very large design space using only a few design points. This
is done by using both application and architecture parameters. These architecture
parameters consist of specification level parameters such as cache sizes, memory latencies,
arithmetic unit latencies, and numbers of registers. Application parameters are obtained
by running the application on a “baseline” hardware configuration and consist of
parameters such as instructions per second, cache misses, branch statistics, and sources
of pipeline stalls. This approach limits the number of architectures that must be generated
and tested in order to obtain the optimal configuration. However, it can be difficult to
see how the architecture parameters affect the execution of individual applications if
only a single baseline is taken. This work is extended by the same authors [2] which
gives consideration to the effect of voltage scaling on optimization, as well as considering
both in-order and out-of-order processors.
The PACE framework by Nudd [67] and the framework by Snavely [89] attempt to
predict the performance (in instructions per second) for many-core, HPC applications.
The PACE framework works by source-code analysis which translates the application
into a “performance language” called CHIPS, which is instrumented source code that
describes “objects” (or component performance aspects) of the application. Hardware
objects are also defined which represent the hardware aspects that include available CPUs,
caches, and the interconnect abilities of the system. From here, the system attempts
to predict performance through either static or dynamic simulation. Snavely [89]
approaches the problem in a similar way and works by analyzing the performance of
the software kernel on a single-processor cycle-accurate simulator. With this analysis
and a model of the hardware interconnect demands, the performance of the system can be
predicted. Both of these approaches require not only an implementation of the desired
application, but detailed knowledge of the hardware it will be running on.
Ultimately, execution time is predicted at the source-code level, requiring designers to
have an implementation of their desired application before knowing how it will perform.
Further, nearly all methods for execution time prediction require lengthy characterization
of not only the processor used, but of the entire architecture.
The previous two sections have discussed the variety of ways to predict power consumption
and execution time. However, in order to predict accurately, meaningful parameters
of the application must be extracted.
2.2 High-Level Parameter Extraction
An important contribution of this work is the extraction of parameters from the algo-
rithms. Estimation of power consumption and execution time relies on having some
information about the application being executed, therefore these methods must have
some way of obtaining these characteristics.
In general, only the higher level prediction methods are able to provide application-
specific models. ILPA methods use the source code and inter-instruction costs to provide
application-specific predictions without parameter extraction, making them usable for any
application at the expense of being very specific to architecture and system. Conversely,
FLPA methods require characterization of the application to determine its effect on the
functional units of the system; this characterization involves the extraction of application-
specific parameters. When first introduced by Laurent [51], the application-specific
parameters were obtained by static analysis of the source or assembly code. Specific
to DSPs, these parameters are the parallelism rate α, the average number of active
processing units per cycle β, cache miss rate γ, program memory access rate , and data
memory access rate τ . Schneider [79] extended these parameters by separating memory
access rates into read and write access rates, separating program and data cache miss
rates, and including activity rates for dedicated co-processors included in the system as
well as the pipeline stall rate (PSR). Ibrahim [42] further extended these parameters by
including cache access rates. In fact, many FLPA methods use variations and extensions
on these parameters, all of which are obtained through static analysis of the source or
assembly code. FLPA methods are also specific to the particular system and architecture
that was characterized. A change to the components requires the entire FLPA process
to be performed again.
Other than FLPA methods, Enzler [23] approached application-specific power pre-
diction using a microarchitectural approach. Here an application is described using a
data flow graph (DFG) and from this, 23 parameters are extracted. These include the
word length, average fan-in, and the number of inputs, outputs, adders, multipliers,
multiplexers, LUTs, and registers. Also included are the “inherent degree of parallelism”
and the number of iterations of the algorithm. The DFG of an algorithm is slightly more
general than using the actual source code but ultimately depends on the characterization
of the particular architecture and system, similar to the problems with FLPA methods.
Lee [52] uses experiment-based extraction of algorithm parameters in order to quickly
perform design space exploration. Applications are executed on a “baseline” architecture
and execution characteristics are recorded; these are instructions per second, cache
misses, branch rate, branch mispredictions, branch stalls, and the types of pipeline stalls.
This work was extended to allow prediction of applications and how the performance and
modeling method scales with problem size [53]. Though this requires both implementation
of the system and algorithm, it allows for the performance of an application to be predicted
on unseen architectures in a large design space. However, requiring the statistics of a
baseline run of an implemented algorithm means that it would be difficult to provide
designers with a simple framework to provide them with early design guidance.
Ultimately, extraction of high-level parameters has been done but the previous work
largely ignores the relationship between the application and the architecture it is being
executed on. This relationship needs to be easily explored especially for large design
spaces. This ease of exploration includes fast extraction of parameters as well as a fast
evaluation process. In doing this, early estimation of performance can be translated into
design space exploration.
2.3 Design Space Exploration
Modeling is an important part of DSE research as it allows for the architecture space to
be explored without the need to generate all possible architecture combinations of a given
system. Machine learning techniques are popular methods for tackling this problem.
Ipek [44] uses artificial neural networks (ANN) to reduce the size of the architecture
space as well as providing a feature-rich DSE framework. Similarly, Cho [17] uses wavelet-
based ANNs to analyze the workload of various applications in order to make better
microarchitectural design choices. A type of linear regression is used by Hallschmid [39]
to reduce the design space and to evaluate the effects of either one or two parameters in
order to optimize the energy consumption of the processor. Schafer [16] uses various
machine learning techniques combined with genetic algorithm (GA) techniques and
source code analysis to produce a fast method for exploring the architecture design
space.
In addition to modeling-based techniques for design space exploration, there are a
number of techniques which progressively generate architectures in order to tune the
performance of a specific algorithm. Yiannacouras [104] uses the SPREE architecture
to perform application-tuning on soft processor architectures, while Sheldon [84] uses
in-the-loop synthesis to optimize designs in terms of area. Dimond [22] considers the use
of custom instructions and multi-threaded applications in this DSE framework which
uses source code analysis, cycle-accurate simulation, and place-and-routing.
The issue with the above work is that all of these approaches are dependent on the
application being executed. If the application is optimized or changed, the
DSE process must be repeated in order to accommodate the changes to the application.
For the designer, this means more time will be required which will incur additional cost.
Ultimately, this work seeks to improve upon previous work by giving designers
earlier estimates, allowing for exploration of the architecture and, potentially, algorithm
space.
Design space exploration looks at the effects of the architecture choice on the system as
a whole, seeking to find the optimal choices with respect to a metric. Extending
this, the prediction of the effects of single event upsets seeks to consider the effect of
architecture choices on this growing problem. The next section looks at the methods for
handling SEUs and the effect on the system.
2.4 Predicting Effects of Single Event Upsets
Most of the work reviewed here seeks to mitigate the effects of Single Event Upsets
(SEUs) through architectural or application changes; it looks at the cost of mitigation,
which differs from the work presented in this thesis, which looks at the cost of reparation.
The concepts are fundamentally different but both deal with the handling of the effects of
SEUs. Ultimately, either or both will need to be considered in future designs. This review
focuses on previous works that discuss the application- and system-level implications of
SEUs in terms of performance overhead.
SEUs have, historically, been an issue with space and high-altitude applications of
semiconductor circuits. An SEU occurs when the state of a logic element (e.g. a latch or
memory cell) is changed due to the low-level interaction of radiation and the junctions in
transistors. However, SEUs are becoming a larger problem due to the decreasing feature
size of modern electronic devices. The details of this problem will be discussed further
in the SEU section in Chapter 6.
Most of the work dealing with the effects of SEUs seeks to predict the low-level effects
of SEUs. These low-level effects include general faults [5, 56, 71, 98], faults with no
effect [10, 71, 100], the crashing or halting of execution [10], increase in error for iterative
methods [10], and computational errors [106].
Touloupis [98] examines the effects of SEUs between a normal architecture and a
fault-tolerant soft-processor architecture. The authors consider various types of faults
that can result from SEUs, including “no-effect”, “latent”, “wrong result”, “timed
out”, and “exception.” These are all self-explanatory except for “latent” faults. In
this context, a latent fault is one that produces the correct output but the content of
the pipeline registers or register file is corrupted. These faults were examined only in
the pipelined datapath of the LEON2 processor. Other than “no-effect” and “latent” faults,
the others prevent proper execution of an algorithm after an SEU occurs. To deal with
this, the authors implement a data pipeline that is triplicated which uses a comparison
scheme to detect and correct data corruption in the pipeline registers and the register
file. The end result allows recovery from SEUs at the expense of area and performance;
this triplicated pipeline results in an increase in area of 26.6% and a decrease in fmax
(maximum operating frequency) by 23.7%. The detection and correction system in this
scheme relies on the ability to rewrite the contents of registers. SEUs that occur in
the FPGA fabric are more difficult to correct and rely on the ability to reconfigure or
partially reconfigure the device itself.
The effect of SEUs can be dependent on the application being executed. The previously
mentioned work [98] shows that depending on the characteristics of the application,
such as the dependency on, for example, ALUs, the severity of the effects of SEUs can
vary. Similarly, Bronevetsky [10] examines the vulnerability of iterative methods to
SEUs. In addition to typical faults associated with SEUs (no-effect, timed out, and
exception), this work examines the increase in error associated with how data corruption
propagates through successive iterations. Fault tolerance in this work is implemented
via three methods. The first is the set of correctness tests and assertions present in
the software library that was being used. The second compares the residual norms
of the current iteration compared to the previous iterations. The third uses a linear
error correcting code which adds an additional row and column to all matrices that
holds the sum of each row or column. Each of these methods adds a certain amount
of overhead to the execution time and can range from near 0% to 450%, depending on
the parameters of the fault-tolerance method. These methods can also introduce false
positives in SEU detection, having a maximum of 14% false positive rate. Effects of
SEUs on iterative methods were also covered in later work by Shantharam [83] who
used a graph representation to model SEUs. SEUs are detected here but overhead is not
considered.
Lu [56] examines application-specific effects due to SEUs. This work explores the effect that
SEUs have on large clusters built from Commercial Off The Shelf (COTS) components.
In general, the work shows that about 35% of SEUs result in a functional error in the
system due to errors in either registers or messages passing between compute nodes.
These errors in messaging have a large effect, with between 25% and 71% resulting in incorrect
outputs. It concludes that the severity of how SEUs affect different applications can be
attributed to the robustness of the implementation of that application. No quantitative
results are given, but it suggests that the efficacy of SEU mitigation techniques depends
largely on the methods within the application itself; the work cites Silva [86] saying that
the execution time overhead of software-based methods for fault-tolerance averages about
10%.
The work presented in this thesis gives an indication to designers of the vulnerability
of their design and application. This allows them to consider these mitigation techniques
in order to perform cost benefit analysis. In contrast, this work allows designers to see
what effect architecture choices have on the vulnerability of the system to SEUs; in
turn, this allows designers to perform this same cost benefit analysis compared to these
mitigation techniques. Ultimately, the work presented in this thesis allows the designer
to see the actual cost of SEUs (in terms of execution time) and to do so early in the
design.
2.5 Conclusion
A variety of methods to predict power consumption and execution time were reviewed
with a focus on the high-level methods. In particular, ILPA methods characterize the
cost of individual instructions but suffer from being specific to a particular system
configuration and set of architecture parameters. As a consequence, ILPA methods are
difficult to use in a design space exploration application even for a small number of
parameters. FLPA methods have the same drawbacks, but to a lesser degree. Instead,
algorithm parameters used in FLPA methods require complicated analysis of the source
code in order to be extracted. Specification-level methods work at a higher level than
ILPA or FLPA methods, but do not provide a comprehensive view of performance or
resource usage.
High-level prediction methods require high-level parameters extracted from the archi-
tecture and the application. However, the previous work focuses on complicated and
difficult-to-extract parameters from the applications. Further, the relationship between
architecture and application parameters is not fully explored or exploited.
Previous work in design space exploration has shown a wide range of approaches.
The work presented here seeks to provide earlier guidance to designers as well as a more
comprehensive approach, such as providing resource usage details in addition to power
consumption and execution time optimizations.
Finally, the cost of SEUs, either of mitigation or reparation, was explored. This work
seeks to provide the designer with the actual cost of SEUs early in the design where cost
benefit analysis can be performed.
3 Background
This chapter provides necessary background on the chosen modeling technique: regression
trees. It covers how the trees are constructed as well as the advantages of regression
trees over other methods. Regression trees are a machine learning technique that uses
successive splits of data in order to produce predictions. Being applied in a wide variety of
fields [50], machine learning techniques are a flexible class of methods. As many are used
for gathering and processing information, this class of methods offers a number of viable
options in order to create a prediction technique that is able to handle the interactions
of high-level algorithm parameters and architecture parameters. More importantly,
machine learning methods are able to use these interactions to make predictions.
The selection of an appropriate modeling technique depends on a number of factors.
There are two important issues when selecting a modeling technique: one is the linearity
of the relationship of the predictor variables (the input) to the response variable (the
output) and the second is the variable type (either discrete or continuous) of the response
variable.
One assumption of this work is that the relationship between the predictor variables
and the response variable is non-linear. Specifically, the response variable is non-linear
with respect to the previously mentioned architecture parameters [88]. This assumption
is justified by, for example, looking at the non-linear behavior of the cache. Underlying
non-linear factors such as line replacement strategies and memory access times justify
this assumption.
The variable type of the response determines which type of model needs to be used.
A categorical response from a model is called classification, whereas a continuous response
is called regression. In this case, performance metrics such as power and execution time
are continuous variables; therefore, the chosen model must be able to perform regression.
The variable type of predictor variables can also impact the selection of a modeling
technique but to a lesser extent.
A widely used prediction method that can perform regression using both categorical
and continuous data, with a range of benefits over other methods, is the regression tree.
3.1 Classification and Regression Trees
A regression tree uses a succession of binary splits of the data using a set of observations
and their responses. Compared to other modeling techniques, regression trees offer
many benefits. These benefits include flexibility of input data and the ability to model
conditional information. After a regression tree has been constructed, its structure is
transparent. That is, how the model arrives at a prediction can be seen by the user which
gives insight into variable dependencies and effects. A few key shortfalls of regression
trees can be handled by using multiple regression trees called an ensemble or forest.
To begin with, it will be explained how the chosen algorithm constructs a simple
classification tree and then how the same concepts can then be applied to the con-
struction of a regression tree. After this there is a discussion of the advantages and
disadvantages of regression trees followed by a description of the method used to mitigate
the disadvantages.
3.1.1 Building a Classification Tree
The particular classification and regression trees used here (called CART) were introduced
by Breiman [9]. The simplest classification tree is one used for two-class classification
(that is, two possible values of the response variable). Due to their flexibility, nearly
all the same concepts can be taken from a two-class classification tree and applied to
n-class classification, regression, or multivariate outputs. Classification and regression
trees take the form of binary trees; n-ary trees are possible but increase training time
and result in limited benefit [26].
A classification tree is a succession of splits, each of which divides the data at a node into
two disjoint subsets. Once it is determined that further splits are not needed at a node, the particular
node in question is then considered a terminal node. A terminal node for a classification
tree determines the class of that particular terminal subset. There are three important
parts to the formation of a classification tree:
• selecting node splits;
• assigning a class to a terminal node; and
• determining to split a node or terminate the branch.
Arguably the most important part of a classification tree is how to perform the binary
splits on the input data. To approach this problem, an appropriate metric for assessing
the splits must first be discussed. For classification, the approach is to make each split
of a subset produce descendant subsets that are more “pure” than the parent subset.
The impurity of a node is dependent on the proportions of classes in the data associated
with that node. A typical measure of impurity is the Gini index of diversity [32],
\[ I_G(f) = \sum_{i=1}^{n} f_i (1 - f_i) = 1 - \sum_{i=1}^{n} f_i^2 \]
for $n$ total classes and fraction $f_i$ of items labeled with class $i$. This value is at its
maximum when the classes are equally represented and at its minimum (zero) when only one class
is present.
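The behavior at the two extremes can be checked numerically. The following is a minimal sketch, assuming the standard form of the Gini index (one minus the sum of the squared class fractions); the function name `gini_impurity` and the sample labels are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of diversity: 1 - sum_i f_i^2 over the class fractions f_i."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention assumed here: an empty node has no impurity
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Only one class present: the impurity is at its minimum.
pure = gini_impurity(["absent"] * 4)
# Two classes in equal proportion: the impurity is at its two-class maximum.
balanced = gini_impurity(["absent", "present"] * 2)
print(pure, balanced)
```

For two classes the maximum is 0.5; with more classes the maximum rises towards 1 as the classes become more numerous and equally represented.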
A split is defined as a binary test on the value of a single variable from a total of $m$
variables. For a continuous variable $x_j$ where $j \in [1, m]$, the test is of the form $x_j \geq c$,
where $c$ can range over $(-\infty, \infty)$. For a categorical variable, the test is of the form
$x_j \in S$, where $S$ can range over all subsets of possible values for $x_j$. For each variable,
the classification learning algorithm finds the value of $c$ or $S$ which produces the greatest
decrease in impurity from the parent subset to the two descendant subsets. The variable
which has the greatest decrease in impurity is used as the splitting variable for that node.
The total decrease in impurity takes into account the impurity of the current node as
well as the impurity of the two descendant subsets. Specifically, the change in impurity
$I_G$ for a split $s$ in a node $t$ with descendant left and right nodes $t_L$ and $t_R$ is defined as
\[ \Delta I_G(s, t) = I_G(t) - I_G(t_L) - I_G(t_R). \]
The best split $s_B$ for a given node will be the one that maximizes $\Delta I_G(s, t)$ across the
set of all candidate splits $\mathcal{S}$:
\[ \Delta I_G(s_B, t) = \max_{s \in \mathcal{S}} \Delta I_G(s, t). \]
This method of splitting subsets is recursive and continues until a node is reached
where it is decided that no more splits should occur. At this point, the label assigned to
that node is the majority class of the data associated with that node.
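The search for the best split of a single node can be sketched as follows. This is an illustrative sketch rather than the implementation used in this work: it handles continuous variables only, takes the midpoints between consecutive distinct values as candidate thresholds (an assumption), and uses the change in impurity exactly as written above, with the Gini index assumed to be one minus the sum of squared class fractions. Practical CART implementations usually weight each descendant's impurity by the fraction of observations it receives. All function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold_split(xs, labels):
    """Best test x >= c on one variable; returns (decrease, c)."""
    parent = gini(labels)
    best = (float("-inf"), None)
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, labels) if x < c]
        right = [y for x, y in zip(xs, labels) if x >= c]
        decrease = parent - gini(left) - gini(right)  # unweighted, as in the text
        if decrease > best[0]:
            best = (decrease, c)
    return best


def best_split(rows, labels):
    """Search every variable; returns (decrease, variable index, threshold)."""
    best = (float("-inf"), None, None)
    for j in range(len(rows[0])):
        decrease, c = best_threshold_split([row[j] for row in rows], labels)
        if c is not None and decrease > best[0]:
            best = (decrease, j, c)
    return best


# Variable 0 separates the two classes cleanly; variable 1 does not.
rows = [(1, 7), (2, 3), (3, 8), (10, 4), (11, 9), (12, 2)]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(rows, labels))
```

Applying `best_split` recursively to the two descendant subsets, and assigning the majority class once splitting stops, yields the tree described above.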
A common criterion to determine that no more splits should occur is to compare the
maximum decrease in impurity against a set threshold β. That is, splitting stops when
\[ \Delta I_G(s_B, t) < \beta. \]
This, however, leads to the problem of selecting an appropriate value for β. A β that
is too small will lead to a large number of splits and a tree that is too large. Alternatively,
a β that is too large prematurely declares nodes terminal which means the tree can
potentially lose splits that would result in a large decrease in impurity [9].
The most common way of dealing with this problem is to initially grow a large tree
and then prune the tree at the end of construction. This ensures that the final tree
is near-optimal. However, pruning is not used in this work, for reasons that will be
discussed later, so it is not covered further.
From Classification to Regression
The extension from classification to regression is a simple one as most of the concepts
from constructing a classification tree apply to regression trees as well.
The concept stays the same: a succession of splits that divide the data into two disjoint
subsets, until a value can be assigned to the data at terminal nodes. The
three important parts to the formation of a classification tree are modified for regression
trees to
• selecting node splits;
• assigning a value to a terminal node; and
• determining to split a node or terminate the branch.
In the classification problem, the value assigned to a node is the majority class in that
node’s data. For regression, the value assigned to a node is the mean of the response
variable for all data associated with that node.
In classification, the determination of the best splits was related to the total decrease in
impurity from the current node to the two descendant nodes. In regression, the impurity
measure is exchanged for the squared deviation of the node data from the mean of their
response variable. The residual sum of squares $R(t)$ for a node $t$ with response vector $y$ is
\[ R(t) = \sum_{i \in t} \left( y_i - \bar{y}_t \right)^2, \]
where $\bar{y}_t$ is the mean response of the observations in node $t$.
For any split $s$, the change in $R(t)$ is defined as
\[ \Delta R(s, t) = R(t) - R(t_L) - R(t_R). \]
The best split $s_B$ for a given node will be the one that maximizes $\Delta R(s, t)$ across the
set of all candidate splits $\mathcal{S}$:
\[ \Delta R(s_B, t) = \max_{s \in \mathcal{S}} \Delta R(s, t). \]
It is important to note that these statistics are stored for each node. This information
can be used later during prediction: a node with high deviation is less likely to give an
accurate prediction given a test vector.
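The regression split criterion can be illustrated with a short sketch. It assumes that the residual sum of squares of a node is the sum of squared deviations of its responses from their mean, and that candidate thresholds are the midpoints between consecutive distinct values; the function names and data are illustrative, not taken from this work.

```python
def rss(ys):
    """Residual sum of squares of the responses around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_regression_split(xs, ys):
    """Best test x >= c on one variable; returns (decrease in R, c)."""
    parent = rss(ys)
    best = (float("-inf"), None)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x < c]
        right = [y for x, y in zip(xs, ys) if x >= c]
        decrease = parent - rss(left) - rss(right)
        if decrease > best[0]:
            best = (decrease, c)
    return best


# The response jumps between x = 2 and x = 3, so the best threshold lies there.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 5.0, 5.0]
print(best_regression_split(xs, ys))
```

The per-node value of `rss` is the kind of statistic that can be stored with each node and consulted at prediction time as an indication of how much the responses in that node vary.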
The last issue is determining when to stop splitting a given node. As with classification,
the problem is determining the correct criterion to halt the growth of the tree. A
common criterion is to grow the tree to maximum size with the constraint that all
terminal nodes are above a minimum size. A node’s size is defined as the number of
observations associated with that node. After the tree is grown, it is then pruned as in
the classification case.
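The growth procedure with a minimum node size can be sketched as a short recursive function. This is a simplified illustration under stated assumptions: variables are continuous, thresholds are midpoints between distinct values, a node is also declared terminal when no split reduces the residual sum of squares, and the pruning step is omitted. The dictionary-based node layout and the names `grow` and `predict` are hypothetical.

```python
def mean(ys):
    return sum(ys) / len(ys)


def rss(ys):
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def grow(rows, ys, min_size=2):
    """Grow a regression tree; every terminal node keeps >= min_size observations."""
    node = {"value": mean(ys), "n": len(ys), "rss": rss(ys)}
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            c = (lo + hi) / 2.0
            left = [i for i, row in enumerate(rows) if row[j] < c]
            right = [i for i, row in enumerate(rows) if row[j] >= c]
            if len(left) < min_size or len(right) < min_size:
                continue  # this split would create a node below the minimum size
            decrease = (node["rss"] - rss([ys[i] for i in left])
                        - rss([ys[i] for i in right]))
            if best is None or decrease > best[0]:
                best = (decrease, j, c, left, right)
    if best is None or best[0] <= 0:
        return node  # terminal node: predicts the mean response of its data
    _, j, c, left, right = best
    node["var"], node["threshold"] = j, c
    node["left"] = grow([rows[i] for i in left], [ys[i] for i in left], min_size)
    node["right"] = grow([rows[i] for i in right], [ys[i] for i in right], min_size)
    return node


def predict(node, x):
    """Follow the binary tests from the root down to a terminal node."""
    while "var" in node:
        node = node["right"] if x[node["var"]] >= node["threshold"] else node["left"]
    return node["value"]


rows = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,), (8.0,)]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
tree = grow(rows, ys, min_size=2)
print(predict(tree, (5.5,)), predict(tree, (100.0,)))
```

Raising `min_size` produces a smaller tree with coarser predictions; lowering it lets the tree follow the training data more closely.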
3.2 Advantages
In the context of high-level estimation and design space exploration, regression trees will
be shown to be well-suited to the task. One of the important aspects is the transparency
of the tree; by viewing the tree structure, the more important variables and their
relationships to other variables can be seen. While this is useful in many modeling
applications, it is especially useful here because it allows the designer to see the exact
parameters that are affecting the performance of the system. Regression trees also have
the ability to handle conditional information in the structure of the data and variables; this is
due to each split being calculated separately without influence from the parent node.
A classification tree for a common data set (named kyphosis) is shown in Figure 3.1.
The ellipse-shaped nodes are decision nodes and the square-shaped nodes are terminal
nodes (or leaves). Inside every node the distribution of classes is shown with the label
being the majority class. The edges connecting the nodes indicate the variable on which
the split is made.
Figure 3.1: Sample classification tree (classification tree for the kyphosis data set)
By examining this tree, important information of the data can be deduced. The
data set this tree is modeling contains 81 samples of three predictor variables. The
response variable is a binary response indicating the presence of kyphosis (a form of
spinal deformation) after the subjects underwent spinal surgery.
The three predictor variables are
• Age of the subject in months
• Number of vertebrae involved in surgery
• Start – the number of the topmost vertebra operated on.
Looking at the root node, it shows that there are 64 cases where kyphosis is not present
and 17 where it is. The first split is on the Start variable. Going to the right, it shows
that any operation starting between the 1st and 8th vertebrae typically results in the
development of kyphosis, regardless of other factors. Conditional information in the
model can also be seen in the second split. Any operation involving 15 or more vertebrae
typically results in the absence of kyphosis. However, anything greater than a value of 9
for Start but less than 14 requires other factors to predict the presence of kyphosis.
Note here that terminal node statistics give an indication as to the accuracy of any
predictions using that node. For example, all of the terminal nodes which have the label
“absent” have low impurity. Alternatively, the terminal nodes labeled with “present”
have significantly higher impurity values. This suggests that the level of impurity can
give an indication of the probability of misclassification of a test input vector. That is, a
prediction which falls on a terminal node with high impurity has a greater chance of
being misclassified than a prediction that falls on a terminal node with low impurity.
This example highlights another important use of classification and regression trees
which is showing the structure of the data. While regression trees are typically used for
prediction, in the kyphosis example the predictive accuracy of the model is of secondary
concern. What is useful to the user is examining the effect that the parameters have
on the response variable. In this case, that means understanding the potential risks of
surgery to patients.
Another advantage that regression trees have over other modeling methods is the
flexibility of the input data. Unlike parametric models, regression trees have the
important characteristic of being invariant to monotone transformations of the predictor variables; the same tree
will be constructed regardless [21]. This is in contrast to, for example, linear regression
where a monotonic transformation of the data can greatly affect the model parameters.
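This invariance can be checked on a small example. The sketch below assumes a single continuous predictor, a residual-sum-of-squares split criterion and midpoint thresholds, and compares the partition chosen on the raw predictor with the partition chosen after a logarithmic transformation; the helper `best_partition` and the data are hypothetical.

```python
import math


def rss(ys):
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_partition(xs, ys):
    """Indices of the observations sent to the left node by the best split."""
    parent = rss(ys)
    best = (float("-inf"), None)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0
        left = frozenset(i for i, x in enumerate(xs) if x < c)
        left_ys = [ys[i] for i in range(len(ys)) if i in left]
        right_ys = [ys[i] for i in range(len(ys)) if i not in left]
        decrease = parent - rss(left_ys) - rss(right_ys)
        if decrease > best[0]:
            best = (decrease, left)
    return best[1]


xs = [1.0, 10.0, 100.0, 1000.0, 10000.0]
ys = [2.0, 2.1, 2.2, 7.9, 8.0]

raw = best_partition(xs, ys)
logged = best_partition([math.log(x) for x in xs], ys)
print(raw == logged)
```

A monotone transformation changes the numerical value of the threshold but not the ordering of the observations, so the set of candidate partitions, and therefore the chosen one, is unchanged.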
3.3 Disadvantages
Regression trees suffer from a number of problems that harm their credibility as predictors
if unresolved. These problems are extrapolation, issues with certain response
variable structures, instability, local minima, and overfitting. They stem from
characteristics of the data; data that lacks homoscedasticity, has
correlated variables, or contains noise all contribute to these problems [9]. Some of these
problems can be mitigated by modifying the modeling approach slightly while others
depend on the structure of the training data.
An important limitation that must be recognized is that regression trees, though
accurate when interpolating, cannot perform extrapolation; the limits of the response
variable are defined by the input data. To overcome this, the training data must contain
samples representing the entire range of values that are of interest.
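The extrapolation limit follows directly from terminal nodes predicting the mean of their training responses. A minimal sketch with a single-split tree (a stump), fitted to a hypothetical linear response y = 2x, illustrates this; the function `fit_stump` and the data are assumptions made for the example.

```python
def fit_stump(xs, ys):
    """One split minimizing the summed squared error; leaves predict their mean."""
    def mean(values):
        return sum(values) / len(values)

    def rss(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        c = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x < c]
        right = [y for x, y in zip(xs, ys) if x >= c]
        cost = rss(left) + rss(right)
        if best is None or cost < best[0]:
            best = (cost, c, mean(left), mean(right))
    _, c, left_mean, right_mean = best
    return c, left_mean, right_mean


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # training responses span 2 to 12
c, left_mean, right_mean = fit_stump(xs, ys)


def predict(x):
    return right_mean if x >= c else left_mean


# Far outside the training range the true value would be 200,
# but the tree can only return the mean of one of its leaves.
print(predict(100.0))
```

Deeper trees refine the predictions inside the training range but remain bounded by the smallest and largest training responses.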
Classification and regression trees have difficulty modeling a response variable with
certain structures. A well-known issue for classification trees is their inability to efficiently
model the exclusive-or (XOR) function [92]. For regression, data with a structure that
is linear (or additive) requires many splits as opposed to being easily modeled using
other methods. This limitation is not an issue with the data used here as the underlying
structure is non-linear due to many interactions in terms of caching, arithmetic hardware,
and various algorithm parameters.
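The XOR difficulty can be seen by evaluating the first split directly. In the sketch below, which assumes the Gini index as the impurity measure, either single-variable split of the four XOR points leaves both descendant nodes exactly as impure as the parent, so a greedy learner has no reason to prefer one split over another.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# The four XOR points: the class is 1 when exactly one input is 1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

parent = gini([label for _, label in points])
child_impurities = {}
for j in (0, 1):
    left = [label for x, label in points if x[j] < 0.5]
    right = [label for x, label in points if x[j] >= 0.5]
    child_impurities[j] = (gini(left), gini(right))

print(parent, child_impurities)
```

Only after an uninformative first split do the second-level splits separate the classes, which is why the function is modeled inefficiently.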
One of the other issues of note is the instability of regression trees. For instance,
trees constructed from two slightly different data sets can be very different.
This is due to the recursive partitioning of the data: changes in the splits high up in the
tree will produce very different branches further down the tree [99]. Because of this,
care must be taken when using regression trees as predictors as two different trees
can have different characteristics in terms of prediction behavior. However, there is a
solution to this problem that will be discussed later.
The successive splits iteratively divide the data in an effort to continuously improve
the training prediction accuracy of the tree. This is a greedy approach, which tends
to find local minima and thus cannot guarantee a globally optimal
tree. Similarly, classification and regression trees, if not treated properly, will overfit the
data. Overfitting happens when a tree is grown too large, which, though providing low
training error, causes the generalization ability of the tree to diminish. Common in most
types of modeling, the overfitting problem can be handled by pruning the tree or using
cross-validation. However, another method of dealing with the overfitting problem and
other problems listed here is to use a collection of regression trees called an ensemble
or forest.
3.4 Regression Forests
Regression trees possess a number of shortfalls that make them undesirable as predictors.
The three biggest issues are that regression trees are unstable, they are only guaranteed
to find local minima, and they will tend to overfit unless grown to the correct size. A
common method of dealing with these issues is to use an ensemble method which is a
collection of multiple models.
The general idea behind ensembles is that each model within the ensemble is diverse
from the others. By having a collection of diverse models, stability and accuracy are
improved [12]. Ensemble methods cover a wide range of techniques but the one focused
on for this work is known as bagging.
Other ensemble methods exist, namely random forests and boosting.
Random forests [8] create trees from a random subset of parameters using bootstrapping.
They are unsuitable for this case as the relationships between multiple variables are important to
producing low-error predictions. Boosting [31] (specifically least-squares boosting in
the regression case) is not suitable in this context for similar reasons.
Boosted regression trees are typically small trees which use a subset of predictor variables
and thus lose many important variable interactions that are key to the accuracy of the
model.
Bagging, or Bootstrap Aggregating, was first introduced by Breiman [7]. Using an
initial learning data set $L$, bagging works by generating a set of $k$ models where each
model is constructed using a separate learning data set, resulting in the sequence $\{L_k\}$.
Each $L_k$ is generated by randomly sampling $L$ uniformly and with replacement. Therefore,
an observation from $L$ may not appear in a given $L_k$ or may appear multiple times. The response
value of the ensemble for an input vector $x$ is the average response value of all predictors
using $x$ as input. By applying this concept to the generation of many regression trees,
the prediction error of the model is decreased. Breiman [7] explains that bagging generally
increases the accuracy of unstable predictors. By using bagging, the instability inherent in single
regression trees is decreased, as is the prediction error. It is also shown that bagging
generally improves prediction performance on predictors that tend to overfit, another
criticism of regression trees.
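The bagging procedure can be sketched in a few lines. For brevity, the example below uses single-split trees (stumps) as the base models, whereas this work uses fully grown regression trees; the function names, the seed, and the data are assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Base model: one split on a single variable; leaves predict their mean."""
    def mean(values):
        return sum(values) / len(values)

    def rss(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        c = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x < c]
        right = [y for x, y in zip(xs, ys) if x >= c]
        cost = rss(left) + rss(right)
        if best is None or cost < best[0]:
            best = (cost, c, mean(left), mean(right))
    if best is None:  # bootstrap sample with a single distinct x value
        constant = mean(ys)
        return lambda x: constant
    _, c, left_mean, right_mean = best
    return lambda x: right_mean if x >= c else left_mean


def bagged_predictor(xs, ys, k=25, seed=0):
    """Train k base models on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(k):
        # Sample n observations uniformly with replacement: some observations
        # appear several times in this learning set, others not at all.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(model(x) for model in models) / k


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
predict = bagged_predictor(xs, ys, k=25, seed=0)
print(predict(4.5))
```

Because each bootstrap sample places its split slightly differently, the averaged prediction varies more smoothly with the input than that of any single base model.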
The effect of using multiple trees to provide predictions mitigates the problem of
instability. The structure of a single tree becomes less important as all trees have a
different structure and each contributes to the predictions equally. A consequence of
this is that the direct insight provided by a single regression tree is lost. Additional insight is
gained, however, by the examination of all the trees in the ensemble; in doing so, all
variable splits and the subsequent decrease in error can be seen, which shows how each
predictor variable contributes to the overall accuracy of the ensemble across all trees.
Using multiple trees also deals with the overfitting issue in an unintuitive way:
each tree is encouraged to overfit the training data it has been given through repetition of
training observations and the lack of pruning on the final tree. It has been shown by
Breiman [7] that this approach improves the generalization ability of a learned model,
especially in a bagging approach. Finally, a regression forest promotes movement towards
the global minimum by ensuring diversity among learners.
4 Extraction of High-Level Parameters
This chapter describes the choice of high-level parameters of algorithms to be used in the
construction of predictive models for power consumption and execution time. First, the
image compression algorithms used to create the models are discussed briefly followed by
a more in-depth examination and how the algorithms were decomposed into basic blocks.
Then, the architecture and algorithm parameters used in this work are discussed.
4.1 Algorithms Used
This section describes the chosen domain of image compression, a brief overview of the
chosen algorithms, and the method and results of the decomposition of these algorithms.
4.1.1 Domain Selection
As the main goal of the work is to be able to predict the performance of algorithms
early in the design phase for a specific type of processor, a constraint must be put on
the algorithm space in order to be able to extract high-level parameters. This constraint
is that the algorithms chosen must have similar characteristics which means that all
algorithms should come from the same domain.
For this work, the domain of image compression was chosen. Image compression
algorithms, though diverse, share a set of common characteristics that allow high-level
parameters to be used that can capture the individual behavior of each algorithm. These
characteristics include domination by memory-accesses and simple arithmetic operations.
At the same time, there are some requirements that allow this domain to be predicted.
These are deterministic memory accesses and distinct phases of execution (this is so
blocking can be used, which is discussed later).
High-level parameters are an abstraction of aspects of these algorithms and therefore
lose the low-level details. Because of this, the framework can only be applied to
domains in which the differences between the algorithms can be represented at a high
level.
A wide variety of algorithms were used to construct the model. These algorithms
include those that use the concept of transform coding (i.e. a transform stage followed
by quantization and entropy coding) and consist of JPEG [43], JPEG 2000 [68],
JPEG XR [45], and WebP [35]. Also included in this work are less widely used methods
which consist of vector quantization [55] and quad-tree fractal compression [27]. In
the past, these were not popular due to high computation costs; due to increasing
compute speeds and more ubiquitous parallel processing, these methods are experiencing




The widely used JPEG standard consists of a transform stage and entropy encoding stage.
Beginning with the Discrete Cosine Transform (DCT), the input image is transformed
into the frequency domain. From here, the resulting coefficients are quantized and sent
to the entropy encoder. The standard permits the use of arithmetic coding, but Huffman
coding is the most commonly used entropy encoder and is the one used here.
JPEG 2000
A successor to JPEG, JPEG 2000 improves upon the problems that existed in JPEG
as well as adding more features. The JPEG 2000 standard uses the Discrete Wavelet
Transform (DWT), which improves over the DCT by including spatial information in
addition to frequency information. After the transformation, the coefficients are sent
to the block coding method called Embedded Block Coding with Optimal Truncation
(EBCOT). EBCOT scans DWT coefficients in an adaptive manner and then performs
entropy encoding using a version of arithmetic coding.
JPEG XR
The most recent addition to the JPEG group of algorithms is called JPEG XR.
JPEG XR uses the Photo Core Transform (PCT) which is an approximation of the
DCT. The PCT uses a combination of cascaded 2-by-2 Walsh-Hadamard transformations
followed by a number of one-dimensional and two-dimensional rotation operations. The
block coding for JPEG XR uses adaptive scanning of transform coefficients and adaptive
Huffman encoding.
Vector Quantization
Vector quantization (VQ) is a widely used method with applications in many areas,
including image compression. VQ is straightforward as it represents the original vectors
of the image with a smaller, representative set of vectors called the codebook. This
makes the compression ratio deterministic for VQ whereas the reconstruction error of
the compression is data dependent.
Quad-Tree Fractal Compression
Fractal compression, first patented by Barnsley [4], uses a mathematical construct
called an iterated function system. It was later extended by Fisher [27] to include a
method of partitioning called quad-tree partitioning. The quad-tree representation of
an image can be thought of as a tree-like structure with the original image as the root.
Each node of the tree corresponds to a square portion of the image that contains four
child nodes that correspond to the four quadrants of that square. The structure of this
tree depends on the image and is constructed during the encoding process. This set of
nodes (or ranges) is mapped to the domains, which are taken from the same image.
The mapping is an affine transformation which minimizes the root-mean-squared (RMS)
difference between the transformed domain pixel values and the range pixel values. The
coefficients of this transformation and the optimal domain are stored in order for the
range to be compressed.
WebP
Developed by Google, WebP [35] was introduced as a competitor to JPEG that would
allow for more flexibility in the encoding and decoding process. At its core, WebP is a
key frame encoder for video compression taken from the VP8 standard. Being designed
as an internet-friendly compression algorithm, most of the processing is done on the
encoding side to allow for fast decompression on the web browser side.
4.1.3 Algorithm Decomposition
This section describes how each of the previously discussed algorithms is decomposed
in order to create a predictive model.
It must be noted that issues concerning quality and compression ratio are not considered
here. Image compression methods have a large number of parameters that can be changed
to achieve the best quality and compression ratio based on the characteristics of the
image and available computing resources. Constructing a model that takes into account
all of these parameters would create a large parameter space to explore due to the many
implications of these issues.
For compression ratio and reconstruction quality, parameters were chosen such that the
algorithm achieved a median result with respect to its own capabilities, unless otherwise
stated.
A single image was used in all of the algorithms. It is a randomly chosen, 8-bit
grayscale image from the FERET face data set with a size of 128 by 128 pixels, shown
in Figure 4.2. Although image compression techniques can vary in performance between
different images, an assumption here is that this variation would not be enough to
significantly change the power consumption or execution time, especially at this size of
image.
To test this assumption, JPEG, JPEG 2000, fractal compression, and WebP were used to
compress three different images at sizes of 128 by 128 pixels and 512 by 512 pixels.
Table 4.1 shows the execution time required to compress each of the three images 10,000
times.
(a) Image 1 (b) Image 2 (c) Image 3
Figure 4.1: Images used for testing compression times
The table shows that while the compression times for different images using the same
algorithm can differ, at most, by 48%, larger images differ by up to 300%. This is an
area of study for the future, but ultimately the choice of image is not a significant issue.
The key idea of the proposed framework is to break the algorithms into their basic
blocks rather than using the entire algorithm. The reason for this is two-fold: first,
compression algorithms are too complicated to be represented efficiently in a high-level
manner. Second, being able to combine blocks allows the algorithms to better represent
the algorithm parameter space; that is, instead of having two algorithms (e.g. JPEG
Table 4.1: Compression times for different algorithms over three images
Algorithms Image Number


















Figure 4.2: Test image used in data collection
and JPEG 2000), four combinations are created by using the combination of the two
transforms and the two coding methods.
The splitting of the algorithms was done in a high-level manner so that each algorithm
was broken into its two basic blocks. The split was done manually such that each block
was both a logical and computational component of the original algorithm. Another
constraint on the split is that both blocks must be sequential. That is, in one execution
of an algorithm, block “A” is executed followed by the execution of block “B” without
interleaving execution of the two blocks.
The rest of this section describes the blocks used in the modeling process. The details
of each block will be discussed with focus being placed on the aspects of the computation
that will relate to the soft-processor architecture.
Discrete Cosine Transform
Arguably the most widely used image compression transform, the Discrete Cosine
Transform (DCT) is the transform used in the original JPEG standard. The DCT is
used to transform the image (or blocks of the image) into the frequency domain; from
there, the highest frequency components are removed and the rest of the coefficients are
quantized and encoded. For this block, only the transform and quantization stages are
considered.
From the JPEG standard, the three DCT variants used were
• slow integer,
• fast integer,
• and floating point.
The implementations for these variants were adapted from those in the libjpeg library [43].













for k = 0, ..., N − 1.
However, all of the variants perform some amount of optimization to reduce the number
of repeated calculations. The most notable is performing two separate one-dimensional
DCTs instead of the more complicated two-dimensional DCT. Another is the removal of
the calculation of the cosine function and using a look-up table instead. Additionally,
many of the redundant multiplications are removed so the majority of the operations for
the DCT are memory accesses along with additions.
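The first two optimizations can be illustrated with a short sketch. It is not the libjpeg code: it uses an unnormalized DCT-II, floating-point arithmetic, and hypothetical function names, and it only shows that two passes of a one-dimensional DCT driven by a cosine look-up table give the same result as the direct two-dimensional sum.

```python
import math

N = 8
# Look-up table of cosine terms, computed once instead of calling cos() per sample.
COS = [[math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N)]
       for k in range(N)]

def dct_1d(v):
    """Unnormalized one-dimensional DCT-II of a length-N sequence."""
    return [sum(v[n] * COS[k][n] for n in range(N)) for k in range(N)]

def dct_2d_separable(block):
    """Two-dimensional DCT as a row pass followed by a column pass."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [u][v] order

def dct_2d_direct(block):
    """Direct two-dimensional sum, used here only as a reference."""
    return [[sum(block[y][x] * COS[u][y] * COS[v][x]
                 for y in range(N) for x in range(N))
             for v in range(N)] for u in range(N)]

block = [[(3 * y + 5 * x) % 17 for x in range(N)] for y in range(N)]  # arbitrary data
separable = dct_2d_separable(block)
direct = dct_2d_direct(block)
```

The separable form needs on the order of 2N^3 multiply-accumulate operations per block compared with N^4 for the direct sum, which is why all of the variants use it.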
The slow integer and floating-point variations perform the exact same operations
but with different levels of arithmetic precision. The fast-integer variant is a “considerably
less accurate” [43] approximation of the slow-integer variant.
Traditionally the DCT works on blocks of 8-by-8 pixels; a DCT using blocks of 4-by-4
pixels is also included in the construction of the model. Because each block is
processed separately, only a small cache is needed to hold the working set while a
block is being processed (64 pixels × 8 bits/pixel for an 8-by-8 block). Therefore the
DCT relies primarily on fast arithmetic hardware for fast execution.
The decomposition of the JPEG algorithm is shown in Figure 4.3. The figure shows
the high-level stages of JPEG along with the relevant algorithm parameters, connected by
dashed lines. It shows the images as blocks which are transformed then quantized. This
is followed by the Huffman coding algorithm which consists of the frequency calculation
of the coefficients followed by the construction of the Huffman tree.
Figure 4.3: Example JPEG decomposition with relevant algorithm parameters
Huffman Coding
For the coding of coefficients from the DCT, the JPEG standard uses the method of
Huffman coding. Originally published by David Huffman [41], Huffman coding constructs
a tree based on the estimated frequencies of the symbols to be coded. Though only
optimal when the probabilities of input symbols are known (and are negative powers of
2), Huffman codes are fast to construct and require few arithmetic operations.
A Huffman tree, unlike the similar Shannon-Fano [24, 82] coding method, is built from
bottom to top. Each leaf of the tree corresponds to a symbol. The algorithm starts with
a list of symbols sorted by their probabilities in descending order. To construct
a tree:
1. From the sorted list, take the two symbols with the smallest probabilities.
2. Add these to the top of the partially constructed tree and remove from the sorted
list.
3. Replace those items with an auxiliary symbol that represents both removed items.
4. Repeat until there is a single symbol left on the sorted list.
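The construction steps above can be sketched as follows. This is an illustrative version rather than the libjpeg implementation: a binary heap replaces the sorted list so that the two least probable items can be removed cheaply, symbols are assumed not to be tuples, and the function name and example probabilities are made up for the example.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Return {symbol: bit string} for a dict {symbol: probability}."""
    tiebreak = count()  # unique counter so tied probabilities never compare nodes
    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Steps 1-2: remove the two items with the smallest probabilities.
        p1, _, node1 = heapq.heappop(heap)
        p2, _, node2 = heapq.heappop(heap)
        # Step 3: replace them with an auxiliary node representing both.
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (node1, node2)))
    codes = {}

    def assign(node, prefix):
        if isinstance(node, tuple):  # auxiliary node: descend into both children
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:  # leaf: a symbol
            codes[node] = prefix

    assign(heap[0][2], "")
    return codes

codes = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
```

The example probabilities are negative powers of 2, so each code length equals the negative base-2 logarithm of the symbol's probability. Apart from the additions of probabilities, the work consists of moving and comparing list entries, which matches the observation that the block depends mainly on memory behavior.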
Apart from the probability calculation, Huffman coding contains very few arithmetic
calculations. Consequently, Huffman coding is highly dependent on cache size and
behavior for fast execution.
Discrete Wavelet Transform
The successor to the JPEG standard, the JPEG 2000 standard uses the Discrete Wavelet
Transform (DWT) as its base transform. It is applied in the same way as the DCT:
coefficients from the transform are quantized then encoded. The largest difference is that
the DWT uses wavelets whose main advantage is that spatial information is encoded in
addition to frequency information.
From the JPEG 2000 standard, the two variants of the DWT are defined as:
• Irreversible (lossy using CDF 9/7 wavelet)
• Reversible (lossless using CDF 5/3 wavelet)
The implementations of these variants were adapted from those in the OpenJPEG
library [68].
The wavelet transform works by convolution or by a hardware-friendly method called
the lifting scheme [93], which is used by these two variants. In essence, the lifting scheme
performs convolution “in-place” which saves memory and also reduces the number
of operations. Like convolution, lifting consists of mainly the multiply-accumulate
operation.
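As an illustration of the lifting scheme, the sketch below applies one level of a reversible 5/3 transform to a one-dimensional integer signal. It is not the OpenJPEG code: it assumes the commonly stated predict and update steps with symmetric boundary extension, works on separate lists rather than in place for readability, and uses hypothetical function names.

```python
def lift_53_forward(x):
    """One level of reversible 5/3 lifting on an even-length list of integers."""
    assert len(x) % 2 == 0 and len(x) >= 2
    even, odd = x[0::2], x[1::2]
    m = len(odd)
    # Predict step: high-pass coefficients are the odd samples minus the
    # average of their even neighbours (right edge mirrored).
    d = [odd[n] - ((even[n] + even[min(n + 1, m - 1)]) >> 1) for n in range(m)]
    # Update step: low-pass coefficients are the even samples corrected by
    # the neighbouring high-pass coefficients (left edge mirrored).
    s = [even[n] + ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(m)]
    return s, d

def lift_53_inverse(s, d):
    """Undo lift_53_forward by reversing the two steps in the opposite order."""
    m = len(s)
    even = [s[n] - ((d[max(n - 1, 0)] + d[n] + 2) >> 2) for n in range(m)]
    odd = [d[n] + ((even[n] + even[min(n + 1, m - 1)]) >> 1) for n in range(m)]
    x = [0] * (2 * m)
    x[0::2], x[1::2] = even, odd
    return x

signal = [3, 7, 1, 8, 2, 9, 4, 6]  # arbitrary example data
low, high = lift_53_forward(signal)
restored = lift_53_inverse(low, high)
```

Because the inverse subtracts exactly the integer quantities that the forward steps added, the transform is lossless regardless of how the shifts round, which is the property that makes this variant reversible.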
Unlike the DCT, the DWT performs a dyadic decomposition of the image. After the
initial DWT is performed, the image is decomposed into four separate subbands, which
are combinations of low- and high-resolution horizontal and vertical images. This
decomposition and the subsequent transform are applied recursively to the subband that
is low resolution in both the horizontal and vertical components. The process
can be seen in Figure 4.4. The number of times this decomposition is performed is called
the decomposition level. For larger images, higher levels of decomposition are needed;
for this work, only two levels of decomposition were used.
Figure 4.4: 6-level DWT dyadic decomposition
Embedded Block Coding with Optimal Truncation
Embedded block coding with optimal truncation (EBCOT) is the block coding method
used by JPEG 2000 which differs slightly from the original version introduced by
Taubman [95]. The JPEG 2000 version of EBCOT used here is adapted from the
implementation in the OpenJPEG library [68].
The core of EBCOT is a specific arithmetic coder called the MQ coder. The MQ
coder differs from other arithmetic coders as it encodes bits rather than whole numbers.
EBCOT works by encoding the coefficients of the DWT starting from the most-significant
bitplane to the least-significant using three separate passes. For each bit to be encoded,
the context and probability are needed. The context of a bit is determined by the
significance of its 8 coefficient neighbors. A coefficient is determined to be significant
if the bit in the current bitplane is a 1 or it was determined to be significant in the
previous pass. Depending on which of its neighbors are significant, the probability of a
bit is determined by a probability estimation table. Once the context and probability of
a bit are determined, the bit is sent to the MQ coder.
Even for small images, EBCOT requires a large number of memory accesses as it must
access the coefficients of the transform many times for each of the passes. A consequence
of this is that it is very reliant on cache size and behavior in order to have short execution
times.
Photo Core Transform
The Photo Core Transform (PCT) is from the recent JPEG XR standard, which began as a
Microsoft development under the name of “HD Photo.” The PCT can be thought of as an
approximation of the DCT using a version of the lifting scheme. Used here is the
implementation adapted from the JPEG XR standard reference implementation [45].
The implementation of the PCT is done using a combination of cascaded 2 by 2
Walsh-Hadamard transformations followed by a number of one-dimensional and two-
dimensional rotation operations. Consequently, there are few arithmetic operations to
be completed and these mainly consist of additions or subtractions.
The JPEG XR transform stage is designed to address some of the shortcomings of
the original JPEG algorithm. A main criticism of JPEG is the “blocking” artifacts
commonly experienced at its lower-quality compression levels. To reduce these, the
image is first divided into macroblocks of 4-by-4 blocks, where each block has a size of
4-by-4 pixels. The PCT is performed on each individual block, followed by another
application of the PCT on the DC components of the blocks within a macroblock. This
decreases the perceptual effect of blocking artifacts.
JPEG XR Block Coding
The block coding method used by JPEG XR works by adaptively scanning coefficients of
the Photo Core Transform. The actual scan order depends on the probability of having
a non-zero coefficient which is determined by previous values in other blocks. This is in
contrast to JPEG’s fixed “zig-zag” coefficient scanning.
After the coefficients are scanned, the coefficients that are zero are encoded via run-
length encoding. The non-zero coefficients are encoded by an adaptive Huffman encoding
method. This results in an entropy encoding method that is more computationally
demanding than that of JPEG but is also more flexible.
JPEG XR coding, like Huffman coding, is reliant on memory accesses and therefore
needs large caches to have shorter execution times. Further, as the adaptive Huffman
coding relies on updating probabilities, this coding method has a greater reliance on
arithmetic hardware than simple Huffman coding.
Vector Quantization
Compression using vector quantization is a method for representing the initial set of
vectors using a smaller set of prototype vectors or codebook vectors. This process involves
two distinct steps. The first is codebook generation and the second is determining which
codebook vector to assign to an image block. Vector quantization was split into the
initial codebook generation stage (using the Generalized Lloyd Algorithm) and the final
distance calculations between the codebook and the blocks to be encoded.
The Generalized Lloyd Algorithm (GLA) [55] consists of the iterative steps listed
below [77]. The algorithm can terminate either after a fixed number of iterations or
when there is little improvement in the quality of the codebook entries. For this work,
GLA was run for a fixed number of iterations as this allows for fixed algorithm
parameters, as seen later.
0. Initialize codebook vectors.
• Random or
• Subset of initial training vectors
1. Determine set of partitions.
• Partition: A codebook entry and its “closest” training vectors
2. Calculate mean distortion within each partition.
• Mean distortion of each training vector and its codebook entry
3. Calculate final distortion and test termination conditions.
• Final distortion is mean distortion of all partitions
• Termination conditions: number of iterations or distortion threshold
4. If conditions are not met, return to Step 1 and generate new codebook vectors.
• New codebook entries are the mean of all vectors in a partition
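The steps above can be sketched for the fixed-iteration case used in this work. This is a simplified illustration and not the QccPack implementation: it uses squared Euclidean distance, reports a single mean distortion over all training vectors instead of first averaging within each partition, leaves a codebook entry unchanged if its partition is empty, and uses hypothetical names and example data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def generalized_lloyd(training, codebook, iterations):
    """Run a fixed number of GLA iterations; return (codebook, mean distortion).

    The distortion returned is the one measured at the start of the final
    iteration, before the last codebook update.
    """
    codebook = [list(c) for c in codebook]  # Step 0: caller supplies initial entries
    distortion = None
    for _ in range(iterations):
        # Step 1: assign every training vector to its closest codebook entry.
        partitions = [[] for _ in codebook]
        total = 0.0
        for v in training:
            dists = [squared_distance(v, c) for c in codebook]
            best = dists.index(min(dists))
            partitions[best].append(v)
            total += dists[best]
        # Steps 2-3: mean distortion (simplified to one overall mean).
        distortion = total / len(training)
        # Step 4: each new entry is the mean of the vectors in its partition.
        for i, members in enumerate(partitions):
            if members:
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook, distortion

training = [(0, 0), (0, 2), (10, 10), (10, 12)]  # illustrative training vectors
codebook, distortion = generalized_lloyd(training, [(0, 0), (10, 10)], iterations=3)
```

Every iteration computes a distance from each training vector to each codebook entry, so the work grows with the product of the number of iterations, the codebook size, and the number of training vectors.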
There are two parameters that can be set for vector quantization which affect quality,
compression ratio, and computational requirements. These are the codebook size N_CB
and the vector size. A larger codebook size will provide better quality, a lower compression
ratio, and a more computationally expensive compression process. A larger vector size
will provide a better compression ratio, generally lower quality, and a reduced number
of computations.
The part of GLA for calculating new codebook entries is similar to k-means clustering.
The consequence of this is that there is a large number of arithmetic and memory
operations which makes vector quantization heavily reliant on fast arithmetic hardware
and large cache sizes for fast execution.
Codebook generation using GLA is used as an “A” block whereas the final distance
calculations between image blocks and the codebook are used as a “B” block. The
implementations for both blocks were adapted from those in QccPack [30].
Quad-Tree Fractal Compression
Fractal compression was first patented by Barnsley [4] and uses a mathematical construct
called an iterated function system. It was extended by Fisher [27] to include a method
of partitioning called quad-tree partitioning. An example of quad-tree partitioning is shown
in Figure 4.5.
Figure 4.5: Example of quad-tree partitioning
The set of ranges are mapped onto the domains, which are twice the size of the ranges.
The mapping is an affine transformation which minimizes the root-mean-squared (RMS)
difference between the transformed domain pixel values and the range pixel values. The
coefficients of this transformation and the optimal domain are stored for the range.
Neither the full details of the algorithm nor the selection of domains will be discussed
here.
The algorithm is split into the classification stage and the partitioning stage. The
classification stage is when all domains are classified and the averages of 4 by 4 pixel
blocks of domains are computed (to make them the same size as ranges). The full details
of classification can be found in Fisher’s book [28]. Briefly, the classification of a range
or domain is based on the brightness of pixel values in each quadrant of this subimage.
That is, if ranges or domains have similar patterns in terms of pixel values, they will be
classified the same. A range will be encoded only to domains that share the same (or
similar) classification.
In the partitioning stage, the encoding tree is constructed starting with the root node
and its four quadrants. Additional nodes are added if the RMS calculated during the
mapping is above a threshold or if the specified minimum depth of the tree has not yet
been reached.
The implementation of quad-tree fractal compression was adapted from Fisher’s
implementation [27].
WebP
Google’s WebP [35] image compression algorithm is broken into the analysis and the
compression blocks.
In the analysis stage, the algorithm begins by assessing each macroblock (which is 16
by 16 pixels) for its susceptibility to quantization; that is, how many bits a
macroblock needs for storage as the quantization level is changed. Macroblocks with
similar susceptibilities are grouped together in “segments.” After this, the macroblocks
are analyzed again for information about bit-rate and distortion. If no good rate-
distortion trade-off (called bit-cost) is found at the macroblock level, the macroblock is
further decomposed into blocks of 4 by 4 pixels. The rate-distortion trade-off is then
recalculated at the block level. Throughout this process, the statistics on coefficient
distribution and quantization levels are stored. In the analysis stage, no encoding is
done.
The compression stage involves using the statistics and knowledge gained from the
analysis stage for the actual coding of blocks.
For WebP, there is a compression parameter that drastically changes the computational
demands of the analysis and compression phases. This parameter determines the depth
of block analysis that is performed, the amount of rate-distortion optimization, and the
type of entropy encoding. This parameter ranges from 1 to 5. Two versions of WebP were
used in this modeling process, with the parameter set to 2 and 4.
These values were chosen as they represented both the diversity of WebP as well as
realistic choices for this parameter. A value of 1 typically results in very poor quality; a
value of 2 increases the resource demands by a small amount and provides better overall
compression. On the higher end, a value of 5 invokes a computationally-demanding
entropy encoder which would be unsuitable for embedded applications; therefore, the next
step down was taken.
All standard blocks of transform coding methods are shown in Table 4.2.
Table 4.2: Standard compression blocks used for modeling
Transform “A” Blocks     Coding “B” Blocks
DCT Slow (8x8)           Huffman coding
DCT Slow (4x4)           EBCOT







This section describes the selection and extraction of algorithm parameters as well as the
available architecture parameters. The choice of algorithm parameters is done such that
they are high level and therefore easily extractable. These relate to high-level aspects
of the algorithm such as types of computations and an estimation of the number of
operations for both arithmetic and memory operations.
Also discussed are the architecture parameters of the soft processor. The chosen
parameters are limited to those that are available to change based on the particular soft
processor chosen. In this case, the Altera NIOS 2 was chosen as the soft processor.
4.2.1 Algorithm Parameters
The selection of parameters to represent an algorithm is crucial in the construction of an
accurate model. There is a close interaction between the algorithm being executed and
the architecture on which it is being executed. By having representative parameters,
the model can capture this interaction. At the same time, these parameters must be
high-level in representing the algorithm so they can be estimated prior to implementation.
This presents a challenge in selecting high-level parameters such that they can indicate
the algorithm’s dependency on the low-level processor architecture.
Two types of algorithm parameters have been chosen. The first (type I) are parameters
known to the designer that require knowledge only of the algorithm to be implemented.
The second (type II) are parameters that are estimated using knowledge of the algorithm




For the first type of parameters, the parameters chosen are shown in Table 4.3.
Table 4.3: Type I algorithm parameters
Parameter name                                  Variable type
Presence of floating-point operations           Binary
Arbitrary multiplication                        Binary
Arbitrary division                              Binary
Average working set size                        Discrete
Variable working set size                       Binary
Working set dependence                          Binary
Dependency set size                             Continuous
Dependency function arithmetic operations       Continuous
Dependency function memory operations           Continuous
Floating-point operations refers to the presence of a non-trivial amount of
floating-point arithmetic operations. The definition of “non-trivial” depends on the
designer and whether they feel the number of floating-point operations is significant
enough to affect power and execution time. Some algorithms contain a trivial number of
floating-point instructions; by instrumenting an execution of EBCOT, it was found to
contain 1,640 (0.004% of total) floating-point operations. By comparison, the
floating-point DCT contains 124,000 (7.5% of total) floating-point operations. As this is
decided by the designer, there is an element of human error; this issue is addressed
later in the thesis. As no floating-point hardware is considered in this work,
floating-point operations must be emulated in software, which can increase the execution
time of an algorithm.
Arbitrary multiplication and arbitrary division refer to the presence of
multiplications and divisions by arbitrary operands (that is, operands that are not
powers of two, 2^n).
The average working set size refers to the common set sizes that compression
algorithms work in. For instance, the DCT typically works on 8x8 pixel partitions of the
image. This value should give an indication on how much the algorithm will be affected
by both instruction and data cache sizes. The other related parameter is the variable
working set size. Some blocks, such as the DCT, always have the same working set
size. Others, such as EBCOT, have different sizes based on which stage the block is in.
The working set dependence parameter indicates whether or not there is a data
dependency between two separate working sets. In the DCT, there is no such dependency;
all 8 by 8 pixel blocks are computed separately with no interactions between any two
blocks. On the other hand, WebP is, at its core, intraframe prediction from video
compression which contains large amounts of data dependency between working sets.
Again, this parameter can show how much a block is affected by the size of the caches.
The data dependency function is how the blocks interact with each other and the nature
of this interaction. For instance, this function could be a distance calculation between
corresponding pixels, pixel comparison, and so on.
The dependency function f(B_{x,y}, S) has two parts: the dependency set size and
the dependency function operations. Here, B_{x,y} is the set of all image blocks to be
compressed and S is the set of blocks which are needed to compress B_{x,y}. For example,
in the WebP algorithm the current block is predicted using the block to the left of the
current block and the three blocks above. The final value for this parameter would then
be the cardinality of the set S, which would be
|S| = 4 × N_blocks
This parameter can also be used to indicate iterations, such as in the vector quantization
case where the codebook generation requires multiple runs. If there are N_CB entries in
the codebook, the dependency set size would be T × N_CB × N_blocks for T iterations.
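As a worked example of the two expressions above, the sketch below evaluates the dependency set size for a 128 by 128 pixel image. The block sizes, iteration count, and codebook size are illustrative assumptions chosen for the arithmetic and are not the settings used in the experiments; the function names are hypothetical.

```python
def webp_dependency_set_size(n_blocks, neighbours_per_block=4):
    """|S| when each block is predicted from a fixed number of neighbouring blocks."""
    return neighbours_per_block * n_blocks

def vq_dependency_set_size(iterations, codebook_size, n_blocks):
    """Dependency set size for codebook generation: T x N_CB x N_blocks."""
    return iterations * codebook_size * n_blocks

IMAGE_SIDE = 128
n_macroblocks = (IMAGE_SIDE // 16) ** 2  # assuming 16-by-16 pixel blocks
n_vectors = (IMAGE_SIDE // 4) ** 2       # assuming 4-by-4 pixel vectors

webp_size = webp_dependency_set_size(n_macroblocks)
vq_size = vq_dependency_set_size(iterations=10, codebook_size=256, n_blocks=n_vectors)
```

With these assumed values the WebP dependency set contains 256 blocks, while ten iterations of codebook generation give a dependency set size of 2,621,440, which shows how the same parameter captures both spatial dependence and iteration.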
The operation parameters describe the number of operations that are required for the
single execution of the dependency function. As an example, consider the pseudocode
for a distance calculation from VQ, shown in Listing 4.1.
float euclDistCalc(codebookBlock, imageBlock) {

    xCB = loadData(codebookBlock);
    xIB = loadData(imageBlock);

    dist = 0;
    for (i = 0; i < X_BLOCK_SIZE; i++) {
        for (j = 0; j < Y_BLOCK_SIZE; j++) {
            /* accumulate the distance between xCB[i][j] and xIB[i][j] into dist */
        }
    }

    return dist;
}
Listing 4.1: Dependency Function Example
For one iteration of the inner loop, there are four memory accesses: two to read the
elements of the blocks, one to read the current value of dist, and one to save the
updated value. Similarly, there are five arithmetic operations for the distance
calculation and two for the loop increments. The total number of arithmetic operations is
N_A = 7 × Y_BLOCK_SIZE × X_BLOCK_SIZE × N_blocks × N_entries.
The total number of memory operations for the dependency function is
N_M = 4 × Y_BLOCK_SIZE × X_BLOCK_SIZE × N_blocks × N_entries.
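The two totals can be evaluated directly once the block dimensions and counts are known. The sketch below does this for illustrative values (4-by-4 blocks, 1,024 image blocks, and 256 codebook entries), which are assumptions for the example and not the experimental settings; the function name is hypothetical.

```python
def dependency_operation_counts(x_block_size, y_block_size, n_blocks, n_entries,
                                arith_per_iter=7, mem_per_iter=4):
    """Return (N_A, N_M) for the distance calculation dependency function.

    arith_per_iter and mem_per_iter are the per-iteration counts of the inner
    loop: 5 + 2 arithmetic operations and 4 memory accesses.
    """
    inner_iterations = x_block_size * y_block_size * n_blocks * n_entries
    return arith_per_iter * inner_iterations, mem_per_iter * inner_iterations

n_a, n_m = dependency_operation_counts(
    x_block_size=4, y_block_size=4, n_blocks=1024, n_entries=256)
```

For these values the inner loop runs 4,194,304 times, giving N_A = 29,360,128 and N_M = 16,777,216.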
Ultimately, these dependency functions are simple enough to be known prior to
implementation, or they can be written in pseudocode with little overhead.
Type II Parameters
Type II parameters use the designer’s knowledge to estimate parameters of their desired
algorithm against the previously mentioned algorithms that were used to generate the
model.
Type II parameters are shown in Table 4.4.
Table 4.4: Type II algorithm parameters
Parameter name                                          Variable type
Total arithmetic operations                             Continuous
Total memory operations                                 Continuous
Total operations                                        Continuous
Ratio of divisions to arithmetic operations             Continuous
Ratio of multiplications to arithmetic operations       Continuous
Counts of total operations for memory and arithmetic operations are important
as they give an indication of the computational requirements of the algorithm block.
The number of total operations is derived automatically as the sum of the memory and
arithmetic operations.
The ratio of divisions to arithmetic operations and the ratio of multiplications
to arithmetic operations give the model an indication of the dependency of the
algorithm block on the higher-latency arithmetic operations, which can be affected by
the choice of specialized arithmetic hardware.
4.2.2 Soft-Processor Architecture Parameters
The selection of soft-processor architecture parameters is limited to those offered to
designers by the companies that provide them. The particular soft processor chosen for
this work is the Altera NIOS 2. However, the actual choice of soft processor is not likely
to affect the overall conclusions of this work; implementations of various soft processors
are likely to be similar enough not to cause results to differ by a significant amount. As
most soft processors are RISC-based processors [97] and share similar architectures, they
should exhibit similar performance using the same software. How this can be generalized
to hard processors is discussed in the following chapter.
The NIOS 2 comes in three different variations that differ in terms of available features
of the processor. These variations are named, from least to most fully featured, “e”, “s”,
and “f”. The NIOS 2/f variant offers useful features such as data caches, optional
hardware divide, and a longer datapath pipeline. This variant also features an optional
memory management unit (MMU) which, among other things, allows for virtual address
translation. The parameters that were examined are shown in Table 4.5.
Data and instruction cache sizes can range between 512 B and 64 kB in powers of 2.
However, the development board used has limited resources in terms of memory blocks.
This meant that the sizes of the caches were limited so that the sum of the two cache
sizes could not exceed roughly 41 kB, or the NIOS 2 would exceed the resources of the
Cyclone III chip.
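The effect of the board limit on the cache size space can be sketched as a simple filter over size pairs. This sketch assumes that the limit applies only to the sum of the two cache sizes; actual block memory use also depends on tag storage and other logic, so it is not expected to reproduce the combination counts reported in this work.

```python
from itertools import product

KIB = 1024
# Cache sizes from 512 B to 64 kB in powers of two, as described in the text.
CACHE_SIZES = [512 * 2**i for i in range(8)]


def allowed_cache_pairs(limit_bytes=41 * KIB):
    """Return (instruction, data) cache size pairs whose sum fits the limit.

    The 41 kB figure is approximate in the text, so it is left as a parameter.
    """
    return [
        (icache, dcache)
        for icache, dcache in product(CACHE_SIZES, repeat=2)
        if icache + dcache <= limit_bytes
    ]


pairs = allowed_cache_pairs()
print((32 * KIB, 8 * KIB) in pairs)   # 40 kB in total
print((32 * KIB, 16 * KIB) in pairs)  # 48 kB in total
```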
Cache behavior could also be changed by modifying the line size or through the use of
burst transfers. The instruction cache line size is fixed at 32 B while the data cache line
size can take the values of 4, 16, or 32 B. Burst transfers for the caches refer to how cache
lines are retrieved. Essentially, burst transfers control the addresses of memory that are
retrieved from main memory ensuring that entire cache lines are retrieved efficiently at
the expense of additional logic. In general, burst transfers increase memory bandwidth
in the cases of slower memory, such as DRAM [1].
The NIOS 2 gives the option of selecting the type of integer arithmetic hardware that
is used. Multiply instructions can be executed by the embedded multipliers on the FPGA,
by a multiplier that is implemented in logic units, or by emulation in software.
Divide instructions can be executed either by a logic unit-based hardware divider or by
emulation in software.






Instruction cache        Cache size           7
                         Burst transfers      2
Multipliers              Multiplier type      3
Division                 HW or SW division    2
Clock frequency                               50 MHz
Total combinations                            3,528
With board limitations                        3,288
The NIOS 2 allows for additional parameters to be changed, which include the addition
of tightly coupled caches, a Memory Management Unit (MMU), or a Memory Protection
Unit (MPU). Including tightly coupled caches increases the number of combinations
to 88,200, while including an MMU or MPU increases the number of combinations to
$4 \times 10^6$ and $9 \times 10^8$, respectively. If tightly coupled caches and an MMU or MPU are
considered, the number of combinations becomes $1 \times 10^8$ and $2 \times 10^{10}$, respectively. At
these numbers of combinations, the generation and handling of data becomes a problem
of its own. Variable clock rates are also not considered in this work.
There exist methods for sampling this space effectively [52, 61, 67, 105], but this is
beyond the scope of this work.
4.3 Conclusions
This chapter has described the variety of algorithms that will be used to generate the
high-level model. Each algorithm was decomposed into its basic building blocks that
consisted of computationally and logically distinct periods of execution.
From here, specific high-level parameters are extracted from these building blocks in
order to be used in the construction process. This was done by constraining the domain
from which the algorithms could be taken, which allows aspects of these algorithms to be
represented by these high-level parameters. These parameters consisted of two types:
type I and type II. Type I parameters refer to parameters which only require knowledge
of the algorithm being predicted and consist of characteristics such as working set size
or arithmetic precision. Type II parameters require knowledge of other algorithms in
order to estimate parameters comparatively, such as the total number of operations.
Finally, the parameters available to designers when using soft processors were discussed.
Using these parameters, a model can now be generated which can predict the power
and performance of an image compression algorithm early in the design phase.
5 Construction of a High-Level
Performance Model
This chapter examines the construction and validation of a high-level performance model
using architecture parameters as well as parameters extracted from the algorithm itself.
Bagged regression trees are used to create a predictive model using these high-level
parameters.
First, the method for collecting the data is discussed. This includes details such as the
equipment and applications used for the collection of power consumption and execution
time, respectively. Second, the performance of both the single- and dual-core models is
evaluated. Next, sensitivity analysis of the models is performed. This is important as
the power consumption and execution time models use parameters that are estimated
by the designer which makes them subject to human error. The chapter concludes with
an evaluation of the model performance on predicting the performance of variants of the
same block and how this applies to optimizations of algorithms.
5.1 Data Acquisition and Experiment Setup
The training data used in the construction of the predictive model is data measured
from the actual execution of algorithms running on either a single soft processor or
two soft processors. For each algorithm block, execution time and power consumption
measurements were taken on every combination of architecture parameters. This data is
subsequently used in the learning phase of the predictive model.
Power consumption measurements were obtained using sense resistors and a digital
multimeter (DMM). Execution time measurements were done using the NIOS 2 version
of gprof or Altera-provided hardware performance counters.
5.1.1 Soft Processor Configuration
In the single-core case, the NIOS 2 processor was configured to use the off-chip SDRAM
provided by the development board for both instruction and data memory. In the
dual-core case, each core was configured to use non-overlapping halves of the SDRAM
for instruction and data memory.
Scripts were created that generated all parameter combinations and the resulting
FPGA programming files. Additional scripts were used to automate the process of data
collection for each parameter combination.
5.1.2 Power Measurement
The development board used for the data collection is the Cyclone III Starter Kit.
It contains a Cyclone III EP3C25 FPGA in addition to SSRAM, SDRAM, and flash
memory. The board also contains components to assist in JTAG communication and
other I/O operations. To measure power, the development board offers two sense resistors
for measuring the current drawn by two of the power supply rails: the 1.2 V supply and
the 2.5 V supply. The FPGA core is the only device that uses the 1.2 V supply while
the 2.5 V supply is used by SSRAM, flash, SDRAM, and a small amount of I/O circuits.
The sense resistors for the power supplies both had a resistance of 10 mΩ. From here,
the 1.2 V core power supply will be known as the core power and the 2.5 V supply
used by the memory devices will be known as the off-chip device power or simply
device power.
A Keithley 199 DMM was used to measure the voltage across the sense resistors. It
was controlled via a PC using the IEEE-488 standard, typically called GPIB. It has a
resolution of 5½ digits at a measured sampling rate of 45 Hz. Even though the NIOS 2
processor runs at 50 MHz and the SDRAM at 100 MHz, there is sufficient capacitance
on the sense resistor to create an averaging effect to obtain an accurate measurement.
An average is taken over the execution of the algorithm block to obtain the average
current drawn.
The distribution of measurements for multiple runs of the same algorithm will now
be discussed. This is to ensure that the values used to construct the models represent
an accurate measurement. Distributions of measurements will be discussed in terms
of deltas which is the range of the measurements for an algorithm block running on a
particular set of architecture parameters.
The core power consumption for three different blocks (DCT 8x8, DCT Float, and
Huffman coding) was recorded five times for each combination of architecture parameters.
The range of core power measurements for each block for each architecture combination
is shown in Figure 5.1. The figure shows that the range of a power measurement is 5 mW
on average and as much as 25 mW, which is between 1% and 8%. This range is small
enough that it is unlikely to cause problems.
The range of device power measurements for each block for each architecture combination
is shown in Figure 5.2. The figure shows that the range of a power measurement is 8 mW
on average and as much as 51 mW, which is between 1% and 9%. Again, this range
is small enough that it is unlikely to cause problems.
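The delta used in this discussion is simply the range of the repeated readings for one block on one architecture combination. A minimal sketch follows; expressing the delta as a percentage of the mean reading is an assumption of the sketch, and the readings are hypothetical values rather than measured data.

```python
def measurement_delta(readings):
    """Range (max minus min) of repeated measurements of the same quantity."""
    return max(readings) - min(readings)


def delta_percent(readings):
    """Delta as a percentage of the mean reading (assumed reference value)."""
    mean = sum(readings) / len(readings)
    return 100.0 * measurement_delta(readings) / mean


# Hypothetical core power readings in mW from five runs of one block.
runs_mw = [310.0, 312.5, 311.0, 309.5, 313.0]
print("delta (mW):", measurement_delta(runs_mw))
print("delta (%):", round(delta_percent(runs_mw), 2))
```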







Figure 5.1: Core power measurement delta distribution









Figure 5.2: Off-chip device power measurement delta distribution
5.1.3 Execution Time Measurement
The measurement of execution time was done using two different methods based on
how many processors were being used. Gprof was used for single-core executions while
hardware counters were used for dual-core executions.
Single-core
The NIOS 2 supports the use of gprof, the GNU profiler. Gprof works by sampling the
program counter to determine the time spent in each function, and by injecting function
calls at compile time to record how often each function is called. Overhead introduced
by gprof is accounted for by gprof itself and is not included in the final measurement.
The other issue with gprof is the sampling error. In practice, the expected sampling
error of a function's measured time is the square root of the number of samples taken
in that function times the sampling period (1 ms) [91]. The
algorithm blocks executed here are repeated multiple times for a total execution time of
at least several seconds to ensure that sampling error was within 1-5% of the measured
values.
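To see why a run of several seconds is needed, the expected sampling error can be estimated from the run time. The sketch below assumes the square-root rule given in the gprof manual (for n samples, the expected error is the square root of n sampling periods) and the 1 ms sampling period quoted above; the function name is hypothetical.

```python
import math


def gprof_expected_error(run_time_s, sampling_period_s=0.001):
    """Return (absolute error in seconds, relative error) for a profiled run.

    Assumes the expected error is sqrt(n) sampling periods, where n is the
    number of samples collected over the run.
    """
    n_samples = run_time_s / sampling_period_s
    abs_error_s = math.sqrt(n_samples) * sampling_period_s
    return abs_error_s, abs_error_s / run_time_s


for run_time in (0.4, 2.0, 10.0):
    abs_err, rel_err = gprof_expected_error(run_time)
    print(f"{run_time:5.1f} s run: +/- {abs_err * 1000:.1f} ms ({rel_err:.1%})")
```

Under these assumptions the relative error falls as one over the square root of the number of samples, so lengthening the run by a factor of 25 reduces the relative error by a factor of 5.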
However, there exists some statistical error in the measurements. The same process of
multiple measurements used in the core and device power case (discussed in Section 5.1.2)
was used for execution time. The distribution of sample measurements is shown in
Figure 5.3. For the three blocks being measured, the measurement can have a range of,
on average, 34 ms with a maximum of 210 ms. Even with the fastest executing blocks,
34 ms is roughly 8% of the total execution time.









Figure 5.3: Gprof time measurement distribution
Dual-core
Using gprof in the dual-core case caused problems that prevented proper
execution when running both cores at the same time, so an alternative solution had to be
used. This alternative was the use of hardware performance counters that are provided
by Altera. As additional LUTs are used for the hardware counters, additional power is
consumed and thus hardware counters incur a power overhead. However, as the additional
LUTs number roughly 700 (about a 7% increase), any additional power consumption is
negligible.
Like the gprof-based method, there exists some statistical error in the measurements.
This is shown in Figure 5.4. This shows both the total distribution and then the
distribution of deltas under 100 ms. The total distribution shows nearly all deltas are
small with only a small number of deltas with values greater than 400 ms. These values
are likely caused by runs that experienced errors; as discussed previously, the dual-core
NIOS 2 systems experienced a number of difficult-to-troubleshoot issues that prevented
proper execution. However, the sub-100 ms delta histogram shows that the differences
using the performance counter, when properly executed, are very small with an average
delta of 0.8 ms.







(a) All measurement deltas






(b) Measurement deltas under 100 ms
Figure 5.4: Performance counter time measurement distribution
5.1.4 Algorithm Block Execution
To obtain accurate power consumption and execution time measurements, the
algorithm blocks needed to be executed for at least a couple of seconds. However, as a
number of blocks execute quickly, they needed to be executed a number of times.
To do this, care had to be taken to flush the caches between iterations. The pseudocode
and comments for the algorithm execution are shown in Listing 5.1.
#include <sys/alt_cache.h>  /* NIOS 2 HAL cache-flush routines */

/* Will differ based on average block execution time */
#define ITERATIONS 10

/* Placeholder prototypes; each algorithm block supplies its own versions */
extern void *loadData(void);
extern void block_main(void *x);

int main(void) {
    /* Load in image for transform blocks or coefficients for coding blocks */
    void *x = loadData();

    for (int i = 0; i < ITERATIONS; i++) {
        /* Flush data cache */
        alt_dcache_flush_all();

        /* Flush instruction cache */
        alt_icache_flush_all();

        /* Start actual main call */
        block_main(x);
    }
    return 0;
}
Listing 5.1: Main function framework
5.1.5 Training Data Generation: Single-core
Every algorithm block is executed on every combination of architecture parameters
(Table 4.5) and its execution time and power consumption measured for each execution.
From here, the measurements from separate blocks were combined. First, the
combination of an “A” block and a “B” block (Table 4.2) is taken. For each combination
of architecture parameters, the power consumptions of the two blocks are averaged and
their execution times are summed. Only those blocks listed in Table 4.2 are used
in combination; vector quantization, fractal compression, and WebP blocks are not
combined with other methods.
Each combination of algorithm blocks therefore generates N training vectors for
each of the three metrics where N is the total number of combinations of architecture
parameters. Ultimately, an individual training vector will consist of the parameters of
the blocks and the parameters of the architecture that the blocks were executed on.
An additional set of parameters is added to the training vectors to assist in training.
These additional parameters are the sums of the corresponding parameters for each
block. In this way, the training data incorporates the combination of parameters instead
of each parameter on its own.
For the single-core case, this means there will be 3,288 input vectors per algorithm
combination consisting of 7 architecture parameters, 12 block A parameters, 12 block B
parameters, and 12 block A+B parameters.
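The assembly of one training vector can be sketched as follows. The layout (7 architecture parameters, then 12 block A, 12 block B, and 12 summed parameters) follows the description above; the function name, dictionary keys, and the placeholder values are assumptions of this sketch, and any encoding of categorical parameters is left out.

```python
def build_training_vector(arch_params, block_a, block_b, meas_a, meas_b):
    """Combine an "A" block and a "B" block into one training example.

    arch_params: 7 architecture parameter values.
    block_a, block_b: 12 algorithm parameter values each.
    meas_a, meas_b: measurements with keys "core_power", "device_power"
    and "exec_time" taken on the same architecture combination.
    """
    summed = [a + b for a, b in zip(block_a, block_b)]  # block A+B parameters
    features = list(arch_params) + list(block_a) + list(block_b) + summed
    targets = {
        "core_power": (meas_a["core_power"] + meas_b["core_power"]) / 2.0,
        "device_power": (meas_a["device_power"] + meas_b["device_power"]) / 2.0,
        "exec_time": meas_a["exec_time"] + meas_b["exec_time"],
    }
    return features, targets


# Placeholder values only, to show the shapes involved.
features, targets = build_training_vector(
    arch_params=[0] * 7,
    block_a=[1.0] * 12,
    block_b=[2.0] * 12,
    meas_a={"core_power": 300.0, "device_power": 600.0, "exec_time": 1500.0},
    meas_b={"core_power": 310.0, "device_power": 620.0, "exec_time": 2500.0},
)
print(len(features), targets)
```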
5.1.6 Training Data Generation: Dual-core
The generation of training data for the dual-core model construction is identical to the
single-core case with only a few differences.
First, the dual-core case will have the same number of variables per vector but instead
have 1,088 input vectors compared to the 3,288 for the single-core case. The difference is
due to the number of available memory blocks on the FPGA and each processor having
the same sized caches. Execution time for a block is the maximum of the execution
times from both processors.
The only other difference is in the WebP algorithm. WebP method 2 was used in
the single-core case but was changed to WebP method 1 in the dual-core case. When
running concurrently on two processors, WebP method 2 caused unidentified problems
that prevented proper execution of the algorithm.
5.2 Model Validation for Single-Core Prediction Model
This section looks at the validation results of the model and its associated error in
predicting unseen algorithms. First, cross-validation, a type of model assessment, is
discussed. This is followed by the results of applying this cross-validation to the data
collected by the methods outlined in Section 5.1.
The accuracy of the single-core model is discussed first. The following section examines
the accuracy of the dual-core model.
5.2.1 Validation Method
For the final version of the model, all combinations of algorithm blocks would be used to
generate it. However, there is a need here to perform validation on the model to have an
idea of how well it can generalize to an unseen algorithm. To this end, cross-validation
is used.
Cross-validation [48] is a method that tests how well a predictive model can generalize
to an unknown test set. k-fold cross-validation works by dividing the data into k subsets
and then generating the model using k − 1 subsets while using the remaining subset as a
test set. This is repeated k times so that each subset is used as the test set once.
In the case of models generated here, k is the number of algorithm combinations as
discussed in the previous section. This work will present the results for
each fold so the model can be evaluated for its predictive performance for any given
algorithm. The idea is that the folds were chosen at a high level of abstraction and
therefore the analysis must be done at this high level as well; it will be seen that the
algorithm being predicted has a large effect on the accuracy of the model.
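With one fold per algorithm combination, each fold holds out every sample belonging to one algorithm and trains on the rest. The sketch below shows only the fold construction; the sample layout and the function name are assumptions, and model training is omitted.

```python
def leave_one_algorithm_out(samples):
    """Yield (held_out_algorithm, train, test) for each algorithm in the data.

    samples: list of (algorithm_name, features, target) tuples. Every sample
    of the held-out algorithm goes to the test set, so the model is always
    evaluated on an algorithm it has not seen during training.
    """
    algorithms = sorted({name for name, _, _ in samples})
    for held_out in algorithms:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Hypothetical data: two architecture combinations for each of three algorithms.
data = [
    ("DCT 8x8+Huffman", [1, 0], 310.0), ("DCT 8x8+Huffman", [2, 0], 315.0),
    ("PCT+JXR", [1, 1], 305.0), ("PCT+JXR", [2, 1], 309.0),
    ("Fractal", [1, 2], 330.0), ("Fractal", [2, 2], 338.0),
]
for name, train, test in leave_one_algorithm_out(data):
    print(name, len(train), len(test))
```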
One issue that must be addressed here is the fact that regression forests are unable to
perform extrapolation. Because of this, care must be taken when including algorithms
in error analysis whose responses are at the lower and upper limits; models generated
attempting to predict these extreme values will invariably perform poorly. If the extreme
values are shared by a number of algorithms then the inability of the model to extrapolate
is not a problem.
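The inability to extrapolate follows from how a regression tree predicts: each leaf returns the mean of the training responses that fall into it, so no prediction can lie outside the range of the training responses. The sketch below demonstrates this with a deliberately small one-dimensional tree written for this illustration; it is not the bagged ensemble used in this work, although averaging several such trees is bounded in the same way.

```python
def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(xs, ys):
    """Fit a 1-D regression tree, splitting until each leaf holds one x value."""
    if len(set(xs)) == 1:
        return ("leaf", sum(ys) / len(ys))
    pairs = sorted(zip(xs, ys))
    best_sse, best_i = None, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        sse = _sse([y for _, y in pairs[:i]]) + _sse([y for _, y in pairs[i:]])
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2.0
    left, right = pairs[:best_i], pairs[best_i:]
    return (
        "split",
        threshold,
        fit_tree([x for x, _ in left], [y for _, y in left]),
        fit_tree([x for x, _ in right], [y for _, y in right]),
    )


def predict(tree, x):
    """Walk the tree to a leaf and return its mean response."""
    while tree[0] == "split":
        _, threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree[1]


# Training data follows y = 2x for x = 0..9, so the largest response is 18.
xs = list(range(10))
ys = [2.0 * x for x in xs]
tree = fit_tree(xs, ys)
print(predict(tree, 4))    # inside the training range
print(predict(tree, 100))  # far outside: capped at the largest leaf mean
```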
For core power, the distribution of all measured values is shown in Figure 5.5. In
the tails of the distribution, the values are from a number of different algorithms. The
same is true for off-chip device power, shown in Figure 5.6. In these cases, the exclusion
of certain algorithms will not be a problem.






Figure 5.5: Single-core: Core power measurement distribution








Figure 5.6: Single-core: Device power measurement distribution
The measured values for execution time across all algorithms are shown in Figure 5.7.
While there is a high concentration of values in short execution times, there is a very
long tail which contains the execution times from a small number of algorithms.
Execution times are much more dispersed than those of core or device power measurements.




\[ \mathrm{IQR} = Q_3 - Q_1 \]
where $Q_1$ and $Q_3$ are the first and third quartiles and IQR is the interquartile range.
The QED is useful as it provides a scale-invariant metric for dispersion. For core and
device power measurements, the QED is 0.05 and 0.04, respectively. For execution time,









Figure 5.7: Single-core: Execution time measurement distribution
this value is 0.76, indicating a data set that is much more dispersed. What this means
for regression trees is that problems can arise due to lack of data when characterizing
less dense regions.
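A quartile-based dispersion measure of this kind can be computed with the standard library. The sketch below assumes that the QED corresponds to the standard quartile coefficient of dispersion, $(Q_3 - Q_1)/(Q_3 + Q_1)$; that definition is an assumption of the sketch, and the data values are hypothetical.

```python
import statistics


def quartile_dispersion(values):
    """Return (IQR, quartile coefficient of dispersion) for positive data.

    Assumed definition: (Q3 - Q1) / (Q3 + Q1). The quartiles come from
    statistics.quantiles with its default (exclusive) method.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return iqr, iqr / (q3 + q1)


# Hypothetical execution times in ms with a long tail.
times_ms = [40.0, 55.0, 60.0, 80.0, 120.0, 900.0, 5000.0, 60000.0]
iqr, qcd = quartile_dispersion(times_ms)
print("IQR (ms):", iqr)
print("dispersion:", round(qcd, 3))

# Rescaling the data (ms to s) changes the IQR but not the coefficient.
_, qcd_seconds = quartile_dispersion([t / 1000.0 for t in times_ms])
print("dispersion in seconds:", round(qcd_seconds, 3))
```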
The longest execution times belong to a single algorithm: the fractal compression
algorithm. The distribution of algorithms in the tail, starting at $6 \times 10^4$ ms, is shown
in Table 5.1. Further, fractal compression is the only algorithm to have any execution
times over $13 \times 10^4$ ms. While fractal compression has the longest execution times, the
other algorithms in Table 5.1 can be expected to have higher levels of error due to this
same issue.
Table 5.1: Execution time: Membership of algorithms in “tail”
Algorithm Percent of samples
Fractal 61%
Vector quantization 28%
WebP method 4 11%
For validation, both absolute and relative error values will be used. Absolute error $A$
for a predicted value $p$ and a measured value $m$ is defined as
\[ A = p - m. \]





In addition to these, the coefficient of determination, $R^2$, is also given so the
model performance can be evaluated. $R^2$ is a measure of how well a given
model fits a set of data. If $y$ is the set of measured values in the test set, $\bar{y}$ is the mean of
all measured values in the test set, and $\hat{y}$ is the set of predicted values, $R^2$ is defined as
\[ R^2 = 1 - \frac{SS_R}{SS_T} \]
where $SS_R$ is the residual sum of squares and $SS_T$ is the total sum of squares. These
are given by
\[ SS_R = \sum_i (y_i - \hat{y}_i)^2, \qquad SS_T = \sum_i (y_i - \bar{y})^2. \]
For non-linear regression methods, such as regression trees, $R^2$ is in the interval
$(-\infty, 1]$. A negative $R^2$ value indicates that using the mean of the measured values
(instead of the predicted values) results in a better fit than the model being examined.
As the models are dealing with estimated values, this error is actually the residual
error. However, for brevity any reference to error in the model evaluation process should
be taken to mean the residual error of the test data.
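The error metrics used for validation can be sketched in a few lines. The absolute error follows the definition above; the relative error is assumed here to be the absolute error divided by the measured value, and the coefficient of determination uses the usual residual and total sums of squares. The function names and data are hypothetical.

```python
def absolute_error(predicted, measured):
    """A = p - m."""
    return predicted - measured


def relative_error(predicted, measured):
    """Assumed definition: (p - m) / m, as a fraction of the measured value."""
    return (predicted - measured) / measured


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_R / SS_T over a test set."""
    mean = sum(measured) / len(measured)
    ss_r = sum((y, y_hat)[0] ** 0 * (y - y_hat) ** 2 for y, y_hat in zip(measured, predicted))
    ss_t = sum((y - mean) ** 2 for y in measured)
    return 1.0 - ss_r / ss_t


measured = [1.0, 2.0, 3.0, 4.0]
print(r_squared(measured, measured))              # perfect predictions
print(r_squared(measured, [2.5] * 4))             # predicting the mean
print(r_squared(measured, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

The last case illustrates the negative values discussed above: a model that fits worse than the mean of the measured values yields a value below zero.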
Histograms will be presented below that show both the absolute and relative error in
predicting all test sets using cross-validation. These test sets consist of a test vector for
every architecture combination and the parameters of the algorithm being tested.
For each performance metric, the analysis begins by examining individual folds of
the training data. This is followed by an examination of the distribution of error for
each metric along with the empirical Cumulative Distribution Function (CDF). Finally, the
contribution of each variable to the reduction of the model error will be shown.
5.2.2 FPGA Core Power
FPGA core power represents the total power consumption of all LUTs, registers, multi-
pliers, and memory blocks being used on the FPGA core during execution. The value
being predicted by this model is the average current drawn over the entire execution of
the algorithm.
The response variable is power consumed by the core. However, the value taken from
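the multimeter is the voltage across the 10 mΩ sense resistor. So for a meter reading of
2.59 mV, the current drawn is
\[ I = \frac{V}{R} = \frac{2.59\ \mathrm{mV}}{10\ \mathrm{m\Omega}} = 0.259\ \mathrm{A} = 259\ \mathrm{mA}. \]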
The actual core power would be
P = IV = (259 mA)× (1.2 V) = 311 mW.
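The conversion from a meter reading to rail power is a direct application of Ohm's law followed by P = IV. A minimal sketch, using the 10 mΩ sense resistor and a reading of 2.59 mV (the voltage that corresponds to 0.259 A); the function name is hypothetical.

```python
def rail_power_mw(v_sense_mv, rail_voltage_v, r_sense_mohm=10.0):
    """Convert a sense-resistor voltage reading into rail power in mW.

    v_sense_mv / r_sense_mohm gives the current in amperes (mV / mOhm = A);
    multiplying by the rail voltage gives watts, scaled here to milliwatts.
    """
    current_a = v_sense_mv / r_sense_mohm
    return current_a * rail_voltage_v * 1000.0


print(round(rail_power_mw(2.59, 1.2), 1))  # core rail (1.2 V)
print(round(rail_power_mw(2.59, 2.5), 1))  # off-chip device rail (2.5 V)
```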
The full overview of the error for each fold in cross-validation, as well as the mean
error, is shown in Table 5.2.
This table shows that predicting an algorithm’s core power consumption is accurate
with an overall mean absolute error of only 2.47 mW, with a minimum error of 1.35 mW
and maximum of 5.20 mW. The $R^2$ values are all close to 1 and range from 0.82 to 0.99
Table 5.2: Single core: Core power per algorithm fold performance
Algorithm Mean |A| Mean |R| R2
DCT 4x4+EBCOT 1.5 0.8% 0.98
DCT 8x8+EBCOT 1.4 0.7% 0.99
DCT Float+EBCOT 2.2 1.1% 0.97
DCT Fast+EBCOT 1.7 0.8% 0.98
DCT Float+Huffman 2.3 1.2% 0.97
DCT Fast+Huffman 1.8 0.9% 0.98
DCT 4x4+Huffman 1.5 0.8% 0.98
DCT 8x8+Huffman 1.4 0.7% 0.99
DCT 4x4+JXR 1.5 0.8% 0.98
DCT 8x8+JXR 1.5 0.8% 0.98
DCT Float+JXR 2.4 1.3% 0.97
DCT Fast+JXR 2.0 1.0% 0.97
DWT Irr.+EBCOT 3.0 1.5% 0.94
DWT Irr.+Huffman 2.9 1.5% 0.93
DWT Irr.+JXR 3.1 1.6% 0.93
DWT Rev.+EBCOT 3.0 1.6% 0.92
DWT Rev.+Huffman 2.5 1.3% 0.95
DWT Rev.+JXR 3.2 1.7% 0.93
Fractal 3.9 1.9% 0.92
PCT+EBCOT 2.3 1.1% 0.96
PCT+Huffman 2.0 1.0% 0.97
PCT+JXR 1.9 1.0% 0.97
Vector Quantization 5.2 2.7% 0.82
WebP (method 2) 3.8 2.0% 0.93
WebP (method 4) 3.7 1.8% 0.94
Mean 2.5 1.3%
which indicate that all models fit the test data well.
The histograms of prediction error for all models are shown in Figure 5.8.
Similarly, the empirical CDFs of relative and absolute error are shown in Figure 5.9.
This empirical CDF shows the concentration of error values. It shows that 80% of
predictions have error under 4 mW (1.99% relative error) and 90% of predictions are
under 5.5 mW (2.79% relative error). Further, this shows that 98% of the predictions
have absolute error under 10 mW.
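Statements of the form "80% of predictions have error under x" are read off the empirical CDF of the absolute errors. The sketch below shows one way to compute such thresholds; the function names and the error values are hypothetical.

```python
import math


def empirical_cdf(errors):
    """Return sorted (|error|, cumulative fraction) points of the empirical CDF."""
    ordered = sorted(abs(e) for e in errors)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


def error_at_fraction(errors, fraction):
    """Smallest |error| such that at least `fraction` of samples are at or below it."""
    ordered = sorted(abs(e) for e in errors)
    index = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[index]


# Hypothetical signed prediction errors in mW.
errors_mw = [1.0, -2.0, 3.0, -4.0, 5.0, 6.0, -7.0, 8.0, 9.0, -10.0]
print(error_at_fraction(errors_mw, 0.5))
print(error_at_fraction(errors_mw, 1.0))
print(empirical_cdf(errors_mw)[:3])
```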
An important aspect of regression trees is their ability to show the most influential











Figure 5.8: Error histograms of test sets for single-core power predictions for absolute
error (a) and relative error (b)






















Figure 5.9: Empirical CDFs of test sets for single-core power predictions for absolute
error (a) and relative error (b)
parameters based on their importance in the tree splits. The top twenty variables in
terms of importance are shown in Table 5.3. What the table shows is the average change
in mean squared error (MSE) for the top 20 variables in descending order. The variables
at the top indicate those which contribute the most to decrease in the error of the model
in fitting the training set. The table shows that core power is most influenced by the
architecture parameters, with the top six variables all being architecture parameters.
In terms of algorithm parameters, arithmetic aspects are the more important variables.
As most algorithms use a transform, this suggests that the characteristics of this transform
and its relationship to the underlying arithmetic hardware have a large effect on core
power consumption. The importance of the actual number of arithmetic operations can
only be speculated about; having a large number of arithmetic operations suggests the
datapath is using the ALUs constantly, which increases dynamic power consumption.
In terms of framework generalization, core power is very dependent on the underlying
technology. Larger cache sizes and LUT-based multipliers, for example, require more
logic elements and interconnect which increases power consumption. Only the slight
differences between the different soft processors would need to be taken into account in
order to make this model more general. While this stays true for other programmable
logic devices (such as CPLDs), it is not the case for hard processors, where there is a more
complicated relationship between cache sizes and implementation details. However, core
power can give an indication of the activity level of pipeline components which is largely
processor-independent. Using this information, parameters could be inferred that would
allow the use of Function Level Power Analysis (FLPA) methods in addition to the
model presented here to predict core power consumption of hard processors.
Table 5.3: Single core: Core power ensemble variable importance
Group          Variable name                  ∆ MSE
Architecture   Data cache size                $5.62 \times 10^{-7}$
Architecture   Inst. cache size               $5.34 \times 10^{-8}$
Architecture   Data cache line size           $4.64 \times 10^{-8}$
Architecture   Multiplier type                $2.94 \times 10^{-8}$
Architecture   Divider type                   $5.58 \times 10^{-9}$
Architecture   Inst. cache burst transfers    $4.16 \times 10^{-9}$
Block B        Average working set size       $4.03 \times 10^{-9}$
Block A        Multiplication ratio           $3.91 \times 10^{-9}$
Block A        Arithmetic operations          $2.61 \times 10^{-9}$
Block B        Arithmetic operations          $2.57 \times 10^{-9}$
Architecture   Data cache burst transfers     $2.37 \times 10^{-9}$
Block A        Average working set size       $2.32 \times 10^{-9}$
Total          Multiplication ratio           $2.24 \times 10^{-9}$
Block A        Precision                      $2.19 \times 10^{-9}$
Total          Arithmetic operations          $1.91 \times 10^{-9}$
Block B        Memory operations              $1.25 \times 10^{-9}$
Total          Memory operations              $1.06 \times 10^{-9}$
Total          Average working set size       $7.52 \times 10^{-10}$
Total          Total operations               $7.11 \times 10^{-10}$
Block B        Multiplication ratio           $6.55 \times 10^{-10}$
5.2.3 Off-chip Device Power
The device power is the power consumed by a number of off-chip devices on the
development board. It is measured using the sense resistor on the 2.5 V power supply.
The 2.5 V supply powers the DDR RAM, the SSRAM, the flash memory, and some
small I/O circuits. For this work, the SSRAM and the flash were not used. Ultimately,
this device power represents how the activity of the soft processor relates to memory
usage and access to off-chip devices.
Similar to core power, the response variable is power consumed by the off-chip devices.
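However, the value taken from the multimeter is the voltage across the 10 mΩ sense
resistor. For example, a meter reading of 2.59 mV corresponds to a current of
\[ I = \frac{V}{R} = \frac{2.59\ \mathrm{mV}}{10\ \mathrm{m\Omega}} = 0.259\ \mathrm{A} = 259\ \mathrm{mA}. \]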
The actual device power would be
P = IV = (259 mA)× (2.5 V) = 647 mW.
The full overview of the various errors for each fold in cross-validation is shown in
Table 5.4.
Table 5.4: Single core: Device power per algorithm fold performance
Algorithm Mean |A| Mean |R| R2
DCT 4x4+EBCOT 6.1 1.1% 0.90
DCT 8x8+EBCOT 5.5 1.0% 0.94
DCT Float+EBCOT 11.6 2.1% 0.84
DCT Fast+EBCOT 7.8 1.4% 0.87
DCT Float+Huffman 9.0 1.7% 0.85
DCT Fast+Huffman 6.1 1.1% 0.87
DCT 4x4+Huffman 5.4 1.0% 0.89
DCT 8x8+Huffman 5.2 1.0% 0.92
DCT 4x4+JXR 4.7 0.9% 0.94
DCT 8x8+JXR 4.9 0.9% 0.95
DCT Float+JXR 8.2 1.5% 0.91
DCT Fast+JXR 6.2 1.2% 0.90
DWT Irr.+EBCOT 15.4 2.7% 0.71
DWT Irr.+Huffman 14.8 2.7% 0.53
DWT Irr.+JXR 8.7 1.6% 0.89
DWT Rev.+EBCOT 12.8 2.3% 0.30
DWT Rev.+Huffman 7.4 1.4% 0.63
DWT Rev.+JXR 12.8 2.4% 0.29
Fractal 8.8 1.6% 0.89
PCT+EBCOT 7.0 1.2% 0.88
PCT+Huffman 5.1 0.9% 0.90
PCT+JXR 5.7 1.0% 0.91
Vector Quantization 18.0 3.4% 0.57
WebP (method 2) 7.6 1.4% 0.89
WebP (method 4) 8.5 1.5% 0.88
Mean 8.5 1.6%











Figure 5.10: Error histograms of test sets for single-core device power predictions for
absolute error (a) and relative error (b)
The CDFs of relative and absolute error are shown in Figure 5.11. It shows that 80%
of predictions have error under 13 mW (2.40% relative error) and 90% of predictions are
under 18.5 mW (3.36% relative error).






















Figure 5.11: Empirical CDFs of test sets for single-core device power predictions for
absolute error (a) and relative error (b)
Here, device power predictions are still accurate with an overall mean absolute error
of only 8.53 mW, with a minimum error of 4.71 mW and a maximum of 17.96 mW.
The $R^2$ values show that the model does not fit the data as well as the core power model
and range from 0.29 to 0.95.
The top twenty variables in terms of importance are shown in Table 5.5. Similar to
core power, cache sizes are again the most important variables. However, core power
consumption increases with a larger cache size. Device power has an inverse relationship
to cache size: larger caches mean off-chip devices are accessed less often (due to fewer
cache misses) and therefore consume less power. Other cache parameters, such as burst
transfers and line size have a smaller, but still important effect on the device power
consumption.
Unlike core power, off-chip device power consumption is virtually independent of the
implementation of the processor itself, whether as a soft-core or a hard-core processor.
The off-chip device power model is therefore dependent on the memory technology. The
Cyclone III starter kit uses SDRAM which is typical in many embedded situations,
making the off-chip device model generalizable to many system configurations regardless
of processor implementation.
Table 5.5: Single core: Off-chip device power ensemble variable importance
Group          Variable name                  ∆ MSE
Architecture   Inst. cache size               $2.08 \times 10^{-7}$
Architecture   Data cache size                $1.19 \times 10^{-7}$
Architecture   Inst. cache burst transfers    $4.46 \times 10^{-8}$
Architecture   Data cache line size           $3.54 \times 10^{-8}$
Total          Total operations               $3.36 \times 10^{-8}$
Architecture   Data cache burst transfers     $2.47 \times 10^{-8}$
Architecture   Multiplier type                $1.72 \times 10^{-8}$
Total          Arithmetic operations          $1.39 \times 10^{-8}$
Block A        Precision                      $9.49 \times 10^{-9}$
Total          Memory operations              $7.38 \times 10^{-9}$
Block A        Arithmetic operations          $6.85 \times 10^{-9}$
Block A        Multiplication ratio           $5.32 \times 10^{-9}$
Block B        Average working set size       $4.98 \times 10^{-9}$
Architecture   Divider type                   $4.84 \times 10^{-9}$
Block B        Arithmetic operations          $4.73 \times 10^{-9}$
Block A        Average working set size       $4.16 \times 10^{-9}$
Block A        Total operations               $3.49 \times 10^{-9}$
Block B        Multiplication ratio           $3.43 \times 10^{-9}$
Block B        Memory operations              $2.12 \times 10^{-9}$
Total          Average working set size       $2.04 \times 10^{-9}$
5.2.4 Execution Time
For this work, execution time is considered as the sum of the execution times for both
algorithm blocks.
An overview of the performance on each of the folds is shown in Table 5.6. As discussed
above, fractal compression has the longest running times and therefore poor performance
is inevitable as the model is unable to perform extrapolation.
The histograms for prediction error for all models are shown in Figure 5.12.
The CDFs of relative and absolute error are shown in Figure 5.13. The CDF shows
that 80% of the predictions have error under 2223 ms (28% relative error) and 90% of
the predictions under 4532 ms (51% relative error). By not including fractal compression
in these statistics, the 80% and 90% marks fall to 1874 ms (24% relative error) and
3361 ms (39% relative error), respectively.
Table 5.6: Single core: Execution time per algorithm fold performance
Algorithm              Mean |A| (ms)   Mean |R|   $R^2$
DCT 4x4+EBCOT          70              1.0%       1.00
DCT 8x8+EBCOT          60              0.9%       1.00
DCT Float+EBCOT        2495            24.4%      0.56
DCT Fast+EBCOT         1266            16.8%      0.76
DCT Float+Huffman      2498            76.7%      −2.44
DCT Fast+Huffman       52              8.0%       0.81
DCT 4x4+Huffman        64              9.5%       0.70
DCT 8x8+Huffman        42              5.9%       0.88
DCT 4x4+JXR            64              3.7%       0.98
DCT 8x8+JXR            47              2.7%       0.99
DCT Float+JXR          1017            24.8%      0.67
DCT Fast+JXR           56              3.4%       0.99
DWT Irr.+EBCOT         1694            18.0%      0.74
DWT Irr.+Huffman       424             15.8%      0.78
DWT Irr.+JXR           1429            40.4%      −0.01
DWT Rev.+EBCOT         205             2.8%       0.99
DWT Rev.+Huffman       76              11.0%      0.45
DWT Rev.+JXR           60              3.9%       0.99
Fractal                55348           88.3%      −2.80
PCT+EBCOT              83              1.1%       1.00
PCT+Huffman            71              11.3%      0.34
PCT+JXR                76              4.3%       0.98
Vector Quantization    10454           31.6%      0.73
WebP (method 2)        5565            38.4%      −0.09
WebP (method 4)        3716            12.3%      0.91
Mean                   3477            18.3%
The top twenty variables in terms of importance are shown in Table 5.7. The top
variables are those associated with arithmetic operations. The number of arithmetic
operations for both blocks, along with the multiplier type, show that this is one of the
more important aspects of being able to predict the execution time of an algorithm.
Though the overall mean relative error is 18.3%, the model exhibits high levels of
error in four cases: floating-point DCT+Huffman, irreversible DWT+JPEG XR coding,
fractal compression, and WebP method 2.
As discussed previously, the high error with fractal compression can be explained by
Figure 5.12: Error histograms of test sets for single-core execution time predictions for
absolute error (a) and relative error (b)
























Figure 5.13: Empirical CDFs of test sets for single-core execution time predictions for
absolute error (a) and relative error (b)
the fact that the ensemble is attempting to extrapolate. Ultimately, this should not
be an issue for users of this framework as it is unlikely that users will want to have
an algorithm with a longer execution time than fractal; the longest execution time for
fractal compression is 160 × 10^3 ms for a single 128-by-128 pixel image.
The error in the other cases is due to the relationship between variable importance
and the algorithm being predicted. For instance, the floating-point DCT has very similar
parameters to the other DCT variants. When excluding it from the construction process,
Table 5.7: Single core: Execution time ensemble variable importance
Group         Variable name                ∆ MSE
Block B       Arithmetic operations        9.68 × 10^3
Architecture  Multiplier type              1.75 × 10^3
Architecture  Inst. cache size             1.66 × 10^3
Block A       Arithmetic operations        4.53 × 10^2
Block A       Precision                    4.11 × 10^2
Architecture  Data cache size              2.33 × 10^2
Block A       Dependency set size          2.26 × 10^2
Block B       Multiplication ratio         1.96 × 10^2
Block A       Variable working set size    1.73 × 10^2
Total         Memory operations            1.64 × 10^2
Architecture  Data cache line size         8.11 × 10^1
Total         Total operations             8.07 × 10^1
Block A       Average working set size     5.64 × 10^1
Total         Arithmetic operations        5.22 × 10^1
Block A       Memory operations            3.18 × 10^1
Architecture  Divider type                 1.42 × 10^1
Block A       Total operations             3.53 × 10^0
Block A       Multiplication ratio         3.47 × 10^0
Block B       Average working set size     3.32 × 10^0
Architecture  Inst. cache burst transfers  4.49 × 10^-1
the arithmetic precision parameter becomes less well-defined. From Table 5.7, it
can be seen that both the arithmetic operations and the precision of Block A are important
in predicting execution time. In fact, higher levels of error can be seen in all of the
algorithms that contain the floating-point DCT. This same phenomenon is the cause of
high levels of error for different algorithms on different parameters. Ultimately, this is
an effect of the chosen validation process. In practice, these ill-defined parameters can
be determined using node statistics and the outlier measure, which maintains the model’s
feasibility in these cases.
Models for execution time are highly dependent on the implementation of a processor,
either as a soft or hard processor. Execution time is reported here as “wall-clock” time
in milliseconds; expressing this as a clock cycle count would allow the model to be made
more general. Generalization of the model would then rely on the non-trivial task of
characterizing the differences in clock cycle latency of components in the pipeline. The
latency due to off-chip memory devices would not be affected by processor choice but
still plays an important role in determining execution time. Because of this, the latency
of other off-chip memory technologies would need to be characterized in order for the
model to be generalized to other systems.
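As a rough illustration of the conversion discussed here, a wall-clock time maps to a cycle count through the processor clock frequency. The function name is hypothetical, and the 100 MHz clock is a placeholder rather than the clock of the system measured in this work.

```python
def wall_clock_to_cycles(time_ms, clock_hz):
    """Convert a wall-clock time in milliseconds to a clock cycle count."""
    return round(time_ms * clock_hz / 1000)

# Hypothetical example: 3477 ms at an assumed 100 MHz clock.
cycles = wall_clock_to_cycles(3477, 100e6)
```

A cycle count of this kind removes the dependence on clock frequency only; differences in pipeline and memory latency between processors would still have to be characterized separately.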
5.3 Model Validation for Dual-Core Prediction Model
Similar to the single-core case, the results of the dual-core modeling validation process
are shown here.
5.3.1 Validation method
The validation for the dual-core case is identical to that of the single-core case. However,
moving from a single soft processor to a dual processor has a number of implications. In
terms of core power, this means a doubling in the number of LUTs (and other FPGA
elements) used for the soft processor which means an increase in the static and dynamic
power consumption. The distribution of measured core power values is shown in
Figure 5.14.









Figure 5.14: Dual core: Core power measurement distribution
For off-chip device power, the addition of another processor means that the I/O
and memory devices will be handling twice as much data as the single-core case. The
distribution of measured device power values is shown in Figure 5.15.






Figure 5.15: Dual core: Device power measurement distribution
Execution time is affected the most by the addition of another processor. Having a
second processor essentially reduces the workload of both processors by half, decreasing
execution time at the expense of area and power. While there are many configurations of
dual-core processors, only one configuration is being considered here. The characteristics
of this configuration are
• homogeneous processors;
• SIMD processing (each processes half of the image); and
• shared (non-overlapping) SDRAM.
As each processor is assigned half of the image to compress, it may be simple to say
that the execution time will be half that of the single-core case. However, this is not the
case for two, albeit competing, reasons.
The first is the shared SDRAM. Cache misses from both processors will be handled
by a single SDRAM controller leading to a potential increase in the penalty for a cache
miss compared to that of the single-core case.
The second issue is the complexity of algorithms. The execution time of the DCT, for
example, scales linearly with the number of blocks that need to be transformed. However,
WebP and vector quantization rely on distance calculations and comparisons with other
blocks in the image; these methods have polynomial complexity proportional to the
number of blocks in the image being compressed.
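The effect of algorithm complexity on the per-core workload can be sketched as follows, assuming the cost grows as the number of blocks raised to some degree. The function name is hypothetical, and the degree of 2 is an illustrative choice, since only polynomial growth is stated here. SDRAM contention is deliberately ignored.

```python
def per_core_work_ratio(degree, cores=2):
    """Work per core relative to the single-core case when cost grows as n**degree."""
    return (1.0 / cores) ** degree

# Linear cost (DCT-like): each core does half of the single-core work.
linear = per_core_work_ratio(1)
# Assumed quadratic cost (block comparisons): each core does a quarter.
quadratic = per_core_work_ratio(2)
```

Under these assumptions the two reasons pull in opposite directions: contention for the shared SDRAM pushes execution time above half of the single-core value, while super-linear complexity pushes it below half.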
The distribution of measured execution time values is shown in Figure 5.16.









Figure 5.16: Dual core: Execution time measurement distribution
Compared to the execution time distribution for single-core values (Figure 5.7), the
dual-core case has a smaller tail which contains only three different algorithms: fractal
compression, vector quantization, and WebP (method 4). As the tail is smaller, the
regression tree will be less well-defined for these regions which will ultimately result in
higher error for these methods. Fractal compression, like in the single-core case, contains
the longest execution times.
5.3.2 FPGA Core Power
The per-fold algorithm performance for the prediction of FPGA core power in the
dual-core setup is shown in Table 5.8. The R^2 values from the table range from 0.31 to
0.97.
Table 5.8: Dual core: Core power per algorithm fold performance
Algorithm Mean |A| Mean |R| R2
DCT 4x4+EBCOT 2.2 0.8% 0.97
DCT 8x8+EBCOT 2.5 0.9% 0.97
DCT Float+EBCOT 5.0 1.8% 0.90
DCT Fast+EBCOT 2.2 0.8% 0.97
DCT Float+Huffman 4.5 1.6% 0.87
DCT Fast+Huffman 2.9 1.1% 0.93
DCT 4x4+Huffman 2.9 1.1% 0.92
DCT 8x8+Huffman 2.9 1.1% 0.93
DCT 4x4+JXR 2.4 0.9% 0.96
DCT 8x8+JXR 2.7 1.0% 0.95
DCT Float+JXR 4.5 1.7% 0.88
DCT Fast+JXR 2.0 0.8% 0.96
DWT Irr.+EBCOT 7.7 2.8% 0.77
DWT Irr.+Huffman 6.6 2.4% 0.68
DWT Irr.+JXR 6.3 2.4% 0.76
DWT Rev.+EBCOT 8.3 3.2% 0.56
DWT Rev.+Huffman 7.5 2.9% 0.49
DWT Rev.+JXR 7.1 2.8% 0.61
Fractal 7.7 2.8% 0.79
PCT+EBCOT 5.4 2.0% 0.85
PCT+Huffman 4.2 1.6% 0.86
PCT+JXR 4.1 1.5% 0.89
Vector Quantization 12.0 4.7% 0.30
WebP method 1 5.6 2.1% 0.78
WebP method 4 5.1 1.9% 0.86
Mean 5.0 1.9%
The error histograms for the dual-core predictions are shown in Figure 5.17. Recall
that WebP method 1 was used instead of method 2 (as in the single-core case) due to
run-time issues.











Figure 5.17: Error histograms of test sets for dual-core power predictions for absolute
error (a) and relative error (b)
The CDFs of prediction error are shown in Figure 5.18. It shows that 80% of predictions
have error under 8 mW (2.98% relative error) and 90% of predictions are under 11.5 mW
(4.26% relative error).






















Figure 5.18: Empirical CDFs of test sets for dual-core power predictions for absolute
error (a) and relative error (b)
The top twenty variables in terms of importance are shown in Table 5.9.
Table 5.9: Dual core: Core power ensemble variable importance
Group         Variable name                ∆ MSE
Architecture  Data cache size              1.28 × 10^-6
Architecture  Inst. cache size             2.88 × 10^-7
Architecture  Multiplier type              2.00 × 10^-7
Block A       Precision                    1.63 × 10^-7
Architecture  Data cache line size         1.13 × 10^-7
Architecture  Divider type                 6.01 × 10^-8
Architecture  Data cache burst transfers   4.96 × 10^-8
Block B       Average working set size     4.06 × 10^-8
Block A       Multiplication ratio         4.04 × 10^-8
Block A       Arithmetic operations        4.02 × 10^-8
Total         Arithmetic operations        3.72 × 10^-8
Block A       Average working set size     2.93 × 10^-8
Total         Average working set size     2.40 × 10^-8
Total         Multiplication ratio         2.20 × 10^-8
Total         Memory operations            1.82 × 10^-8
Block B       Arithmetic operations        1.30 × 10^-8
Block A       Dependency set size          9.75 × 10^-9
Total         Total operations             9.06 × 10^-9
Block B       Multiplication ratio         6.97 × 10^-9
Block A       Variable working set size    6.44 × 10^-9
5.3.3 Off-Chip Device Power
The per-fold algorithm performance for the prediction of off-chip device power in the
dual-core setup is shown in Table 5.10. The R^2 values from the table range from 0.15
to 0.95.
The error histograms for the dual-core device power predictions are shown in
Figure 5.19.
The CDFs of prediction error are shown in Figure 5.20. It shows that 80% of predictions
have error under 27 mW (4.40% relative error) and 90% of predictions are under 40 mW
(6.52% relative error).
The top twenty variables in terms of importance are shown in Table 5.11.
Table 5.10: Dual core: Device power per algorithm fold performance
Algorithm Mean |A| Mean |R| R2
DCT 4x4+EBCOT 8.1 1.3% 0.94
DCT 8x8+EBCOT 8.3 1.3% 0.95
DCT Float+EBCOT 13.5 2.2% 0.95
DCT Fast+EBCOT 9.6 1.6% 0.93
DCT Float+Huffman 30.8 5.1% 0.46
DCT Fast+Huffman 23.3 4.1% 0.27
DCT 4x4+Huffman 13.0 2.3% 0.71
DCT 8x8+Huffman 10.2 1.8% 0.86
DCT 4x4+JXR 9.1 1.5% 0.90
DCT 8x8+JXR 8.5 1.4% 0.92
DCT Float+JXR 12.1 2.0% 0.93
DCT Fast+JXR 6.9 1.1% 0.94
DWT Irr.+EBCOT 20.5 3.1% 0.84
DWT Irr.+Huffman 30.0 4.9% 0.30
DWT Irr.+JXR 21.4 3.3% 0.75
DWT Rev.+EBCOT 10.8 1.8% 0.88
DWT Rev.+Huffman 20.2 3.6% 0.19
DWT Rev.+JXR 12.2 2.0% 0.75
Fractal 22.1 3.5% 0.83
PCT+EBCOT 11.3 1.8% 0.92
PCT+Huffman 14.8 2.5% 0.66
PCT+JXR 9.6 1.5% 0.90
Vector Quantization 39.9 6.9% 0.21
WebP method 1 23.1 3.9% 0.46
WebP method 4 33.9 5.3% 0.24
Mean 16.9 2.8%
Figure 5.19: Error histograms of test sets for dual-core device power predictions for
absolute error (a) and relative error (b)






















Figure 5.20: Empirical CDFs of test sets for dual-core device power predictions for
absolute error (a) and relative error (b)
Table 5.11: Dual core: Off-chip device power ensemble variable importance
Group         Variable name                ∆ MSE
Architecture  Inst. cache size             2.90 × 10^-6
Block B       Average working set size     3.29 × 10^-7
Block A       Precision                    2.58 × 10^-7
Architecture  Data cache size              2.33 × 10^-7
Total         Memory operations            2.07 × 10^-7
Architecture  Data cache line size         2.06 × 10^-7
Architecture  Multiplier type              1.60 × 10^-7
Architecture  Data cache burst transfers   1.56 × 10^-7
Block B       Arithmetic operations        1.45 × 10^-7
Block B       Total operations             1.43 × 10^-7
Block A       Multiplication ratio         1.02 × 10^-7
Block B       Memory operations            7.65 × 10^-8
Total         Total operations             7.44 × 10^-8
Total         Arithmetic operations        6.99 × 10^-8
Block A       Arithmetic operations        6.31 × 10^-8
Block A       Average working set size     5.01 × 10^-8
Total         Multiplication ratio         4.90 × 10^-8
Architecture  Divider type                 3.21 × 10^-8
Block A       Total operations             2.02 × 10^-8
Block B       Dependency function          1.30 × 10^-8
5.3.4 Execution Time
This section shows the accuracy of the constructed model for the prediction of execution
times on a dual Nios II system. The per-fold algorithm performance for the prediction
of execution time in the dual-core setup is shown in Table 5.12.
Table 5.12: Dual core: Execution time per algorithm fold performance
Algorithm Mean |A| Mean |R| R2
DCT 4x4+EBCOT 25 0.8% 1.00
DCT 8x8+EBCOT 36 1.1% 1.00
DCT Float+EBCOT 1193 24.6% 0.60
DCT Fast+EBCOT 616 17.9% 0.77
DCT Float+Huffman 1208 76.1% −2.16
DCT Fast+Huffman 25 9.0% 0.73
DCT 4x4+Huffman 23 6.9% 0.77
DCT 8x8+Huffman 38 9.8% 0.60
DCT 4x4+JXR 24 2.9% 0.98
DCT 8x8+JXR 36 4.1% 0.97
DCT Float+JXR 471 26.5% 0.70
DCT Fast+JXR 27 3.8% 0.98
DWT Irr.+EBCOT 858 19.7% 0.74
DWT Irr.+Huffman 126 11.6% 0.93
DWT Irr.+JXR 676 38.9% −0.08
DWT Rev.+EBCOT 255 5.8% 0.93
DWT Rev.+Huffman 35 10.2% 0.58
DWT Rev.+JXR 369 42.7% −2.39
Fractal 17 437 81.3% −2.23
PCT+EBCOT 35 1.1% 1.00
PCT+Huffman 35 11.8% 0.33
PCT+JXR 31 4.0% 0.97
Vector Quantization 4522 34.8% 0.25
WebP method 1 6852 169.6% −38.54
WebP method 4 4353 30.4% 0.25
Mean 1572 25.8%
The relative and absolute error histograms for the dual-core execution time predictions
are shown in Figure 5.21.
The CDF of both absolute and relative error is shown in Figure 5.22. It shows that
80% of the predictions have error under 1129 ms (38% relative error) and 90% of the
Figure 5.21: Error histograms of test sets for dual-core execution time predictions for
absolute error (a) and relative error (b)
predictions under 4527 ms (78% relative error). By not including fractal compression in
these statistics, the 80% and 90% marks fall to 961 ms (32% relative error) and 2572 ms
(58% relative error), respectively.
























Figure 5.22: Empirical CDFs of test sets for dual-core execution time predictions for
absolute error (a) and relative error (b)
The top twenty variables in terms of importance are shown in Table 5.13.
Table 5.13: Dual core: Execution time ensemble variable importance
Group         Variable name                ∆ MSE
Block B       Arithmetic operations        3.50 × 10^3
Architecture  Inst. cache size             6.01 × 10^2
Architecture  Multiplier type              3.90 × 10^2
Block A       Precision                    1.71 × 10^2
Block A       Arithmetic operations        1.62 × 10^2
Block B       Multiplication ratio         9.91 × 10^1
Total         Memory operations            5.98 × 10^1
Total         Arithmetic operations        4.63 × 10^1
Architecture  Data cache size              2.84 × 10^1
Architecture  Data cache line size         2.47 × 10^1
Total         Total operations             2.05 × 10^1
Block A       Average working set size     7.73 × 10^0
Block B       Average working set size     6.55 × 10^0
Block A       Multiplication ratio         4.26 × 10^0
Architecture  Divider type                 3.50 × 10^0
Block A       Memory operations            2.14 × 10^0
Block A       Variable working set size    1.69 × 10^0
Block A       Dependency set size          1.46 × 10^0
Block A       Total operations             1.23 × 10^0
Total         Multiplication ratio         1.19 × 10^0
5.4 Sensitivity Analysis
Up until this point, test vectors consisted of algorithm parameters that were estimated
“perfectly.” The algorithm parameters that were used to construct and validate the
models were not estimated but were extracted using profiling software. However, when
a designer uses this tool, these parameters must be estimated.
This approach shifts the burden of accuracy from automated analysis, as in other
work, to the designer. While this allows for earlier estimation of performance, one of the
problems with this approach is that the designer is human and is therefore subject to
human error.
The designer will estimate type II parameters (described in Section 4.2) of their desired
algorithm by comparing it against those algorithms used to generate the model. However,
the designer will not always get this estimation correct which can affect the accuracy of
the predictions. This additional element of human error is one of the consequences of
a high-level model. This section describes the analysis of the sensitivity of the model
to variations in the estimated type II algorithm parameters. First, the method used to
perform this analysis will be discussed. After this, the sensitivity of the models of each
performance metric will be examined. The goal of this work is to evaluate the robustness
of the model with respect to human-induced error in the estimation process. Ideally, the
models will be insensitive to this human error.
5.4.1 Method of Analysis
Regression trees inherently store information on the descriptiveness of predictor variables
in their structure; it is easy to deduce which predictor variables contribute more to the
decrease in error for the tree. Predictor variables which contribute more to the decrease
in error are more important than those which contribute little. It can be hypothesized
that misestimation of more important predictor variables will have much more of an
effect than misestimating predictor variables of little importance. It will be shown,
however, that this is not the case.
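One common way a regression tree records the contribution of a predictor variable is the drop in mean squared error produced by each split on that variable. The sketch below computes that drop for a single split on toy data. It illustrates the idea only and may differ from how the ∆ MSE values in this chapter were obtained; the function names and data are hypothetical.

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_mse_decrease(x, y, threshold):
    """Decrease in MSE when the samples are split at x <= threshold."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(y)
    weighted = (len(left) / n) * mse(left) + (len(right) / n) * mse(right)
    return mse(y) - weighted

# Toy data: the response changes exactly at x = 2.
x = [1, 2, 3, 4]
y = [10.0, 10.0, 20.0, 20.0]
best = split_mse_decrease(x, y, 2)    # removes all of the error
worse = split_mse_decrease(x, y, 1)   # removes only part of it
```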
Table 5.14 summarizes the variable importance taken from Table 5.3, Table 5.5, and
Table 5.7. This shows the variable importance of three different models for the top
twenty most important variables.
From the tables, it can be seen that all performance metrics rely on type II parameters.
For execution time, type II parameters comprise four of the top ten most important
variables. It is reasonable to hypothesize that misestimation of type II parameters can
cause a significant amount of error.
Table 5.14: Ensemble variable importance: Single core models
Rank  Core power                   Device power                 Execution time
1     Data cache size              Inst. cache size             Arithmetic operations
2     Inst. cache size             Data cache size              Multiplier type
3     Data cache line size         Inst. cache burst transfers  Inst. cache size
4     Multiplier type              Data cache line size         Arithmetic operations
5     Divider type                 Total operations             Precision
6     Inst. cache burst transfers  Data cache burst transfers   Data cache size
7     Average working set size     Multiplier type              Dependency set size
8     Multiplication ratio         Arithmetic operations        Multiplication ratio
9     Arithmetic operations        Precision                    Variable working set size
10    Arithmetic operations        Memory operations            Memory operations
11    Data cache burst transfers   Arithmetic operations        Data cache line size
12    Average working set size     Multiplication ratio         Total operations
13    Multiplication ratio         Average working set size     Average working set size
14    Precision                    Divider type                 Arithmetic operations
15    Arithmetic operations        Arithmetic operations        Memory operations
16    Memory operations            Average working set size     Divider type
17    Memory operations            Total operations             Total operations
18    Average working set size     Multiplication ratio         Multiplication ratio
19    Total operations             Memory operations            Average working set size
20    Multiplication ratio         Average working set size     Inst. cache burst transfers
In order to perform this sensitivity analysis, assumptions on human error in estimation
must be made. For this work, human error is considered to be normally distributed with
a mean of zero. The normal distribution of parameter values is described as
N (µ, σ).
The range of values for any given estimated value will be normally distributed with
the mean being the “true” value µ as provided by profiling tools.
The standard deviation σ of the human error distribution is taken here to be
proportional to the distance between the values of the most similar algorithms and the
true value of the parameter. As an example, consider Figure 5.23 for estimating the
parameter number of arithmetic operations for JPEG XR coding. This is made
Figure 5.23: Example parameter estimation for Block B arithmetic operations
Figure 5.24: Adjacent values for Block B arithmetic operations
In Figure 5.24, values a and b are the distances between the parameter being examined
and the two adjacent values.
The standard deviation σ is related to a and b by
σ ∝ min(a, b).
By taking µ ± min(a, b) as the bounds of the 95% confidence interval for the
distribution, the actual distribution can be calculated. This distribution is shown in
Figure 5.25.
Figure 5.25: Range of values for JXR coding sensitivity analysis
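The construction of σ from the adjacent values can be sketched as follows. This assumes that min(a, b) is read as the half-width of the central 95% interval of a normal distribution, so that σ = min(a, b) / 1.96; the function name and the example values are hypothetical.

```python
def misestimation_sigma(true_value, known_values, z95=1.96):
    """Sigma such that true_value +/- min(a, b) spans the central 95%."""
    lower = [v for v in known_values if v < true_value]
    upper = [v for v in known_values if v > true_value]
    gaps = []
    if lower:
        gaps.append(true_value - max(lower))  # distance a to the value below
    if upper:
        gaps.append(min(upper) - true_value)  # distance b to the value above
    if not gaps:
        raise ValueError("no adjacent values to compare against")
    return min(gaps) / z95

# Hypothetical parameter values of the other algorithms (illustrative only).
sigma = misestimation_sigma(100.0, [40.0, 130.0, 300.0])  # a = 60, b = 30
```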
A common technique for sampling a known distribution many times is called the
Monte Carlo method. The input distribution is sampled many times and the output of
the model is examined. In this case, the type II estimated parameters will be sampled
to see their effect on the output. As it is difficult to know exactly which of the type II
parameters will be erroneously estimated, every type II parameter has a probability
equal to 0.5 of being misestimated for each Monte Carlo iteration. There are five type II
parameters for each block in addition to the total of the type II parameters, resulting in
15 different parameters changing for each Monte Carlo iteration though only the 10 block
parameters are actually sampled (the other 5 are a sum of the sampled parameters).
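A single Monte Carlo iteration of this scheme might look like the sketch below. Each block parameter is replaced by a normal sample with probability 0.5, and the totals are recomputed as sums of the two blocks. The parameter names, values, and standard deviations are hypothetical, and both blocks are assumed to share the same parameter names.

```python
import random

def perturb_once(block_a, block_b, sigma_a, sigma_b, rng, p_misestimate=0.5):
    """One Monte Carlo draw of the block parameters and their totals."""
    def draw(true_values, sigmas):
        sampled = {}
        for name, mu in true_values.items():
            if rng.random() < p_misestimate:
                # Misestimated: sample around the true value.
                sampled[name] = rng.gauss(mu, sigmas[name])
            else:
                sampled[name] = mu
        return sampled

    a = draw(block_a, sigma_a)
    b = draw(block_b, sigma_b)
    # Totals are not sampled; they follow from the sampled block values.
    total = {name: a[name] + b[name] for name in a}
    return a, b, total

# Hypothetical parameters for illustration only.
rng = random.Random(42)
true_a = {"arithmetic_ops": 1200.0, "memory_ops": 800.0}
true_b = {"arithmetic_ops": 5000.0, "memory_ops": 2600.0}
sig_a = {"arithmetic_ops": 60.0, "memory_ops": 40.0}
sig_b = {"arithmetic_ops": 250.0, "memory_ops": 130.0}
samples = [perturb_once(true_a, true_b, sig_a, sig_b, rng) for _ in range(5000)]
```

Each sampled vector would then be combined with the architecture parameters of a test vector and passed to the model. A practical implementation may also need to reject or clamp negative samples, which this sketch does not do.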
This analysis is done using each cross-validation fold as discussed in Section 5.2. For
every fold, the distribution of all type II inputs is determined using the method shown
in Figure 5.25. Then for each fold, the Monte Carlo method is used for every test vector
for 5,000 iterations each. That is, for every combination of architecture parameters
(3,288 in total), there will be 5,000 different samples. This is to ensure that sufficient
coverage is given to seeing the effect of misestimation and its relationship to architecture
parameters.
For each fold, sensitivity analysis is performed for each performance metric. In
sensitivity analysis, the concept of interest is the change from the originally predicted
value. This will be given as both the absolute ∆A and relative change ∆R of the
misestimated predictions pM versus the predictions for the true values of the parameters
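p. For a Monte Carlo iteration i, ∆A and ∆R are defined as
\[
\Delta A_i = \left| p_{M,i} - p \right|, \qquad
\Delta R_i = \frac{\left| p_{M,i} - p \right|}{p}.
\]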
The histograms below will show an aggregation of all values for ∆A and ∆R over all
folds, test vectors, and iterations.
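The two change measures can be computed per iteration as in the sketch below, assuming both are taken as magnitudes relative to the prediction for the true parameter values. The function name and the sample predictions are hypothetical.

```python
def prediction_changes(true_prediction, misestimated_predictions):
    """Absolute and relative change of each Monte Carlo prediction."""
    abs_changes = [abs(p_m - true_prediction) for p_m in misestimated_predictions]
    rel_changes = [d / abs(true_prediction) for d in abs_changes]
    return abs_changes, rel_changes

# Hypothetical core power predictions in mW for one test vector.
abs_c, rel_c = prediction_changes(250.0, [250.0, 252.5, 245.0])
```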
5.4.2 FPGA Core Power
The results of the sensitivity analysis of the core power model to type II parameters are
shown in Figure 5.26 and Figure 5.27. The figures show the absolute and relative change
of each Monte Carlo iteration versus the prediction obtained when using the true values
of the parameters.
The figures show that core power predictions are insensitive to parameter misestimation.
This is partly due to the fact that, as Table 5.14 shows, the most important variables in
predicting core power are architecture parameters. The distribution of measured values
may also contribute to the robustness of this model to misestimations.
Figure 5.26: Core power: Change in predictions for misestimated parameters



















Figure 5.27: Core power: Relative change in predictions for misestimated parameters
The distribution of the values shown in Figure 5.26 and Figure 5.27 is described in
Table 5.15; it shows the 50th, 75th, 90th, and 98th percentiles for both the absolute
and relative changes. In the table, any numbers showing “0.0” are below the accuracy
of measurements. As the predictions are an average of all trees, two similar vectors
may differ in only a few trees of the ensemble, resulting in a sub-resolution
difference. The table shows that 98% of the misestimation cases resulted in deviations
from the true predictions of at most 3.3 mW or 1.6%, with a mean of 0.4 mW. This level of
error is not an issue for core power predictions, and as a result the core power model is
insensitive to misestimation of type II parameters.




Table 5.15: Core power: Change in predictions (percentiles and mean)
               50th   75th   90th   98th   Mean
Absolute (mW)  0.2    0.5    1.3    3.3    0.4
Relative (%)   0.1    0.2    0.6    1.6    0.0
5.4.3 Off-Chip Device Power
The results of the sensitivity analysis of the device power model to type II parameters
are shown in Figure 5.28 and Figure 5.29. The figures show the absolute and relative
change of each Monte Carlo iteration versus the prediction obtained when using the true
values of the parameters.
The results of the device power sensitivity are similar to those of the core power,
showing that the model is insensitive to type II parameter misestimations. Table 5.16
shows the percentiles for the change distributions. The 98th percentile increases
from the core power case to 9.4 mW or 1.7% error, with a mean of 1.1 mW. Misestimation
at this level changes the results little from their true values, showing that the
device power model is insensitive to misestimation of type II parameters.
Figure 5.28: Device power: Change in predictions for misestimated parameters



















Figure 5.29: Device power: Relative change in predictions for misestimated parameters
Table 5.16: Device power: Change in predictions (percentiles and mean)
               50th   75th   90th   98th   Mean
Absolute (mW)  0.0    0.9    3.6    9.4    1.1
Relative (%)   0.0    0.2    0.7    1.7    0.0
5.4.4 Execution Time
The results of the sensitivity analysis of the execution time model to type II parameters
are shown in Figure 5.30 and Figure 5.31. The figures show the absolute and relative
change of each Monte Carlo iteration versus the prediction obtained when using the true
values of the parameters.





















Figure 5.30: Execution time: Change in predictions for misestimated parameters
This sensitivity analysis for execution time shows that, in general, the model is robust
to misestimation of parameters as caused by the human input. Analysis shows that 75%
of the cases resulted in less than 184 ms change with a mean of 1280 ms. There are
a number of misestimations that result in a large amount of deviation from the true
Figure 5.31: Execution time: Relative change in predictions for misestimated parameters
prediction, with a maximum change of 110 × 10^3 ms. However, these larger changes
in predictions would be the result of a large amount of misestimation of a number of
key parameters which, in practice, should not happen. In fact, these large changes come
from all algorithms whose “B” block is the EBCOT method. Algorithms which use the
EBCOT method have a change in prediction at least three times greater than that of
methods that do not use EBCOT. This is likely due to the adjacent parameter values
used when calculating the normal distribution for input values. The values of a and
b in calculating the normal distribution are relatively large for EBCOT for two key
parameters: memory operations and total operations. These large values of a and b
result in a wider distribution which will ultimately lead to a larger increase in error
compared to other methods.
This problem is further compounded by the issue that the model was constructed
using the combination of blocks. The execution time of EBCOT is much longer than that
of Huffman coding and the JPEG XR coding method; if the estimated parameters for
EBCOT are underestimating the number of operations (arithmetic, memory, or both),
the model starts interpreting these lower numbers as the much faster executing Huffman
or JPEG XR methods. In practice, this should not happen as the parameter differences
between EBCOT and Huffman/JPEG XR are distinct enough that misestimating of
these parameters to this degree would represent a large failing on the part of the designer.




Table 5.17: Execution time: Change in predictions (percentiles and mean)
               50th   75th    90th     98th     Mean
Absolute (ms)  0.0    184.1   1182.4   8851.9   1279.5
Relative (%)   0.0    9.1     84.8     531.1    0.4
5.5 Model Benchmarking
In Section 5.2 and Section 5.3, the accuracy of the three models was examined. While
this is useful for understanding the overall accuracy of the models, it is important
to further examine the predictions produced by the models. The predictions will be
examined to ensure that they accurately reflect the reality of the algorithms. That is,
if an implementation of an algorithm runs faster than an implementation of another
algorithm then this should be reflected throughout the predictions.
To see how well the model maintains appropriate algorithm ranking, rank correlation
analysis is used. For each combination of architecture parameters, all the values for
the algorithms under investigation are sorted using both predicted and truth values.
From here, the indices of the sorted lists are compared using Spearman’s rho (ρ)
for rank correlation. This coefficient gives a measure of correlation between the truth
and predicted ranked lists. The value lies between -1 (negatively correlated) and +1
(positively correlated).
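Spearman’s ρ for ranks xi and yi is defined as
\[
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n \left( n^2 - 1 \right)},
\]
where di = xi − yi and n is the number of ranked values.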
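The rank comparison can be sketched in a few lines using the standard rank-difference form of Spearman’s ρ, which assumes there are no tied values. The function names and the sample execution times are hypothetical.

```python
def ranks(values):
    """Rank of each value (1 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(truth, predicted):
    """Spearman's rho from the squared rank differences."""
    n = len(truth)
    x, y = ranks(truth), ranks(predicted)
    d_squared = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Hypothetical execution times in ms for four algorithms on one architecture.
truth = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 310.0, 290.0, 405.0]  # the middle two are swapped
rho = spearman_rho(truth, predicted)
```

A single swap of adjacent ranks among four algorithms already lowers ρ from 1.0 to 0.8, which gives a sense of scale for the mean coefficients reported in this section.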
The following section describes how the model performs when predicting the
performance of the DCT and its variants. This benchmarking shows that the model is
providing predictions that are realistic and contain low amounts of error.
5.5.1 Algorithm Optimizations and the DCT
Most algorithms will undergo an optimization process from conception to implementation.
The end result of this optimization is that the algorithm should execute in less time or
consume fewer resources. Such optimizations typically involve the move from
floating-point arithmetic to fixed-point arithmetic or changes to the actual algorithm itself to
produce an approximation of the original algorithm. To maintain a useful model, the
model should be able to predict the correct ordering of unoptimized algorithms to their
optimized versions.
To show this, the predictions of all the DCT variants will be examined. There are
three versions of the DCT available in libjpeg [43]: floating point, “slow” integer, and
fast integer approximation. The floating-point transform offers the highest accuracy
with the slowest execution time while the fast integer approximation is the fastest to
execute at the expense of accuracy. The “slow” integer transform lies between the two.
In this analysis, all algorithms that use any of the DCT variants will be included.
The graphs in this section show the distribution of the ρ values for the rank correlation
computation aggregated over all folds and all test vectors. This analysis uses the
same approach as described in the Results section (Section 5.2); for each architecture
combination, performance predictions from every fold that uses a DCT variant are
ranked against each other. The Spearman coefficient is calculated for this ranked list
compared to the same ranked list but using truth values instead of predictions.
The algorithms of interest are shown in Table 5.18.
Table 5.18: DCT algorithms
Algorithms using DCT variants
DCT 8x8+EBCOT DCT 8x8+JXR
DCT Float+EBCOT DCT Float+JXR




The distribution of Spearman’s ρ coefficients for execution time prediction is shown in
Figure 5.32. For this distribution, the mean Spearman coefficient is 0.73. This indicates
that the model maintains a strong correlation between the ranked predictions and truth
Figure 5.32: Single-core DCTs: Execution time Spearman’s rho value distribution
values for this set of optimized and unoptimized DCTs. Simply put, this approach is
able to accurately predict the changes in execution time across many architectures for a
particular transform and its variants.
Similarly, the results from the rank correlation for core and device power are shown in
Figures 5.33 and 5.34, respectively.
The distributions for both types of power show that power consumption is much less
dependent on the optimizations of the algorithm. Instead, the structure of the
algorithm has a much larger effect on power consumption; power consumption changes
little when the number and types of computations change, as happens during
optimization.
Ultimately, these results show that the model is capable of correctly ranking the
execution time of the variants of the DCT. For designers, this means that they can be
confident in algorithm space exploration if considering the execution time differences in
the optimizations of a single algorithm. On the other hand, the power consumption of a
block does not change much between types of optimizations.

Figure 5.33: Single-core DCTs: Core power Spearman’s rho value distribution

Figure 5.34: Single-core DCTs: Device power Spearman’s rho value distribution
5.6 Conclusion
This chapter has presented the parameter extraction and model construction stages of a
high-level framework for performance estimation.
Using the parameters extracted with the method described in the previous chapter,
an accurate model was constructed for FPGA core power consumption, off-chip device
power consumption, and execution time. The single processor system showed high levels
of accuracy for both core and device power models, with a mean error of 1.3% (2.5 mW)
and 1.86% (8.5 mW), respectively. The execution time model showed higher
error, with a mean error of 18.3% (3477 ms). Models for the dual processor
system showed similar trends: the core and device power models have a mean error of
1.9% (5.0 mW) and 2.8% (16.9 mW), respectively, while the execution time model has a
mean error of 25.8% (1572 ms).
At the moment, the model is only used to predict the power and performance of a single
or dual soft-processor system using SDRAM. However, by characterizing differences in
implementations and power consumption, the model can become more general and allow
for richer design space exploration among not only architecture parameters but different
architectures as well.
Next, a sensitivity analysis was performed to show the sensitivity of each model to
the effects of human error in the estimation process. It showed that the core and device
power models were insensitive to the misestimation of type II parameters. The
execution time model was more sensitive, but only in cases where many of the
parameters have been misestimated by a large amount.
The chapter concluded with an analysis of the relative predictive performance of the
execution time model when considering variants of the DCT. This section showed how
the models maintained the relative positioning of predictions when considering algorithmic,
rather than structural, changes to a building block.
6 A Framework for Design Guidance using a High-Level Model
This chapter examines how the framework can be used to provide early design guidance
to designers using the previously examined high-level model. First, it is shown how
the framework performs design space exploration (DSE) of the architecture parameters.
Then, node statistics and test vector characteristics are used to provide a measure of
prediction confidence. Finally, how the framework predicts the effects of Single Event
Upsets (SEUs) is examined.
6.1 Architecture Design Space Exploration
This section describes the method by which the framework uses its predictions in order
to perform optimization and ultimately design space exploration.
To perform design space exploration, the user inputs the parameters of their
algorithm into the framework along with the architecture parameter of interest.
From here, the framework uses its predictions to show how different values of the
parameter of interest affect execution time and power consumption.
With early design guidance, hardware can be made more efficient through better
allocation of resources to components (e.g. multipliers or caches).
The following sections show how the framework can be used to optimize for the two
main resources available on an FPGA: memory blocks and logic elements.
6.1.1 Cache Size Optimization
Consider the situation of a designer looking to use a soft processor for an image
compression application. As their soft processor is part of a larger system, resource
usage is an important consideration. It will be shown via examples how this framework
can be used to select the ideal sizes of the caches.
As the parameters of their algorithm are known, they are input into the model as well
as the fact that cache sizes are the design space parameter of interest. The issue here is
not directly cache sizes, but rather the resources used by the various sizes of cache. In
the FPGA core, both caches use the M9K memory blocks of which there is a limited
amount, especially in the economy versions of FPGAs.
For now, the trade-offs considered are both types of power, core and off-chip device,
and execution time. Total power will be considered here which is a sum of both core
and off-chip device power.
The framework calculates the predictions for each of the metrics for every combination
of architecture parameters along with the given algorithm parameters. From here, two
types of multi-objective analysis can be performed. First, the most efficient architectures
in terms of execution time versus power are found and presented to the designer along
with the corresponding resource usage. Second, optimization can be done in order to
determine the most efficient configuration of cache sizes for either execution time or
power.
The first type of optimization finds the most efficient architectures when
considering power and execution time. Once these are found, the resource usage of these
architectures is presented to the designer. Non-optimal design points are also given to
the designer so that a more complete picture of the design space is obtained. An example is
shown in Figure 6.1. In addition to the design points, the user is able to hover over a
data point and see the associated architecture parameters and resource usage for that
point, as seen in the figure.

















Figure 6.1: Optimal architectures: Total power versus execution time by M9K block usage
For this example algorithm, Figure 6.1 shows the optimal architectures along
with their memory block usage. The example shown here is the set of predictions for the
reversible DWT and EBCOT combination. Examining the architectures along the Pareto
front and those nearby shows that the low execution time, high power architectures
consist of larger cache sizes (8 KB and above) using embedded multipliers. An important
use of this graph is to compare architectures of similar performance but different resource
usage. If two architectures have similar performance but one uses fewer memory blocks, it
is easy for the designer to see this and make the appropriate design decision.
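The selection of architectures along the Pareto front can be sketched as follows. This is an illustrative fragment, not the framework's implementation: it assumes both objectives (execution time and total power) are minimized, and the design points and labels are hypothetical.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is an (execution_time_ms, total_power_mw, label) tuple and
    both objectives are minimized. A point is dominated if another point is
    no worse in both objectives and strictly better in at least one.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical predictions for five architectures.
designs = [
    (900.0, 410.0, "A"),
    (700.0, 430.0, "B"),
    (650.0, 470.0, "C"),
    (800.0, 450.0, "D"),  # dominated by B
    (700.0, 440.0, "E"),  # dominated by B
]
front = pareto_front(designs)
```

The quadratic scan is adequate for a design space of a few thousand architecture combinations; the dominated points can still be plotted alongside the front, as the text describes.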
The second type of optimization takes into account the direct effect of cache sizes.
Optimizing for memory block usage allows the designer to select a cache size configuration
that conforms to either a power or execution time budget. Figure 6.2 shows M9K usage
versus core power for an example algorithm.



















Figure 6.2: Predicted optimums: Core power versus M9K block usage
For each value of memory blocks used, the architectures which have the lowest core
power consumption are presented to the user. This allows the designer to see how
the core power is affected by various combinations of cache sizes and to choose the
appropriate size. As in the previous approach, the user is able to see the specific
architecture parameters for each value of memory block usage. For each usage value,
fifteen optimal architectures are shown.
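A minimal sketch of this per-usage selection is given below. It groups predictions by the number of M9K blocks and keeps the lowest-power entries in each group; the tuple layout, function name, and sample values are assumptions made for illustration.

```python
from collections import defaultdict


def best_per_block_count(predictions, k=15):
    """Group by M9K usage and keep the k lowest-power entries per group.

    predictions: iterable of (m9k_blocks, core_power_mw, label) tuples.
    Returns a dict mapping m9k_blocks -> list of (core_power_mw, label),
    sorted by ascending power.
    """
    groups = defaultdict(list)
    for blocks, power, label in predictions:
        groups[blocks].append((power, label))
    return {blocks: sorted(entries)[:k] for blocks, entries in sorted(groups.items())}


# Hypothetical core power predictions (mW).
sample = [(8, 190.0, "a"), (8, 185.0, "b"), (16, 200.0, "c"), (8, 188.0, "d")]
best = best_per_block_count(sample, k=2)
```

The same routine applies to device power, total power, or execution time by swapping the value stored in the second tuple position.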
Figure 6.2 shows the directly proportional relationship between core power and the
number of memory blocks used. This increase in power is also due to the increase in
logic elements that are needed to support the additional memory blocks.
In the same way, device power versus memory block usage is shown in Figure 6.3.
This figure shows a more complicated relationship between device power and memory
usage. Small cache sizes increase device power consumption due to higher miss rates.





















Figure 6.3: Predicted optimums: Device power versus M9K block usage
Figure 6.4 shows the total power consumption versus memory block usage. No new
predictions are shown here (the figure just shows the sum of Figure 6.2 and Figure 6.3)
but it does allow designers to find efficient memory block usage values for their design.




















Figure 6.4: Predicted optimums: Total power versus M9K block usage
Finally, Figure 6.5 shows execution time versus memory block usage. The lowest
execution times, as expected, belong to the largest caches. More importantly, this figure
shows the diminishing returns of increased cache sizes; once 32 memory blocks are used
(corresponding to both caches being 8 KB in size), the resulting decrease in execution
time from larger caches becomes smaller. Knowing this, the designer can use the
memory blocks for other components in the hypothetical system without sacrificing
performance.
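One way to make the diminishing returns explicit is to compute the execution time saved per additional memory block between consecutive usage values. The fragment below is a sketch under that assumption; the curve values are hypothetical and are not read from Figure 6.5.

```python
def marginal_gains(curve):
    """Return (blocks, ms saved per extra block) between consecutive points.

    curve: list of (m9k_blocks, best_execution_time_ms), sorted by blocks.
    """
    gains = []
    for (b0, t0), (b1, t1) in zip(curve, curve[1:]):
        gains.append((b1, (t0 - t1) / (b1 - b0)))
    return gains


# Hypothetical best predicted execution time for each memory block count.
curve = [(8, 1000.0), (16, 700.0), (32, 500.0), (64, 480.0)]
gains = marginal_gains(curve)
```

A sharp drop in the per-block saving, as between the last two entries here, marks the point beyond which additional memory blocks are better spent elsewhere in the system.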


















Figure 6.5: Predicted optimums: Execution time versus M9K block usage
6.1.2 Logic Element Optimization
This section is similar to the previous section and considers the optimization of logic
element (LE) usage instead of memory block usage; it is meant to show the options
available to designers while using this framework. LE usage differs greatly from memory
block usage because all architecture parameters affect LE usage, whereas
memory block usage is only affected by cache sizes (and to a lesser extent other cache
parameters).
The structure of this section is identical to the previous, showing the most optimal
architectures in terms of power and execution time followed by the optimization of
individual performance metrics with respect to LE usage. Here, the actual values of
LE usage were quantized into seven evenly spaced bins to allow for the data to be
represented clearly.
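The quantization step can be sketched as below, assuming equal-width bins spanning the minimum and maximum observed LE usage. The function name and the sample LE counts are illustrative; the exact bin edges used for the figures are not stated in the text.

```python
def quantize(values, n_bins=7):
    """Map each value to a bin index in [0, n_bins - 1] using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical LE usage values for six architectures.
le_usage = [1000, 1999, 2000, 4500, 7999, 8000]
bins = quantize(le_usage)
```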
Figure 6.6 shows the most optimal architectures for the algorithm along with the LE
usage of each architecture.

















[Example data-point tooltip: I Cache size: 8192; D Cache size: 8192; D Cache line size: 16; I Cache burst: 1]
Figure 6.6: Optimal architectures: Total power versus execution time by LE usage
Figure 6.7 shows that, as in the memory block case, core power consumption
increases in proportion to LE usage. This is due to
increased static power consumption as well as increased power consumption from
additional interconnect.
Device power and LE usage are, conversely, not directly linked. Figure 6.8 shows a
slight downward trend, but ultimately the device power is largely independent of LE usage.
As in the memory block case, the effects of LE usage on total power and execution
time are shown in Figure 6.9 and Figure 6.10, respectively.




















Figure 6.7: Optimal architectures: Core power versus LE usage



















Figure 6.8: Optimal architectures: Device power versus LE usage

















Figure 6.9: Optimal architectures: Total power versus LE usage




















Figure 6.10: Optimal architectures: Execution time versus LE usage
6.2 Providing Prediction Confidence
For any given test vector, there will be uncertainty in the prediction. This can be due
to error in the estimation of a parameter or possibly due to the ensemble itself. This
section explores how characteristics of the training data can be used to give the user of
this tool an idea of how confident the ensemble is about its prediction.
First, different metrics for providing prediction confidence are defined. This is followed
by a discussion on how these metrics can be used to provide the user with more informative
output and the degree of confidence the ensemble has in the prediction.
6.2.1 Measures for Prediction Confidence
Node Statistics
Each node in a regression tree contains information on the data set associated with it.
This information is the node mean $t_M$, node error $t_E$ (absolute deviation from the mean
of the response variable in that node), and node size $t_S$ of each node generated during
training. The node probability $t_P$ is the fraction of training vectors that fall on the node,
$$t_P = \frac{t_S}{N},$$
where $N$ is the total number of training vectors. Further, the node risk $t_R$ is defined
as the product of the node probability and node error
$$t_R = t_P \times t_E.$$
Though these statistics are gathered from the training data, error, probability, and
risk show the characteristics of training instances associated with a given node. Further,
it gives an indication as to the confidence of any future instances that happen to fall on
that node.
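The node statistics can be illustrated with a short sketch. Two assumptions are made here and should be checked against the actual tree implementation: the node error is taken as the mean absolute deviation from the node mean, and the node probability is taken as the node size divided by the total number of training vectors. The function name and sample values are hypothetical.

```python
def node_statistics(responses, n_training):
    """Compute (mean, error, size, probability, risk) for one tree node.

    responses: response values of the training vectors that reached the node.
    n_training: total number of training vectors N.
    """
    size = len(responses)
    mean = sum(responses) / size
    # Assumption: node error is the mean absolute deviation from the node mean.
    error = sum(abs(r - mean) for r in responses) / size
    # Assumption: node probability is the share of training vectors in the node.
    probability = size / n_training
    risk = probability * error
    return mean, error, size, probability, risk


# Four hypothetical execution times (ms) in a leaf, out of 100 training vectors.
stats = node_statistics([10.0, 20.0, 30.0, 40.0], 100)
```

A leaf holding few, widely spread responses has low probability and high error, which is the situation the confidence measures below are meant to flag.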
For an input vector, the ensemble will produce a response value for each tree within
the ensemble. For each tree, there is an associated node that the response came from.
For each of these nodes, there is an associated error, probability, and risk value. These
values give an indication as to the confidence of an individual tree’s prediction and by
extension to the confidence of the entire ensemble.
Outlier Measure
By using multiple regression trees, a test vector can be characterized against the training
vectors to determine the similarity between the training set and the test vector. If
similar, it will be shown that the probability of having higher error in the prediction is
lower than if the two are dissimilar. Further, the metric of interest here is how dissimilar
a test vector is from the training vectors. To measure this, the idea of proximity and
the outlier measure is used, which was first introduced by Breiman [8].
The proximity measure $\alpha$ of two vectors $v_1$ and $v_2$ is defined as the proportion of
trees in the ensemble in which their predictions fall on the same node. To calculate
the proximity measure $\alpha_X$ of the set of training data $X$, the proximity measure must
be calculated between every pair of training vectors. Therefore $\alpha_X$ will be a square,
symmetric matrix with as many rows as training vectors. This $\alpha_X$ will then be used to
calculate the outlier measure for any vector against the training set.









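Here, $\mu_i^2$ is the average squared proximity of the vector $v_i$ against all other vectors in
$X$. $\tilde{\alpha}_X$ is the median of all proximity values in $\alpha_X$, and $\operatorname{mad}(\alpha_X)$ is the median
absolute deviation of all values in $\alpha_X$ from this median, defined as
$$\operatorname{mad}(\alpha_X) = \operatorname{median}\left(\left|\alpha_X - \tilde{\alpha}_X\right|\right).$$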
Typically, this outlier measure is used in classification to determine outliers within a
class [8]. However, it will be seen that it can be used in regression as an indication of
confidence.
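The proximity and outlier computations can be sketched as follows. This fragment follows the commonly used random-forest formulation, in which the raw score of a vector is the number of vectors divided by its summed squared proximity, and the raw scores are then centred on their median and scaled by their median absolute deviation. That formulation is an assumption here, and the exact normalization used in this work may differ. The leaf assignments are hypothetical.

```python
from statistics import median


def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two vectors land on the same leaf node."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def raw_outlier_scores(leaf_ids):
    """leaf_ids[i][t] is the leaf reached by training vector i in tree t."""
    n = len(leaf_ids)
    scores = []
    for i in range(n):
        sq = sum(proximity(leaf_ids[i], leaf_ids[j]) ** 2 for j in range(n) if j != i)
        # Assumed raw score: N divided by the summed squared proximity.
        scores.append(n / sq if sq > 0 else float("inf"))
    return scores


def normalized_outlier_scores(raw):
    """Centre on the median and scale by the median absolute deviation.

    Raises ZeroDivisionError if the median absolute deviation is zero.
    """
    centre = median(raw)
    mad = median([abs(r - centre) for r in raw])
    return [(r - centre) / mad for r in raw]


# Four hypothetical training vectors evaluated by a four-tree ensemble.
leaf_ids = [
    [1, 1, 1, 1],
    [1, 1, 1, 2],
    [1, 1, 2, 2],
    [1, 3, 3, 3],  # rarely shares a leaf with the others
]
scores = normalized_outlier_scores(raw_outlier_scores(leaf_ids))
```

The last vector shares a leaf with the others in only one tree out of four, so it receives by far the largest score.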
6.2.2 Prediction Confidence for Execution Time
To begin with, how node statistics and the outlier measure affect the error of execution
time predictions will be examined. After this, prediction confidence of power consumption
predictions will be examined.
As the outlier measure only has meaning within the data set it was calculated from,
it is important to have an idea of how the training set relates to itself. That is, how
close the training vectors are to one another will determine how test vectors relate
to the same training vectors. Figure 6.11 shows the distribution of outlier measures for
all training vectors. This analysis continues to use the cross-validation folds described in
Section 5.2. The figure shows the distribution of outlier measures of all training sets
against all other training sets.
How a test vector’s outlier measure relates to these values will determine the probability
of that test vector having larger error in the prediction. That is, a test vector having a
large outlier measure with respect to the values of the outlier measures for the training
set will be more likely to have more error in its predicted value.
First, a relationship must be established between the error of a given input test vector,
the outlier measure of that vector, and the node statistics associated with the prediction
of that vector.









Figure 6.11: Execution time: Training set outlier measures histogram
The outlier measure is plotted versus the absolute error across all test sets, shown in
Figure 6.12. It can be seen that if a test vector has an outlier measure between 4 and
10, it has a higher probability of having more error. Intuitively, a test vector having an
outlier measure greater than 10 is likely to also have higher error due to the nature of
the metric. However, there were not enough test cases here to adequately characterize
this space.
Figure 6.13 shows how the mean, median, and interquartile range (IQR) of the absolute
error change with the outlier measure. This shows that as the outlier measure of a prediction
increases, so does the error and the range of this error.
Node probability is plotted versus absolute error in Figure 6.14. Here, only nodes
with very small probability contain high error in the test set; this probability is around
$1 \times 10^{-4}$, which corresponds to the minimum node size of a tree.
















Figure 6.12: Execution time: Absolute error versus outlier measure























Figure 6.13: Execution time: Mean, median, and IQR versus outlier measure


















Figure 6.14: Execution time: Absolute error versus node probability
From these, it is reasonable to hypothesize that a test vector having an outlier measure
greater than 4 and a node probability between 0 and $1 \times 10^{-4}$ will have a greater
probability of containing high error. To test this, the set of test vectors meeting these
conditions (unstable) was compared against the set of those that did not (stable). The mean,
median, and IQR of their distributions are shown in Table 6.1.
Table 6.1: Test vector distribution statistics
Algorithm    Mean A (ms)    Median A (ms)    IQR
Stable       583            85               580
Unstable     3609           1690             3692
This table shows that the values of absolute error for the unstable test vectors are
more dispersed and have a higher probability of having more error than stable vectors.
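The stable/unstable split and the summary statistics can be sketched as below. The thresholds are those quoted in the text for execution time; the function names and error values are hypothetical, and the quartiles are computed with the default method of Python's `statistics.quantiles`, which may differ slightly from the method used to produce Table 6.1.

```python
from statistics import mean, median, quantiles


def is_unstable(outlier_measure, node_probability,
                outlier_threshold=4.0, probability_threshold=1e-4):
    """Flag a prediction as unstable using the execution time thresholds."""
    return (outlier_measure > outlier_threshold
            and 0.0 <= node_probability <= probability_threshold)


def summarize(abs_errors):
    """Return (mean, median, IQR) of a list of absolute errors."""
    q1, _, q3 = quantiles(abs_errors, n=4)
    return mean(abs_errors), median(abs_errors), q3 - q1


# Hypothetical absolute errors (ms) for a group of test vectors.
errors_ms = [10, 20, 30, 40, 50, 60, 70]
summary = summarize(errors_ms)
```

In use, each test vector would be assigned to one of the two groups with `is_unstable`, and `summarize` would be applied to the absolute errors of each group separately.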
6.2.3 Prediction Confidence for Core Power
This section will show the results of providing prediction confidence for core power
predictions. The method of evaluation is identical to that of execution time.
The distribution of training vector outlier measures with respect to all other training
vectors is shown in Figure 6.15.








Figure 6.15: Core power: Training set outlier measures
The errors of core power predictions, though significantly smaller than those of execution
time, are shown against the outlier measure in Figure 6.16. The mean, median, and
interquartile range are shown in Figure 6.17, which shows that the mean and median
increase as the outlier measure reaches a value of around 17. This increase is nearly
fivefold relative to the mean and median error at smaller values of the outlier measure.
Absolute error as a function of node probability is shown in Figure 6.18. Unlike the
execution time case, the node probabilities for the core power predictions are not widely
distributed.
The highest error falls at the same points as in the execution time case, which is at roughly
the mean leaf node probability for all trees. Because of this, it is unlikely that prediction
confidence as a function of node probability will yield consistent results, meaning only
the outlier measure will be given to designers for core power consumption.

Figure 6.16: Core power: Absolute error versus outlier measure

Figure 6.17: Core power: Mean, median, and IQR versus outlier measure




















Figure 6.18: Core power: Absolute error versus node probability
Similar to the execution time case, predictions with an outlier measure over a certain
value are considered unstable and have higher error than those predictions with a lower
outlier measure. As mentioned above, this value is 17. The results of applying this to
the data are shown in Table 6.2.
Table 6.2: Core power: Test vector distribution statistics
Prediction type    Mean A (mW)    Median A (mW)    IQR
Stable             2.47           1.80             2.51
Unstable           3.95           3.11             4.11
6.2.4 Prediction Confidence for Device Power
This section will show the results of providing prediction confidence for device power
predictions. The method of evaluation is identical to that of execution time and core
power.
The distribution of training vector outlier measures with respect to all other training
vectors is shown in Figure 6.19, showing a similar distribution as the core power case.








Figure 6.19: Device power: Training set outlier measures
The error of device power predictions versus outlier measure is shown in Figure 6.20.
Unlike the previous two cases, the figure suggests no meaningful relationship between
outlier measure and prediction error. This is confirmed by the mean, median, and IQR
of the error versus the outlier measure, shown in Figure 6.21.
Absolute error as a function of node probability is shown in Figure 6.22. As in the core
power case, the probabilities at which the highest error falls cannot be used to
assist in providing prediction confidence for device power predictions.
















Figure 6.20: Device power: Absolute error versus outlier measure






















Figure 6.21: Device power: Mean, median, and IQR versus outlier measure


















Figure 6.22: Device power: Absolute error versus node probability
Ultimately, the error in device power predictions cannot be characterized using either the
outlier measure or node statistics.
6.3 Estimating Effects of Single Event Upsets
The purpose of this section is to show how the framework can estimate the effects of
single event upsets (SEUs) that can occur in the arithmetic hardware of the FPGA.
When an SEU occurs in a multiplier or divider, it must first be detected and then
fixed by reconfiguration of the FPGA. There is a cost associated with the time between
the SEU occurring and the reconfiguration of the device. During this time,
use of the affected arithmetic hardware must be discontinued and execution
stopped or, for more fault-tolerant designs, multiplications or divisions must be emulated
in software until reconfiguration can be completed. SEUs are typically called “soft errors”
as they are temporary and only affect the data stored by the affected element.
Ultimately, the goal of this work is to show designers the potential cost of failures due
to SEUs in their final design. Using this information, they can make more informed
decisions about the viability of their algorithm and design with respect to SEUs.
First, the background of SEUs will be discussed. This includes the sources of SEUs
as well as current and future issues. Then, the framework’s ability to
accurately predict the effect of SEUs in multipliers and dividers will be examined.
6.3.1 Background
SEUs are the result of ionizing radiation that causes a change of state in a memory cell,
such as SRAM. This radiation can come from a few sources, mainly the
collision of high-energy cosmic rays and solar particles with the upper atmosphere; this
causes a “shower” of high-energy protons and neutrons. In practice, neutrons cause more
problems due to their ability to penetrate many man-made structures. The effect of
these collisions on terrestrial electronics is dependent on latitude [60], longitude [60], and
elevation [66]. According to Microsemi [59], this effect on in-flight airplanes is 100-800
times worse than at sea level. Another source of this ionizing radiation is the
packaging of semiconductors, which contains trace amounts of uranium and thorium
that naturally emit alpha particles as they decay [60].
The amount of charge needed to change the state of a memory cell is referred to
as QCRIT , which is affected by a number of device characteristics. Falling operating
voltages, faster switching speeds, and smaller devices all act to decrease QCRIT . For
instance, the estimated decrease in QCRIT from the 65 nm process to the 45 nm process
was 30%, meaning SRAM (and combinational logic) cells are becoming more susceptible
to SEUs [85]. Therefore it is increasingly important for designers to know the effect of
SEUs on their final designs.
For this work, only SEUs within multipliers are considered. Although compression
algorithms are dominated by memory accesses, arithmetic operations are the basis of
the compression itself. From transforms, distance calculations, or arithmetic coding,
the algorithms rely heavily on the hardware to perform these operations quickly and
accurately. For this work, all SEUs are considered to cause an incorrect result or to
disrupt otherwise normal operation.
When an SEU is detected, use of the arithmetic hardware in question is discontinued
and its function is then emulated in software until the system can be reconfigured. From
this description, there are many aspects of this process which will affect the total cost of
an SEU. First, there are costs incurred between the SEU occurring and the detection
of it. This mainly includes corrupted results from the computations. Second, there are
execution time costs associated with moving the operations into software emulation.
Lastly, there will be costs associated with the reconfiguration of the device. The purpose
of this section is to show the ability of the framework to estimate the total execution
time cost associated with an SEU which occurs in the arithmetic hardware.
6.3.2 Effects on Multipliers
Soft processors use three different types of multipliers: LUT multipliers, embedded
multipliers, and DSP multipliers. However, the Cyclone III development board used
for this work does not contain DSP blocks so these will not be considered. The LUT
multipliers use the FPGA logic to implement a LUT-based multiplier. The embedded
multipliers are dedicated circuits that perform 8-bit by 8-bit multiplications. The NIOS 2
cascades four of these in order to perform its multiplications. This work does not consider
SEUs in the FPGA SRAM, as many techniques already exist to deal with them.
For this work, the following assumptions are made about the system. First, it is
assumed that all SEUs that occur cause errors in the operation or results of the multipliers
which render them unable to perform correctly. Second, it is assumed that SEUs are
detected by hardware which is not considered here. Lastly, once an SEU is detected,
multiply operations are switched away from the affected hardware and emulated in software
until the FPGA can be reconfigured. The time spent in software emulation incurs an execution
time cost, which is the focus of this section. As the detection and repair times are not
dependent on the multiplier type, these will not be considered.
To evaluate the execution time increase due to SEUs on the multipliers, the method of
construction outlined in Section 5.1.5 must be modified. By creating three separate trees,
one for each multiplier type, the differences between the performance of the multipliers
can be seen. This is required for regression trees as differences in a single parameter
(such as multiplier type) may not have an effect on the output; whether they do depends
on which branches of the tree are used during the evaluation of a test vector.
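The use of one model per multiplier type can be sketched as follows. The sketch assumes that the execution time cost of an SEU is the difference between the predicted execution time with software-emulated multiplication and the predicted execution time with the hardware multiplier. The dictionary of models, the function name, and the constant predictions are placeholders for the trained trees.

```python
def seu_cost(predict_time, architecture, hardware_multiplier):
    """Estimated extra execution time while multiplies are emulated in software.

    predict_time: dict mapping multiplier type -> callable(architecture) -> ms.
    hardware_multiplier: "logic" or "embedded".
    """
    degraded = predict_time["software"](architecture)
    nominal = predict_time[hardware_multiplier](architecture)
    return degraded - nominal


# Stand-in models returning fixed, hypothetical execution times (ms).
models = {
    "software": lambda arch: 1500.0,
    "logic": lambda arch: 1100.0,
    "embedded": lambda arch: 1000.0,
}
arch = {"icache_bytes": 8192, "dcache_bytes": 8192}
cost_embedded = seu_cost(models, arch, "embedded")
cost_logic = seu_cost(models, arch, "logic")
```

Because each multiplier type has its own model, the multiplier type always affects the prediction, which a single tree containing multiplier type as an ordinary split variable would not guarantee.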
This section is broken into two separate parts. The first shows how the models can be
used to provide guidance to designers. The second provides validation of the model.
Providing Design Guidance
It is beneficial for designers to know the cost of SEUs in the final design.
The cost of an SEU differs between architectures. It depends on the
multiplier type as well as other parameters such as cache size. Designers can assess the
effect of SEUs on their design in two ways. The first is seeing the cost of SEUs on a
specific architecture. The second is a design space exploration (DSE) approach that
allows the designer to see a number of architectures and their associated SEU cost.
A basic way for designers to assess the sensitivity of their design to SEUs is to input
their specific architecture parameters into the SEU effect prediction framework. To show
this, an example algorithm is used for this section. Figure 6.23 shows this simple mode
of operation which gives the execution time cost for an SEU.
Figure 6.23: SEU execution time cost estimation of a single architecture
As SEUs become a bigger problem, designs may need to be chosen based on their
low cost of SEUs. To this end, a more comprehensive approach is needed that shares
concepts with the DSE approach demonstrated in Section 6.1.
Figure 6.24 shows what the designer would see. The figure shows the optimal
architectures when considering execution time and total power consumption. The cost of an
SEU determines the size of the marker on the graph; larger markers indicate a higher
cost of an SEU.















[Example data-point tooltip: I Cache size: 32768; D Cache size: 8192; D Cache line size: 16; I Cache burst: 0]
Figure 6.24: SEU cost estimate of optimal architectures
Embedded multipliers are faster than logic-based ones and therefore will almost always
incur a higher cost. However, logic multipliers use more SRAM cells and routing resources
which means they are more susceptible to SEUs.
To account for this, the frequency of SEUs must be taken into account. By scaling
the cost of the SEUs by the relative frequencies of SEUs, a clearer picture can
be delivered to the designer. By creating sample designs that contain only a single
multiplier of a particular type, it was seen that logic multipliers use roughly 10 times as
many LUTs as the embedded multipliers. A consequence of this is that logic multipliers
are roughly 10 times more likely to incur an SEU. The scaled version is shown in Figure 6.25.
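The scaling step amounts to weighting each SEU cost by the relative likelihood of an SEU for that multiplier type. The sketch below uses the factor of 10 quoted in the text; the function name, the dictionary, and the cost values are hypothetical.

```python
# Relative SEU likelihood, taken from the roughly 10x LUT usage quoted in the text.
RELATIVE_SEU_FREQUENCY = {"logic": 10.0, "embedded": 1.0}


def scaled_seu_cost(cost_ms, multiplier_type, frequency=RELATIVE_SEU_FREQUENCY):
    """Weight the execution time cost of an SEU by its relative frequency."""
    return cost_ms * frequency[multiplier_type]


# Hypothetical unscaled costs (ms): the embedded multiplier costs more per SEU.
scaled_logic = scaled_seu_cost(400.0, "logic")
scaled_embedded = scaled_seu_cost(500.0, "embedded")
```

In this example the embedded multiplier has the higher cost per SEU, but after scaling the logic multiplier carries the higher expected cost.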















[Example data-point tooltip: I Cache size: 32768; D Cache size: 8192; D Cache line size: 16; I Cache burst: 0]
Figure 6.25: Normalized SEU cost estimate of optimal architectures
Validation
The framework gives useful estimates of SEU costs but their accuracy must also be
investigated. The accuracy depends on how accurately the models can determine the
effect of changing the multiplier type on the total execution time. This section
shows that this method of estimating the effects of SEUs produces accurate results.
To validate the results, attention must be given not only to the accuracy of the models
in general, but also to the accuracy of predictions when considering specific multiplier
types. Consider Table 6.3, which shows the mean relative error per fold by multiplier
type. In general, prediction error for the software multiplier case is higher than that
of the embedded or LUT multiplier case. This is likely due to the increased latency of
operations; more cycles spent in multiplication mean that these operations are more
susceptible to effects that are difficult to predict, such as pipeline stalls.
Consequently, this can cause higher error in an SEU cost prediction.
Table 6.3: Execution time per multiplier performance
Mean relative error (%)
Combination Software Logic Embedded
DCT 4x4+EBCOT 1.2 0.8 0.9
DCT 8x8+EBCOT 1.0 0.8 0.8
DCT Float+EBCOT 26.7 23.5 23.1
DCT Fast+EBCOT 18.2 16.5 15.7
DCT Float+Huffman 79.4 75.2 75.4
DCT Fast+Huffman 8.4 7.8 7.9
DCT 4x4+Huffman 15.8 6.7 6.1
DCT 8x8+Huffman 9.4 4.6 3.5
DCT 4x4+JXR 4.8 3.3 3.0
DCT 8x8+JXR 3.4 2.6 2.1
DCT Float+JXR 28.8 21.8 23.7
DCT Fast+JXR 3.1 3.6 3.5
DWT Irr.+EBCOT 24.9 14.3 14.9
DWT Irr.+Huffman 20.0 13.6 13.9
DWT Irr.+JXR 47.4 36.5 37.4
DWT Rev.+EBCOT 5.4 1.5 1.5
DWT Rev.+Huffman 18.0 7.2 7.7
DWT Rev.+JXR 3.4 4.3 4.0
Fractal 90.2 87.3 87.3
PCT+EBCOT 1.8 0.7 0.8
PCT+Huffman 23.1 5.3 5.4
PCT+JXR 6.5 3.3 3.0
Vector Quantization 11.7 43.0 40.2
WebP (method 2) 51.6 31.6 32.1
WebP (method 4) 9.7 13.8 13.5
Mean 20.6 17.2 17.1
Table 6.4 shows the cost error performance of the model when looking at the SEU
cost of logic and embedded multipliers. The cost error c is the difference between the
predicted cost and the actual cost. That is,
c = |Cactual − Cpredicted| .
The normalized cost error is also used here as it can overcome one of the problems
of using regression trees. As discussed in Section 3.3, predictions given by regression
trees are determined by the mean of the response variable at each leaf node. Because of
this averaging effect, large values tend to be reduced. This introduces a scale problem
when comparing similar sets of data with different scales. In order to work around this,
the costs were normalized to their respective largest values, changing the range to [0, 1].
This allows for a direct comparison to be made between the predicted and actual costs.
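The normalization step can be sketched as follows. This is a minimal Python example that assumes non-negative costs and that the actual and predicted costs are each divided by their own largest value before the error is taken; the helper names and the numbers are hypothetical.

```python
def normalize(costs):
    """Scale non-negative costs to [0, 1] by dividing by the largest value."""
    largest = max(costs)
    if largest == 0:
        return [0.0 for _ in costs]
    return [cost / largest for cost in costs]


def normalized_cost_errors(actual, predicted):
    """Absolute difference between the separately normalized cost sets."""
    return [abs(a - p) for a, p in zip(normalize(actual), normalize(predicted))]


# Hypothetical costs: the predictions sit on a smaller scale than the
# actual values, as might happen when leaf averaging shrinks large values.
actual_costs = [10.0, 40.0, 80.0]
predicted_costs = [6.0, 20.0, 40.0]

errors = normalized_cost_errors(actual_costs, predicted_costs)
print([round(e, 3) for e in errors])
```

In this toy data the predictions are roughly half the actual values, yet the normalized errors are small because the two sets have nearly the same shape once the difference in scale is removed.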
By using both Table 6.3 and Table 6.4, the utility of SEU cost prediction can be
validated. At first glance, the relative cost error is high. On further inspection, the
combinations with the highest relative cost error (PCT+Huffman, WebP method 2, and
nearly any other that uses Huffman coding) all have a characteristic in common: these
methods contain relatively few multiply operations. With few multiplications, the actual
SEU cost is small, so even a small absolute error yields a large relative error. These same
combinations show the same trend when considering the normalized cost error. The
normalization process reduces the problem of scale in many of the combinations, most
notably in the WebP- and EBCOT-based combinations. Though normalization is not useful
to designers using this tool, it does show that the relative costs when comparing logic
and embedded multipliers are still valid.
Table 6.4: Execution time cost error performance
                          Relative error    Normalized error
Combination              Logic  Embedded     Logic  Embedded
DCT 4x4+EBCOT             10.1       7.3     0.014     0.014
DCT 8x8+EBCOT              7.4       5.7     0.016     0.011
DCT Float+EBCOT           37.0      35.2     0.046     0.048
DCT Fast+EBCOT            35.0      27.9     0.029     0.027
DCT Float+Huffman         85.2      85.3     0.101     0.109
DCT Fast+Huffman          29.5      28.3     0.123     0.121
DCT 4x4+Huffman           65.8      66.3     0.132     0.149
DCT 8x8+Huffman           34.3      37.2     0.124     0.110
DCT 4x4+JXR               13.9      12.7     0.031     0.027
DCT 8x8+JXR                9.1       9.1     0.022     0.021
DCT Float+JXR             37.7      39.9     0.072     0.067
DCT Fast+JXR               7.2       6.5     0.030     0.032
DWT Irr.+EBCOT            46.2      45.3     0.061     0.071
DWT Irr.+Huffman          24.5      24.3     0.066     0.064
DWT Irr.+JXR              56.2      55.8     0.150     0.130
DWT Rev.+EBCOT            34.5      25.0     0.034     0.031
DWT Rev.+Huffman          88.2      88.3     0.579     0.629
DWT Rev.+JXR              16.6      14.1     0.089     0.072
Fractal                   92.9      92.8     0.140     0.139
PCT+EBCOT                 13.2      11.9     0.017     0.011
PCT+Huffman             2224.8    1987.3     0.409     0.935
PCT+JXR                   24.3      22.0     0.033     0.029
Vector Quantization       23.4      23.8     0.114     0.113
WebP (method 2)           71.2      70.4     0.076     0.193
WebP (method 4)           16.7      16.5     0.087     0.085
Mean                     124.2     113.5     0.104     0.130
6.4 Conclusions
This chapter outlined a number of ways that this high-level framework provides design
guidance early in the design phase.
An important part of design guidance is help in design space exploration. This chapter
showed how the framework can assist designers in selecting properly sized caches in
terms of memory block usage. The framework can display the optimal architectures in
terms of power and execution time along with their respective memory block usage. In
turn, this allows designers to make smarter choices to create more efficient hardware
with fewer resources.
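One way to picture this kind of design space exploration is a Pareto filter over predicted power and execution time, reported together with memory block usage. The sketch below is only an illustration: treating "optimal" as Pareto-optimal is an assumption, and the `Design` type, its field names, and all values are hypothetical.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    power_mw: float      # predicted power (hypothetical values)
    time_s: float        # predicted execution time (hypothetical values)
    memory_blocks: int   # memory blocks used by the caches


def dominates(a, b):
    """True if a is no worse than b in power and time, and better in one."""
    no_worse = a.power_mw <= b.power_mw and a.time_s <= b.time_s
    better = a.power_mw < b.power_mw or a.time_s < b.time_s
    return no_worse and better


def pareto_front(designs):
    """Keep the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


candidates = [
    Design("2 KB cache", 310.0, 1.90, 4),
    Design("8 KB cache", 325.0, 1.20, 12),
    Design("32 KB cache", 360.0, 1.15, 44),
    Design("8 KB cache, oversized", 340.0, 1.25, 20),
]

front = pareto_front(candidates)
for design in front:
    print(f"{design.name}: {design.power_mw} mW, {design.time_s} s, "
          f"{design.memory_blocks} memory blocks")
```

Listing the memory blocks next to each surviving design lets a designer see, for example, that the largest cache buys only a small reduction in execution time for a large increase in block usage.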
There are conditions in which the underlying models give predictions with higher
error. For a prediction, its input vector is compared to the training data in order to
determine its similarity. These similarity statistics are then used to provide the designer
with a measure of confidence in the prediction; an outlier measure that is too large means
the input vector may be too different from the training data, a characteristic that can
result in a prediction with a larger error. Rules for prediction confidence were derived
for the core power and execution time models, while the device power model displayed
unpredictable effects due to outlier measure and therefore no rules could be
defined.
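The idea of flagging predictions whose inputs lie far from the training data can be sketched with a simple distance-based stand-in. This is not the outlier measure used by the framework: the scaled nearest-neighbour distance, the threshold of 1.0, and the two-feature training vectors below are all assumptions made for illustration.

```python
import math
import statistics


def feature_spreads(train):
    """Population standard deviation per feature (1.0 if a feature is constant)."""
    return [statistics.pstdev(column) or 1.0 for column in zip(*train)]


def outlier_measure(x, train):
    """Scaled Euclidean distance from x to its nearest training vector."""
    spreads = feature_spreads(train)
    return min(
        math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, t, spreads)))
        for t in train
    )


def is_low_confidence(x, train, threshold=1.0):
    """Hypothetical rule: distrust predictions whose measure exceeds threshold."""
    return outlier_measure(x, train) > threshold


# Hypothetical training vectors: (cache size in KB, operations in thousands).
training_vectors = [(8, 1000), (16, 2000), (32, 4000), (64, 8000)]
near = (16, 2100)     # close to an existing training point
far = (512, 90000)    # well outside the training range

print(is_low_confidence(near, training_vectors))
print(is_low_confidence(far, training_vectors))
```

A rule of this shape only signals that an input is unusual; whether unusual inputs actually produce larger errors has to be established per model, as was done for the core power and execution time models.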
Finally, the effects of SEUs were estimated. In a system where SEUs in arithmetic
hardware are detected and the affected operations are then moved to software, the
framework can analyze the execution time cost of SEUs as a function of the algorithm.
7 Conclusions and Future Work
7.1 Summary of conclusions
Performance modeling plays an important role in the design process as it can produce
much more efficient hardware that fulfills the demands of consumers who want faster
computations and longer battery life. These efficiency gains can be further increased
when the predictions can be given earlier in the design phase where more significant
decisions are made. This work provides a useful tool for designers that not only allows
early predictions to be made using architecture and high-level algorithm parameters,
but also allows design guidance to be given for resource usage and effects of SEUs.
Image compression methods on FPGA-based soft processors are popular due to
short time-to-market and the flexibility of the processors themselves. By using these
flexible processors and the framework presented here, designers can make efficient use of
the resources present on the FPGA. Designers will make greater use of FPGA-based
processors as the FPGAs themselves become faster and more versatile.
Soft processors are used here as they provide a quick and configurable way for system
designers to use microprocessors. The design space for soft processors can be very large.
It would take 25 days to generate all possible architectures for only a basic set of design
parameters. This can increase by three to six orders of magnitude with the addition
of other soft components. For this work, soft processors allow the relationship between
the architecture and application to be explored, such as the ability to see how cache
size affects the performance of different applications. By having an accurate model
of performance using high-level parameters, designers can quickly estimate how their
particular system will perform, long before an implementation is needed.
The construction of an accurate model is the most important part of providing such a
tool. To do this, a method for extracting high-level parameters from algorithms had to
be devised. This was done, in part, by decomposing algorithms into their basic building
blocks, then separating them by computational and logical distinctions. Some of the
algorithm parameters are extracted by the designer using knowledge of the algorithm.
Examples include arithmetic precision, working set sizes, and the presence of arbitrary
multiplication. The remaining parameters must be estimated by the designer, such as
the total number of memory or arithmetic operations. In doing this, the knowledge of
the designer is leveraged to create a more accurate model.
Appropriate models can be constructed as a function of these high-level parameters.
In this study, three accurate models of relevant performance metrics, FPGA core power,
off-chip device power, and execution time, were constructed using appropriate algorithm
parameters for image compression as well as architecture parameters of the single- or
dual-processor system. The power models were the most accurate; the core and device
models have an average relative error of 1.3% and 1.6%, respectively. The execution
time model had a higher error with an average relative error of 18.3%. While the power
models contained little error, the execution time model performance was more variable.
This was largely due to the cross-validation process and the results are still indicative of
a useful predictive model of execution time using high-level algorithm parameters. At
the same time, it was discussed how similar analysis could be applied in order for the
models to be generalized to other system configurations and technologies. The tool also
examines the sensitivity of the model due to errors introduced by including humans in
the estimation process. Analysis showed that the power models had very little sensitivity
to human error; the execution time model showed low sensitivity to all but the most
extreme cases of misestimation.
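A sensitivity check of this kind can be sketched by scaling one designer-estimated parameter and measuring the relative change in the prediction. The piecewise-constant `toy_time_model` below is a hypothetical stand-in for a tree-based model and is not the model built in this work; the thresholds and operation counts are illustrative only.

```python
def toy_time_model(memory_ops, arithmetic_ops):
    """Hypothetical piecewise-constant execution time model (seconds)."""
    base = 1.0 if memory_ops < 1_000_000 else 2.5
    extra = 0.5 if arithmetic_ops < 500_000 else 1.5
    return base + extra


def sensitivity(model, params, name, scale):
    """Relative change in the prediction when params[name] is scaled."""
    baseline = model(**params)
    perturbed = dict(params)
    perturbed[name] = params[name] * scale
    return abs(model(**perturbed) - baseline) / baseline


# Hypothetical designer estimates of the operation counts.
estimate = {"memory_ops": 600_000, "arithmetic_ops": 200_000}

for scale in (0.5, 0.9, 1.1, 2.0):
    change = sensitivity(toy_time_model, estimate, "memory_ops", scale)
    print(f"memory_ops x{scale}: prediction changes by {100 * change:.0f} %")
```

In this toy model only the doubling of `memory_ops` crosses a split threshold and changes the prediction, which illustrates how a piecewise-constant model can tolerate moderate misestimation of its inputs.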
Extending beyond this basic model, this tool provides useful design guidance in a
number of ways. Design space exploration can be performed using the tool, detailing the
optimal architectures based on power consumption, execution time, and resource usage.
These resources can either be the memory blocks used for caches or logic element usage.
Confidence for individual predictions can also be given for the core power and execution
time models using the relationship between the input vector and training data. Due
to the nature of the off-chip device power, prediction confidence could not be given. It
was shown that input vectors that had significant differences from the training data had
statistically larger error than those similar to the training data. Finally, the effects of
single event upsets (SEUs) are explored. By assuming that SEUs in multipliers or dividers
cause an incorrect result, the effects of their introduction can be predicted. This was
examined under the assumption that, in the event of an SEU, computations were moved
from hardware to software emulation. Accurately predicting the cost of SEUs relies on
accurate underlying predictive models that are able to model the effect of multiplier
(or divider) types. The tool is able to predict with a mean relative error of 20% the
potential effects of SEUs and the implications on the system for a particular algorithm.
Ultimately, this work has shown a useful tool for designers that allows them to create
more efficient hardware through models that understand the close relationship between
algorithms and the architectures on which they are implemented.
7.2 Directions for future work
At the moment, the models are only used to predict the power and performance of an
FPGA-based soft-processor system using SDRAM. However, by characterizing differences
in implementations and power consumption, the model could become more general and
allow for larger design space exploration among not only architecture parameters but
different architectures as well. Specifically, the device power model depends most on the
off-chip memory and not on the implementation of the processor. A different memory
technology, such as SRAM or flash memory, could be characterized and used as an
additional parameter in the early design space exploration. The core power model is
specific to the particular FPGA used here. However, the model could be extended to
other FPGAs and soft processors by applying the same process as described here and
subsequently including this in design space exploration.
The subject of optimizations was discussed in Section 5.5.1 where the predictive
performance between variations of the DCT was examined. Ideally, this could be further
extended to include other types of low-level optimization, such as register reuse, custom
instructions, or pipeline depth. In addition to algorithm concerns, additional support
for hardware would be beneficial. This would include concepts such as custom hardware,
heterogeneous systems, or systems that contain more than two soft processors.
With additional memory technology, datapath, and algorithm configurations considered,
the framework would support a form of hardware/software co-design in addition to design
space exploration, all of which could be performed early in the design process.
175
Bibliography
[1] Altera, “Instantiating the NIOS 2 Processor,” Altera, Tech. Rep., 2013. [Online]. Available:
http://www.altera.co.uk/literature/hb/nios2/n2cpu nii51004.pdf
[2] O. Azizi, A. Mahesri, B. C. Lee, S. Patel, and M. Horowitz, “Energy-performance tradeoffs
in processor architecture and circuit design: a marginal cost analysis,” in Computer
Architecture Symposium, Proceedings, 2010, 2010, pp. 26–36.
[3] J. R. Bammi, “Software Performance Estimation Strategies in a System-Level Design Tool,”
CODES, pp. 82–86, 2000.
[4] M. F. Barnsley and A. D. Sloan, “Methods and apparatus for image compression by
iterated function system,” 1987.
[5] M. Bellato, P. Bernardi, D. Bortolato, A. Candelori, M. Ceschia, A. Paccagnella, M. Rebau-
dengo, M. Sonza Reorda, M. Violante, and P. Zambolin, “Evaluating the effects of SEUs
affecting the configuration memory of an SRAM-based FPGA,” in Design, Automation,
and Test in Europe, 2004.
[6] C. Brandolese, W. Fornaciari, F. Salice, and D. Scuito, “Fast Software-Level Power
Estimation for Design Space Exploration,” Politecnico di Milano, Tech. Rep., 1999.
[7] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, Aug.
1996.
[8] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5 – 32, 2001.
[9] L. Breiman, J. H. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees.
Wadsworth, 1984.
176
[10] G. Bronevetsky and B. de Supinski, “Soft error vulnerability of iterative linear algebra
methods,” in Proceedings of the 22nd annual international conference on Supercomputing -
ICS ’08. New York, New York, USA: ACM Press, 2008, p. 155.
[11] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-level
Power Analysis and Optimizations,” SIGARCH Comput. Archit. News, vol. 28, no. 2, pp.
83–94, May 2000.
[12] G. Brown, J. Wyatt, R. Harris, and X. Yao, “Diversity creation methods: a survey and
categorisation,” Information Fusion, vol. 6, no. 1, pp. 5–20, Mar. 2005.
[13] G. Callou, P. Maciel, and E. Carneiro, “A Formal Approach for Estimating Embedded
System Execution Time and Energy Consumption,” in PATMOS, 2009, pp. 379–388.
[14] D. M. Cambre, E. Boemo, and E. Todorovich, “Arithmetic Operations and Their En-
ergy Consumption in the Nios II Embedded Processor,” in International Conference on
Reconfigurable Computing and FPGAs, 2008, Dec. 2008, pp. 151–156.
[15] D. Cambre, “Energy evaluation in the Nios II processor as a function of cache sizes,”
Southern Conference on Reprogrammable Logic, pp. 55–61, 2008. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=4547732
[16] B. Carrion Schafer and K. Wakabayashi, “Machine learning predictive modelling high-level
synthesis design space exploration,” IET Computers & Digital Techniques, vol. 6, no. 3, p.
153, 2012.
[17] C. B. Cho, W. Zhang, and T. Li, “Informed Microarchitecture Design Space Exploration
Using Workload Dynamics,” in IEEE/ACM International Symposium on Microarchitecture,
Proceedings, 2007, 2007, pp. 274–285.
[18] T. L. Chou and K. Roy, “Accurate power estimation of CMOS sequential circuits,” Very
Large Scale Integration (VLSI) Systems, Transactions, vol. 4, no. 3, pp. 369–380, 1996.
[19] T. L. Chou and K. Roy, “Statistical estimation of combinational and sequential CMOS
digital circuit activity considering uncertainty of gate delays,” in Asia and South Pacific
Design Automation Conference. Ieee, 1997, pp. 95–100.
177
[20] J. Clarke, A. Gaffar, G. Constantinides, and P. Y. K. Cheung, “Fast word-level power
models for synthesis of FPGA-based arithmetic,” in 2006 IEEE International Symposium
on Circuits and Systems. Ieee, 2006, p. 4.
[21] G. De’ath and K. E. Fabricius, “Classification and regression trees: a powerful yet simple
technique for ecological data analysis,” Ecology, vol. 81, no. 11, pp. 3178–3192, 2000.
[22] R. Dimond, O. Mencer, and W. Luk, “Application-specific customisation of multi-threaded
soft processors,” in Computers and Digital Techniques, Proceedings, 2006, 2006, pp. 173–
180.
[23] R. Enzler, T. Jeger, D. Cottet, and G. Tro¨ster, “High-level area and performance estimation
of hardware building blocks on FPGAs,” in Field-Programmable Logic and Applications,
2000, pp. 525–534.
[24] R. M. Fano, “The transmission of information,” Massachusetts Institute of Technology,
Tech. Rep., 1949.
[25] Y. Fei, S. Ravi, A. Raghunathan, and N. K. Jha, “A hybrid energy-estimation technique
for extensible processors,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 23, no. 5, pp. 652 – 664, 2004.
[26] M. Felkin, “Comparing Classification Results between N-ary and Binary Problems,” Quality
Measures in Data Mining, no. x, 2007.
[27] Y. Fisher, Fractal Image Compression. Springer, 1995.
[28] Y. Fisher, E. W. Jacobs, and R. D. Boss, “Fractal image compression using iterated
transforms,” in Image and text compression. Boston: Kluwer Academic, 1992, pp. 35 –
62.
[29] N. Fournel, A. Fraboulet, and P. Feautrier, “Fast and Accurate Embedded Systems Energy
Characterization Using Non-intrusive Measurements,” in PATMOS, 2007, pp. 10–19.
[30] J. E. Fowler, “QccPack: An Open-Source Software Library for Quantization, Compression,
and Coding,” in Applications of Digital Image Processing Conference, Proceedings, 2000 ,
2000, pp. 294 – 301.
178
[31] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The
annals of statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[32] C. Gini, “Concentration and dependency ratios,” Rivista di Politica Economica, vol. 87,
pp. 769–789, 1997.
[33] P. Giusto, G. Martin, and E. Harcourt, “Reliable estimation of execution time of embed-
ded software,” in Proceedings Design, Automation and Test in Europe. Conference and
Exhibition 2001. IEEE Comput. Soc, 2001, pp. 580–588.
[34] T. D. Givargis, F. Vahid, and J. Henkel, “Evaluating power consumption of parameterized
cache and bus architectures in system-on-a-chip designs,” VLSI Systems, IEEE Transactions
on, vol. 9, no. 4, pp. 500–508, 2001.
[35] Google, “WebP Image Format,” 2012. [Online]. Available: https://developers.google.com/
speed/webp/
[36] S. Gupta and F. N. Najm, “Power Macromodeling For High Level Power Estimation,” in
Design Automation Conference. Ieee, 1997, pp. 365–370.
[37] S. Gupta and F. N. Najm, “Power modeling for high-level power estimation,” Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 1, pp. 18–29, 2000.
[38] S. Gurumurthi, A. Sivasubramaniam, M. J. Irwin, and N. Vijaykrishnan, “Using complete
machine simulation for software power estimation: The softwatt approach,” in High-
Performance Computer Architecture, 2002.
[39] P. Hallschmid and R. Saleh, “Fast Design Space Exploration Using Local Regression
Modeling With Application to ASIPs,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 27, no. 3, pp. 508–515, 2008.
[40] C. X. Huang, B. Zhang, A. C. Deng, and B. Swirski, “The design and implementation
of PowerMill,” Proceedings of the 1995 international symposium on Low power design -
ISLPED ’95, pp. 105–110, 1995.
[41] D. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings
of the IRE, vol. 27, 1952.
179
[42] M. Ibrahim, M. Rupp, and H. Fahmy, “A precise high-level power consumption model for
embedded systems software,” EURASIP J. Embedded Syst., Jan. 2011.
[43] IJG, “Indepedent JPEG Group,” 2012. [Online]. Available: http://www.ijg.org/
[44] E. Ipek, S. McKee, K. Singh, R. Caruana, B. de Supinski, and M. Schulz, “Efficient
architectural design space exploration via predictive modeling,” ACM Transactions on
Architecture and Code Optimization, vol. 4, no. 4, pp. 1–34, Jan. 2008.
[45] ITU Standard T.832, “JPEG XR Image coding system – Image coding specification,”
2009.
[46] R. Jevtic and C. Carreras, “Analytical high-level power model for lut-based components,”
in PATMOS, 2009, pp. 369–378.
[47] T. Kempf, K. Karuri, S. Wallentowitz, G. Ascheid, R. Leupers, and H. Meyr, “A SW
performance estimation framework for early system-level-design using fine-grained instru-
mentation,” Design Automation and Test in Europe, Proceedings, p. 6 pp., 2006.
[48] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model
selection,” in International joint Conference on artificial intelligence, Proceedings, 1995,
1995, pp. 1137–1143.
[49] N. Kumar, S. Katkoori, L. Rader, and R. Vemuri, “Profile-Driven Behavioral Synthesis for
Low-Power VLSl Systems,” Design & Test of Computers, vol. 12, no. 3, 1995.
[50] P. Langley and H. Simon, “Applications of machine learning and rule induction,” Commu-
nications of the ACM, 1995.
[51] J. Laurent, E. Senn, N. Julien, and E. Martin, “High-level energy estimation for DSP
systems,” PATMOS, 2001.
[52] B. C. Lee and D. Brooks, “Accurate and efficient regression modeling for microarchitectural
performance and power prediction,” in Architectural support for programming languages
and operating systems, Proceedings, 2006, no. 1, 2006, p. 185.
180
[53] B. C. Lee, D. Brooks, B. de Supinski, M. Schulz, K. Singh, and S. McKee, “Methods of
inference and learning for performance modeling of parallel applications,” in Principles
and practice of parallel programming, Proceedings, 2007, 2007, p. 249.
[54] T. Li and L. K. John, “Run-time modeling and estimation of operating system power
consumption,” International Conference on Measurement and Modeling of Computer
Systems, Proceedings, p. 160, 2003.
[55] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory,
vol. 28, no. 2, pp. 129–137, Mar. 1982.
[56] C. Lu and D. A. Reed, “Assessing fault sensitivity in MPI applications,” in Proceedings of
the 2004 ACM/IEEE conference on Supercomputing, 2004.
[57] P. Maciel, R. Martins, R. S. Barreto, and F. F. Carvalho, “Towards a Software Power Cost
Analysis Framework Using Colored Petri Net,” in PATMOS, 2004, pp. 362–371.
[58] H. Mehta, R. M. Owens, and M. J. Irwin, “Instruction level power profiling,” IEEE Inter-
national Conference on Acoustics, Speech, and Signal Processing Conference Proceedings ,
vol. 6, pp. 3326–3329, 1996.
[59] Microsemi, “Understanding Soft and Firm Errors in Semiconductor Devices,” Microsemi,
Tech. Rep. December, 2002. [Online]. Available: www.actel.com/documents/SER FAQ.pdf
[60] Microsemi, “Understanding Single Event Effects (SEEs) in FPGAs,” Microsemi, Tech.
Rep. August, 2011. [Online]. Available: http://www.actel.com/documents/SEE WP.pdf
[61] A. Mohsen and R. Hofmann, “Characterizing Power Consumption and Delay of Function-
al/Library Components for Hardware/Software Co-design of Embedded Systems,” in IEEE
International Workshop on Rapid System Prototyping, 2004.
[62] A. Mohsen, R. Hofmann, and U. Erlangen, “Power Modeling , Estimation , and Optimiza-
tion for Automated Co-design of Real-Time Embedded Systems,” in PATMOS, 2004, pp.
643–651.
181
[63] L. W. Nagel and D. O. Pederson, SPICE: Simulation program with integrated circuit
emphasis. Electronics Research Laboratory, College of Engineering, University of California,
1973.
[64] F. N. Najm, “A survey of power estimation techniques in VLSI circuits,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 446–455, Dec. 1994.
[65] S. Nikolaidis, N. Kavvadias, T. Laopoulos, L. Bisdounis, and S. Blionas, “Instruction Level
Energy Modeling for Pipelined Processors,” in PATMOS, 2003, pp. 279–288.
[66] E. Normand, “Single event upset at ground level,” Nuclear Science, IEEE Transactions
on, vol. 43, no. 6, pp. 2742–2750, 1996.
[67] G. R. Nudd and D. J. Kerbyson, “PACE: A toolset for the performance prediction of
parallel and distributed systems,” Journal of High Performance Computing Applications,
pp. 228–251, 2000.
[68] OpenJPEG Development Team, UCL-ICTEAM, and B. Macq, “OpenJPEG, a JPEG 2000
open-source software suite,” 2012. [Online]. Available: http://www.openjpeg.org/
[69] J. Ou and V. K. Prasanna, “Rapid energy estimation of computations on FPGA based
soft processors,” in SOC Conference, Proceedings, 2004, 2004, pp. 285–288.
[70] M. S. Oyamada, F. R. Wagner, M. Bonaciu, W. Cesario, and A. Jerraya, “Software Perfor-
mance Estimation in MPSoC Design,” 2007 Asia and South Pacific Design Automation
Conference, pp. 38–43, Jan. 2007.
[71] I. Polian and J. P. Hayes, “Transient fault characterization in dynamic noisy environments,”
IEEE Interational Test Conference, pp. 1–10, 2005.
[72] A. Powell, C. Bouganis, and P. Y. K. Cheung, “Early performance estimation of image com-
pression methods on soft processors,” in Field Programmable Logic Conference, Proceedings,
2012, 2012, pp. 587–590.
[73] A. Powell, C. Savvas-Bouganis, and P. Cheung, “High-level power and performance
estimation of FPGA-based soft processors and its application to design space exploration,”
182
Journal of Systems Architecture, vol. 59, no. 10, pp. 1144—-1156, 2013. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1383762113001513
[74] S. R. Powell and P. M. Chau, “A model for estimating power dissipation in a class of DSP
VLSI chips,” Circuits and Systems, IEEE Transactions on., vol. 38, no. 6, pp. 646–650,
1991.
[75] G. Qu, N. Kawabe, K. Usami, and M. Potkonjak, “Function-level power estimation
methodology for microprocessors,” in Proceedings of the 37th Annual Design Automation
Conference, ser. DAC ’00. ACM, 2000, pp. 810–813.
[76] J. M. Rabaey and M. Pedram, Low power design methodologies. Kluwer Academic
Publishers, 1996.
[77] D. Saloman, Data Compression, 4th ed. Springer, 2007.
[78] V. Saxena, F. N. Najm, and I. Hajj, “Monte-Carlo approach for power estimation in
sequential circuits,” in European Design and Test, Proceedings, 1997, pp. 416–420.
[79] M. Schneider, H. Blume, and T. G. Noll, “Power estimation on functional level for
programmable processors,” Advances in Radio Science, vol. 2, pp. 215–219, 2004.
[80] E. Senn, J. Laurent, N. Julien, and E. Martin, “Softexplorer: estimating and optimizing
the power and energy consumption of a C program for DSP applications,” EURASIP J.
Appl. Signal Process., 2005.
[81] L. Senn, E. Senn, and C. Samoyeau, “Modelling the Power and Energy Consumption of
NIOS II Softcores on FPGA,” International Conference on Cluster Computing Workshops,
Proceedings, 2012.
[82] C. E. Shannon, “A mathematical theory of communication,” Bell Laboratories, Tech. Rep.,
1949.
[83] M. Shantharam, S. Srinivasmurthy, and P. Raghavan, “Characterizing the impact of soft
errors on iterative methods in scientific computing,” in Proceedings of the international
conference on Supercomputing - ICS ’11. New York, New York, USA: ACM Press, 2011,
p. 152.
183
[84] D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, and D. Tullsen, “Application-specific
customization of parameterized FPGA soft-core processors,” inIEEE/ACM international
conference on Computer-aided design, Proceedings, 2006, 2006, p. 261.
[85] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, “Modeling the
Effect of Technology Trends on the Soft Error Rate of Combinational Logic,”Proceedings
International Conference on Dependable Systems and Networks, pp. 389–398, 2002.
[86] J. G. Silva, J. Carreira, H. Madeira, D. Costa, and P. Moreira, “Experimental assessment
of parallel systems,” in Proceedings of Annual Symposium on Fault Tolerant Computing.
IEEE Comput. Soc. Press, 1996, pp. 415–424.
[87] A. Sinha and A. P. Chandrakasan, “JouleTrack-a Web based tool for software energy
profiling,” in Design Automation Conference. Proceedings, 2001, 2001, pp. 220–225.
[88] K. Skadron, P. S. Ahuja, M. Martonosi, and D. Clark, “Branch prediction, instruction-
window size, and cache size: Performance trade-offs and simulation techniques,” IEEE
Transactions on Computers, vol. 48, no. 11, pp. 1260–1281, 1999.
[89] A. Snavely and L. Carrington, “A framework for performance modeling and prediction,”
in ACM/IEEE Supercomputing, vol. 00, no. c, 2002, pp. 1–17.
[90] H. Soeleman and K. Roy, “Estimating circuit activity in combinational CMOS digital
circuits,” IEEE Design & Test of Computers, vol. 17, no. 2, pp. 112–119, 2000.
[91] Sourceware.org, “Gprof documentation,” 2013. [Online]. Available: http://sourceware.org/
binutils/docs/gprof/
[92] C. Strobl, J. Malley, and G. Tutz, “An introduction to recursive partitioning:
rationale, application, and characteristics of classification and regression trees,
bagging, and random forests.” Psychological methods, 2009. [Online]. Available:
http://psycnet.apa.org/journals/met/14/4/323/
[93] W. Sweldens, “The Lifting Scheme: A Construction of Second Generation Wavelets,” SIAM
Journal on Mathematical Analysis, vol. 29, no. 2, pp. 511–546, 1998.
184
[94] T. K. Tan, A. Raghunathan, G. Lakshminarayana, and N. K. Jha, “High-level software
energy macro-modeling,” Design Automation and Test in Europe, Proceedings, pp. 605–610,
2001.
[95] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans-
actions on Image Processing, vol. 9, no. 7, pp. 1158–1170, 2000.
[96] V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: A first step
towards software power minimization,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 2, no. 4, pp. 437 – 445, 1994.
[97] J. G. Tong, I. Anderson, and M. Khalid, “Soft-core processors for embedded systems,”
in International Conference on Microelectronics, 2006, pp. 170–173. [Online]. Available:
http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=4243676
[98] E. Touloupis, J. Flint, V. Chouliaras, and D. D. Ward, “Study of the Effects of SEU-Induced
Faults on a Pipeline Protected Microprocessor,” IEEE Transactions on Computers, vol. 56,
no. 12, pp. 1585–1596, Dec. 2007.
[99] G. Tutz, Regression for Categorical Data, 1st ed. Cambridge University Press, 2012.
[100] M. Violante, L. Sterpone, M. Ceschia, D. Bortolato, P. Bernardi, M. S. Reorda, and
A. Paccagnella, “Simulation-based analysis of SEU effects in SRAM-based FPGAs,” IEEE
Transactions on Nuclear Science, vol. 51, no. 6, pp. 3354–3359, Dec. 2004.
[101] Z. Wang, A. Sanchez, A. Herkersdorf, and W. Stechele, “Fast and Accurate Software
Performance Estimation during High-Level Embedded System Design,” Citeseer, pp. 2–7,
2008.
[102] L. Wu and W. Zhang, “A Model Checking Based Approach to Bounding Worst-Case
Execution Time for Multicore Processors,” ACM Transactions on Embedded Computing
Systems, vol. 11, no. August, 2012.
[103] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “The design and use of simple-
power: a cycle-accurate energy estimation tool,” in Proceedings of the 37th Annual Design
Automation Conference, ser. DAC ’00, 2000.
185
[104] P. Yiannacouras, J. G. Steffan, and J. Rose, “Exploration and Customization of FPGA-
Based Soft Processors,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 26, no. 2, pp. 266–277, 2007.
[105] R. Zafalon, M. Rossello, E. Macii, and M. Poncino, “Power macromodeling for a high
quality RT-level power estimation,” in Quality Electronic Design, 2000, pp. 59–63.
[106] K. M. Zick and J. P. Hayes, “High-level vulnerability over space and time to insidious
soft errors,” in 2008 IEEE International High Level Design Validation and Test Workshop.
Ieee, Nov. 2008, pp. 161–168.
[107] P. Zipf, H. Hinkelmann, L. Deng, M. Glesner, H. Blume, and T. G. Noll, “A Power
Estimation Model for an FPGA-Based Softcore Processor,” in Field Programmable Logic
Conference, Proceedings, 2007, 2007, pp. 171 – 176.
186
