Towards energy-performance trade-off analysis of parallel applications by Korthikanti, Vijay Anand R.
© 2011 by Vijay Anand Reddy Korthikanti. All rights reserved.
TOWARDS ENERGY-PERFORMANCE TRADE-OFF ANALYSIS OF PARALLEL
APPLICATIONS
BY
VIJAY ANAND REDDY KORTHIKANTI
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2011
Urbana, Illinois
Doctoral Committee:
Professor Gul A. Agha, Chair and Director of Research
Professor Laxmikant Kale
Professor Maria Gazaran
Professor Mark Greenstreet, University of British Columbia
Abstract
Energy consumption by computer systems has emerged as an important concern, both at the level of individual devices
(limited battery capacity in mobile systems) and at the societal level (the production of greenhouse gases). In parallel
architectures, applications may be executed on a variable number of cores and these cores may operate at different
frequencies. The performance and energy cost of a parallel algorithm executing on a parallel architecture have different
trade-offs, depending on how many cores the algorithm uses, at what frequencies these cores operate, and the structure
of the algorithm. The problem of defining metrics to quantify energy-performance trade-offs was posed as an important
open problem in a recent NSF workshop. Moreover, in a recent IEEE Computer article, Krishna Kant argues that “A
formal understanding of energy and computation trade-offs will lead to significantly more energy-efficient hardware
and software designs”.
We believe that examining the relation between the performance of parallel applications and their energy require-
ments on parallel processors can be facilitated by analyzing a set of metrics, each for a different purpose. These
metrics can provide programmers with intuitions about the energy required by different parallel applications, thus
guiding the choice of algorithm, architecture, the number of cores to use and the frequency at which to operate them.
Moreover, such metrics would help in the design of more energy-efficient algorithms. Towards this goal, we introduce
four energy-performance trade-off metrics, namely: (a) energy consumption under fixed performance, for energy
conservation in time-constrained applications; (b) energy-bounded performance, for improving performance in
energy-constrained applications; (c) energy efficiency, for energy-efficient computing; and (d) a cost metric, for reducing the
monetary cost associated with running the application.
We consider the optimization problems (corresponding to the four metrics) for problem instances, represented
as Directed Acyclic Graphs (DAGs), of parallel applications. Moreover, we extend the traditional notion
of scalability to our energy-performance trade-off metrics and propose corresponding scalability metrics. We propose
methodologies to evaluate the scalability metrics and illustrate them by analyzing different genres of algorithms,
such as low communication overhead (almost embarrassingly parallel) algorithms, dense matrix algorithms, sorting
algorithms, graph-based algorithms, and fast Fourier transform algorithms, for a message-passing model. We also consider
a shared-memory model (with a memory hierarchy). Three algorithms, namely addition, prefix sums, and Cole’s
mergesort, are analyzed for this shared-memory model.
To my parents and brothers, for their love and encouragement.
Acknowledgments
I would like to thank my advisor, Prof. Gul Agha, for his advice, support and intellectual guidance. He has always
been available and willing to discuss anything. He gave me full freedom and unconditional support to explore diverse
areas throughout my graduate education. More importantly, he has taught me the attitude for living.
I would also like to thank my committee members, Prof. Mark Greenstreet, Prof. Laxmikant Kale, and Prof.
Maria Gazaran for their invaluable guidance and feedback on my thesis research. I thank Prof. Mark Greenstreet for
enjoyable discussions about energy-performance trade-off metrics and for continually providing new perspectives and
suggestions. I thank Professor Laxmikant Kale and Prof. Maria Gazaran for their incisive questions and thoughtful
suggestions for my thesis.
I thank Prof. Mahesh Viswanathan for his constant mentoring and guidance over the last three years. He introduced
me to the whole world of probabilistic methods, and has always provided me with key insights and whole-hearted
support on my research and career goals. My research on probabilistic model checking considerably benefited from
his collaboration. I thank Prof. Madhusudan Parthasarathy for his constant guidance and feedback in my research.
His classes on automated software verification introduced me to the whole area of formal verification. I am grateful
to Dr. Sriram Rajamani and Dr. Aditya Nori for their collaboration on the general specification inference problem. I
thank Dr. Ashish Tiwari, and Dr. Sumit Gulwani for providing me an opportunity to work on an interesting new area
of automated tutoring.
I am also grateful to all members of the Open Systems Laboratory, present and past, with whom I had a chance to
interact. Rajesh Karmani and Vilas Jagannath have been my friends for a long time and I will always remember
the many discussions we have had. I would also like to thank Myungjoo Ham, Liping Chen, Kirill Mechitov, Sameer
Sundresh, Timo Latvala, Amin Shali, Parya Moinzadeh, Reza Shiftehfar, Peter Dinges, Ashish Vulimiri, and Minas
Charalambides. I thank the helpful staff at the Department of Computer Science, in particular, Andrea Whitesell,
Mary Beth Kelley, and Donna Coleman.
Last but not least, I am very grateful to my parents and brothers who have always given me unconditional love
and emotional support. No words can express my gratitude for their love and confidence in me.
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Energy-Performance Trade-off Metrics
1.3 Metric Optimization
1.4 Scalability Metrics
1.5 Thesis Outline
Chapter 2 Related Work
2.1 Dynamic Power Management
2.2 Energy-Performance Trade-off Metrics
2.3 Energy aware DAG Scheduling
2.4 Parallel Computation Models
2.4.1 Shared Memory Parallel Computation Models
2.4.2 Distributed Memory Parallel Computational Models
2.5 Scalability Metrics
Chapter 3 Models
3.1 Taxonomy of Parallel Architectures
3.1.1 Control Mechanism
3.1.2 Address-Space Organization
3.1.3 Interconnection Networks
3.2 Message-Passing Computation Model
3.2.1 Static Interconnection Networks
3.2.2 Communication Costs in Static Interconnection Networks
3.2.3 Basic Communication Patterns
3.3 Shared-Memory Computation Model
3.4 On-Core Time-Energy Model
Chapter 4 Directed Acyclic Graphs
4.1 DAG Model and Characteristics
4.2 Minimizing Energy under Iso-performance
4.2.1 SMT Formulation
4.2.2 Experimental Evaluation
4.3 Maximizing Performance under An Energy Budget
4.3.1 SMT Formulation
4.3.2 Experimental Evaluation
4.4 Maximizing Utility
4.4.1 SMT Formulation
4.4.2 Experimental Evaluation
Chapter 5 Scalability Metrics: Methodology
5.1 Energy Scalability under Iso-performance
5.2 Energy Bounded Scalability
5.3 Utility Based Scalability
Chapter 6 A Simple Case Study: Parallel Addition Algorithm
6.1 Energy Scalability under Iso-performance
6.2 Energy Bounded Scalability
6.3 Utility Based Scalability
Chapter 7 Dense Matrix Algorithms
7.1 Matrix Transposition
7.1.1 Checkerboard Partitioning
7.1.2 Striped Partitioning
7.2 Matrix-Vector Multiplication
7.2.1 Rowwise Striping
7.2.2 Checkerboard Partitioning
7.3 Matrix Multiplication
7.4 Summary
Chapter 8 Sorting
8.1 Sorting Networks: Bitonic Sort
8.2 Odd-Even Transposition Sort
8.3 Quick Sort
8.3.1 Naïve Parallel Quicksort
8.3.2 Efficient Parallel Quicksort
8.4 Sample Sort
8.5 Summary
Chapter 9 Graph Algorithms
9.1 Minimum Spanning Tree: Prim’s Algorithm
9.2 Single-Source Shortest Paths: Dijkstra’s Algorithm
9.3 All-Pairs Shortest Paths: Floyd’s Algorithm
9.4 Summary
Chapter 10 Fast Fourier Transform
10.1 The Binary-Exchange Algorithm
10.2 Two-Dimensional Transpose Algorithm
Chapter 11 Shared-Memory based Algorithms
11.1 Parallel Addition Algorithm
11.2 Parallel Prefix Sums
11.3 Parallel Merge Sort
11.4 Summary
Chapter 12 Conclusion
References
List of Tables
7.1 Scalability metrics of dense matrix parallel algorithms on 2D mesh interconnect
8.1 Scalability metrics of parallel sorting algorithms on 2D mesh interconnect
9.1 Scalability metrics of graph algorithms on 2D mesh interconnect
12.1 Scalability metrics of parallel algorithms on 2D mesh interconnect
List of Figures
1.1 Big Picture.
3.1 The PEM model
3.2 Message passing model
4.1 Energy minimization under iso-performance graph of FFT
4.2 Sensitivity of minimum energy under iso-performance to the ratio of communication to computation time
4.3 Sensitivity of minimum energy under iso-performance to the ratio of communication to computation energy
4.4 Time taken minimization under iso-energy graph of FFT
4.5 Sensitivity of minimum time-taken under iso-energy to the ratio of communication to computation time
4.6 Sensitivity of minimum time-taken under iso-energy to the ratio of communication to computation energy
4.7 Cost graph of FFT
4.8 Sensitivity of cost to the ratio of communication to computation time
4.9 Sensitivity of cost to the ratio of communication to computation energy
6.1 Parallel addition: Energy scalability under iso-performance in message passing model
6.2 Sensitivity of energy scalability under iso-performance of parallel addition algorithm to the ratio of communication to computation energy in message passing model
6.3 Parallel addition: Energy bounded scalability in message passing model
6.4 Sensitivity of energy bounded scalability of parallel addition algorithm to the ratio of communication to computation energy in message passing model
6.5 Parallel addition: Utility (cost) graph
6.6 Parallel addition: Utility based scalability (optimal number of cores)
6.7 Parallel addition: Utility based scalability (optimal frequency)
6.8 Sensitivity of utility based scalability (optimal number of cores) of parallel addition algorithm to the ratio of communication to computation energy
6.9 Sensitivity of utility based scalability (optimal frequency) of parallel addition algorithm to the ratio of communication to computation energy
6.10 Sensitivity of utility based scalability (optimal number of cores) of parallel addition algorithm to the ratio of cost of unit energy to the cost of unit time
6.11 Sensitivity of utility based scalability (optimal frequency) of parallel addition algorithm to the ratio of cost of unit energy to the cost of unit time
11.1 Parallel addition algorithm
11.2 Parallel addition: Energy scalability under iso-performance in PEM model
11.3 Sensitivity of energy scalability under iso-performance of parallel addition algorithm to the ratio of communication to computation energy in PEM model
11.4 Parallel addition: Energy bounded scalability in PEM model
11.5 Sensitivity of energy bounded scalability of parallel addition algorithm to the ratio of communication to computation energy in PEM model
11.6 Parallel prefix sums: Energy scalability under iso-performance in PEM model
11.7 Sensitivity of energy scalability under iso-performance of parallel prefix sums algorithm to the ratio of communication to computation energy in PEM model
11.8 Parallel prefix sums: Energy bounded scalability in PEM model
11.9 Sensitivity of energy bounded scalability of parallel prefix sums algorithm to the ratio of communication to computation energy in PEM model
Chapter 1
Introduction
1.1 Motivation
Computers are consuming an increasingly significant amount of energy: in the US alone, one estimate already
puts the figure at 13% of total electricity usage [74]. Thus, computer use represents a significant source of
greenhouse gases and creates a critical problem for sustainability [43]. In addition, limited energy storage capacity has
become a critical problem for mobile devices. Thus, energy conservation needs to be a first-class objective
in algorithm design; traditionally, algorithms have been designed to maximize performance alone.
As CPUs hit the power wall, parallel (multicore) architectures have been proposed as a way to increase computation
cycles while keeping power consumption constant. In parallel architectures, it is possible to
scale the speed of individual processors or leave them idle, thus reducing their energy consumption. Specifically,
these processors support Dynamic Voltage and Frequency Scaling (DVFS), which enables processors
to be operated at multiple frequencies under different supply voltages. Because the relation between the power
and frequency of a core is nonlinear, energy consumption on a sequential processor can be reduced by lowering the
frequency at which the processor runs. However, lowering the frequency of a uniprocessor increases the time
taken by an algorithm (i.e., decreases performance). The picture is more complex in the case of parallel processors.
Parallel computing involves some serial subcomputations, some parallel computation, and interaction between the
parallel subcomputations. Thus parallel performance and energy cost depend not only on the number of cores
(and the frequency at which they operate) but also on the structure of the parallel algorithm.
By increasing the number of cores, the computation that is required at each core may be reduced, which may
in turn improve performance. Specifically, for parallel applications in which there is no interaction between the
computations at each core, doubling the number of cores halves the computation required per core. If the frequency of each of two
slower cores is ≈ 0.8 of the frequency of a faster core, the two cores together consume about the same amount of energy per unit time as
the faster core, while their overall performance is about 60% higher. However, parallel algorithms in general involve
interaction between subcomputations. Thus, as the number of cores increases, the need for interaction between cores
also increases. This in turn means that more energy is required for interaction. In particular, if we fix an energy budget,
using more cores eventually leads to a decrease in the amount of energy left for computation at each core; in turn, this
means that the cores have to run at lower speeds, thereby decreasing performance. Similarly, if we fix the performance
bound, using more cores leads to a decrease in energy for computation (by scaling cores appropriately); in turn, this
leads to an increase in energy for communication (Figure 1.1).
Figure 1.1: Big Picture.
Note that optimizing the energy consumed rather than the performance of a parallel application will often lead to
different conclusions about which algorithm, placement, and scheduling strategy to use. For one, power consumption
(i.e., the rate of energy consumption) by a core is (typically) proportional to the cube of its frequency. For another, the
energy and performance characteristics of communication (and of shared memory accesses) and computation differ
between different parallel algorithms. For example, in some applications, communication time may be masked by
overlapping communication and computation (e.g., see [4]). However, the energy required for communication may
be unaffected, whether or not such overlapping can be done. Thus the degree of parallelism used for performance
maximization (under an energy budget) will generally differ from that required for energy conservation (under a time
constraint).
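The two-core example given earlier in this section can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the cubic power-frequency relation mentioned above, perfectly parallel work, and no communication or idle power. The function names and the normalized units are hypothetical and are not taken from the models used later in the thesis.

```python
def power(freq: float, cores: int = 1) -> float:
    """Total dynamic power of `cores` cores at relative frequency `freq`.

    Assumption: power per core is proportional to freq**3 (normalized units).
    """
    return cores * freq ** 3


def throughput(freq: float, cores: int = 1) -> float:
    """Aggregate cycles per unit time, assuming no interaction between cores."""
    return cores * freq


def energy_for_work(work: float, freq: float, cores: int = 1) -> float:
    """Energy to execute `work` cycles: total power multiplied by elapsed time."""
    return power(freq, cores) * work / throughput(freq, cores)


# One core at full frequency versus two cores at 0.8 of that frequency.
one_fast = (power(1.0), throughput(1.0), energy_for_work(1.0, 1.0))
two_slow = (power(0.8, 2), throughput(0.8, 2), energy_for_work(1.0, 0.8, 2))

print("power ratio:", two_slow[0] / one_fast[0])
print("throughput ratio:", two_slow[1] / one_fast[1])
print("energy ratio for equal work:", two_slow[2] / one_fast[2])
```

Under these assumptions the two slower cores draw roughly the same power as the single fast core (a ratio of about 1.02) while delivering about 1.6 times the throughput, so the energy spent on a fixed amount of work falls to about 0.64 of the single-core value. Communication costs, which this sketch ignores, are what erode this advantage as the number of cores grows.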
Although we consider two specific models of computation and a particular range of architectural parameters
(communication/computation ratios for performance and energy consumption), our methodology is general. The approach
can be applied to other models of computation, whether these models are idealized or realistic. Our methodology
may be used to select algorithms for particular architectures. It may also be used to guide the design of
architectures, particularly application-specific ones.
1.2 Energy-Performance Trade-off Metrics
The problem of defining metrics to quantify energy-performance trade-offs was posed as an important open problem
in a recent NSF workshop [13]. In a recent IEEE Computer article, Krishna Kant argues, “A formal understanding of
energy and computation trade-offs will lead to significantly more energy-efficient hardware and software designs” [52].
Examining the relation between the performance of parallel applications and their energy requirements on parallel
processors can be facilitated by analyzing a set of metrics, each for a different purpose. These metrics can provide
programmers with intuitions about the energy required by different parallel applications, thus guiding the choice of
algorithm, architecture, the number of cores to use and the frequency at which to operate them. Moreover, such
analyses would help in the design of more energy efficient algorithms.
Consider battery-operated portable and embedded systems; in such systems, energy consumption is vitally important.
By lowering energy consumption, battery life and mission duration are extended and more capabilities can be
included in a device for the same battery capacity. Many applications for these computer systems have time constraints
that must be satisfied during the application’s execution. An example of such a time-sensitive application is an MPEG
decoder, which displays movies at a given frame rate. Other time-constrained applications include digital speech
coding, Doppler radar based cruise control, collision warning for automobiles, face screening, live maps for driving
directions, voice recognition, highway sensor networks for telematics, and automatic target recognition. Thus, there
is a need for a metric that estimates energy consumption under fixed performance. Moreover, the dual metric,
energy-constrained performance, is desirable when battery power is restricted or the energy is budgeted. If neither
energy nor time is restricted, then the ratio of performance to the total energy consumed (energy efficiency) may be an
appropriate metric.
We also consider a cost metric which is a monotonically decreasing function of performance, and a monotonically
increasing function of the amount of energy used in executing the application. In order to make our analysis concrete,
we assume that cost is a linear function of the energy consumed by an application. Our simplifying assumption is
quite reasonable. Energy is often linearly priced per unit (if environmental costs were assessed through a carbon tax,
that too would be factored in the price of the energy). Moreover, we assume that cost is also a linear function of the
time taken in executing an application. This assumption is quite reasonable in some contexts. If applications are run
on a commercial supercomputer or in a cloud, the amount one pays a vendor is roughly linear with the time used.
1.3 Metric Optimization
We will be interested in finding the appropriate number of cores and the frequency at which to operate them to
minimize some energy-performance trade-off metric. Let a problem instance be the execution of a given parallel
application for a given input size. We propose the following four metric optimization problems corresponding to the
four energy-performance trade-off metrics described above:
Minimize Energy under Iso-Performance: For a given problem instance and a fixed performance requirement, de-
termine the optimal number of cores and the frequency at which these cores should operate so as to minimize the
energy consumption.
Maximize Performance under Iso-Energy: For a given problem instance and a fixed energy budget, determine the
optimal number of cores and their operating frequencies in order to maximize performance.
Maximize Energy Efficiency: For a given problem instance, determine the optimal number of cores and their frequencies
for maximizing energy efficiency (the performance/energy ratio).
Maximize Utility: For a given problem instance, determine the optimal number of cores and their frequencies that
minimize the cost function C(P,X) associated with running the problem instance, given that cost is defined as
follows:
C(P,X) = a · E(P,X) + b · T(P,X) (1.1)
where E(P,X) and T(P,X) represent, respectively, the energy consumed by the parallel algorithm and the time taken
by it, a denotes the cost associated with consuming a fixed unit of energy, and b denotes the cost associated with using
(e.g., renting) a parallel computer for a fixed unit of time.
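To make the last of these problems concrete, the following minimal sketch evaluates the cost function of Equation (1.1) over a small grid of core counts and frequencies and picks the cheapest configuration. Everything about the time and energy model here is an assumption made for illustration (cubic power per core, a tree-style combining phase with log2(n) message steps and n - 1 messages, and the constants K_POWER, T_MSG and E_MSG); it is not the on-core or communication model developed in the thesis.

```python
import math
from itertools import product

# Hypothetical toy model: `work` compute cycles split evenly over n cores that
# all run at the same relative frequency f.
K_POWER = 1.0   # assumed: power per core = K_POWER * f**3
T_MSG = 5.0     # assumed: time per message step on the critical path
E_MSG = 20.0    # assumed: energy per message


def time_taken(work, n, f):
    """Computation time per core plus log2(n) communication steps."""
    return work / (n * f) + T_MSG * math.log2(n)


def energy_used(work, n, f):
    """Computation energy of all n cores plus energy of n - 1 messages."""
    compute_time_per_core = work / (n * f)
    return n * K_POWER * f ** 3 * compute_time_per_core + E_MSG * (n - 1)


def cost(work, n, f, a, b):
    """Equation (1.1): cost = a * energy + b * time."""
    return a * energy_used(work, n, f) + b * time_taken(work, n, f)


def minimize_cost(work, core_counts, frequencies, a, b):
    """Exhaustive search over the candidate (cores, frequency) pairs."""
    return min(product(core_counts, frequencies),
               key=lambda nf: cost(work, nf[0], nf[1], a, b))


cores = [1, 2, 4, 8, 16, 32, 64]
freqs = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
best_n, best_f = minimize_cost(10_000, cores, freqs, a=1.0, b=10.0)
print(best_n, best_f, cost(10_000, best_n, best_f, 1.0, 10.0))
```

Changing the ratio of a to b shifts the optimum: a larger weight on energy favors lower frequencies, while a larger weight on time favors higher frequencies and, up to the point where communication dominates, more cores. The other three problems can be explored with the same two functions by constraining time or energy instead of combining them.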
In practice, problem instances of parallel applications in scientific and engineering fields are often modeled as
Directed Acyclic Graphs (DAGs). In particular, the precedence relationships among the tasks of a problem instance
of a parallel application are modeled as a DAG. A DAG model consists of nodes that represent computations and edges
that represent the dependencies between the nodes. The problem of scheduling DAG-based applications on both
homogeneous and heterogeneous computing systems has been studied extensively over the past few decades [89, 95, 23].
However, most efforts in task scheduling have focused on minimizing application completion time. Only
recently has much attention been paid to energy consumption in scheduling, particularly on high-performance
computing systems. Various techniques, including dynamic voltage/frequency scaling and dynamic power management,
have been studied [44]. Previous efforts addressing the energy-performance trade-offs of DAGs mainly concentrated
on the energy consumption of the processing elements [54, 50, 51, 84, 82, 64, 92]. However, in parallel and
distributed systems with interacting components, communication interfaces additionally contribute to the overall energy
consumption and now account for a large share of the system energy. We strongly believe that analyzing DAG models,
with an appropriate energy model, for our metrics will enable the generation of optimal schedules, each for a different
purpose. To this end, in Chapter 4 we solve the above optimization problems (finding optimal schedules) for DAG-based
applications with the help of sophisticated constraint solvers. Furthermore, we analyze how sensitive
the optimal schedules are to the structural characteristics of the corresponding DAGs.
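As a small illustration of the DAG model described above, the sketch below represents a task graph as a mapping from each task to the tasks it depends on and computes the earliest finish time of every task. The task names, the per-task work, and the per-edge communication delays are hypothetical, and the sketch assumes unlimited cores running at unit frequency; it is not the SMT-based formulation used in Chapter 4.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on.
deps = {
    "load": set(),
    "fft_a": {"load"},
    "fft_b": {"load"},
    "combine": {"fft_a", "fft_b"},
}
# Assumed computation cost per task (nodes) and delay per dependency (edges).
work = {"load": 4, "fft_a": 10, "fft_b": 6, "combine": 3}
comm = {("load", "fft_a"): 1, ("load", "fft_b"): 1,
        ("fft_a", "combine"): 2, ("fft_b", "combine"): 2}


def earliest_finish(deps, work, comm):
    """Earliest finish time of every task with unlimited cores at unit
    frequency, charging the communication delay on every edge."""
    finish = {}
    # static_order() yields each task only after all of its predecessors.
    for task in TopologicalSorter(deps).static_order():
        ready = max((finish[p] + comm[(p, task)] for p in deps[task]), default=0)
        finish[task] = ready + work[task]
    return finish


finish = earliest_finish(deps, work, comm)
critical_path_length = max(finish.values())
print(finish, critical_path_length)
```

The critical path length computed this way is a lower bound on completion time at unit frequency. Tasks that finish well before their successors need them have slack, and this slack is what a frequency-scaling schedule can trade for lower energy.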
1.4 Scalability Metrics
Scalability reflects a parallel system’s ability to utilize increasing processing resources efficiently [61]. Using more
processors introduces communication overhead. By increasing problem size, typically more computation can be done
locally; thus, overhead is minimized. A scalable parallel system is one in which a performance metric is always
improved as the number of processors is increased, provided that the problem size is also increased. It is useful to
determine the rate at which the problem size must increase with respect to the number of processors to keep the metric
optimized. Here, problem size is defined to be the number of basic computation steps in the best sequential algorithm
to solve the problem on a single processor.
For different parallel applications, the problem size must increase at different rates in order to maintain an optimal
metric as the number of processors is increased. This rate determines the degree of scalability of the parallel application.
For instance, in some cases, problem size might need to grow as an exponential function of the number of processors
to keep the metric optimized as the number of processors increases. Such parallel applications are poorly scalable. The
reason is that for these parallel applications it is difficult to optimize the metric for a large number of processors unless
the problem size is enormous. On the other hand, if the problem size needs to grow only linearly with respect to
the number of processors, then the parallel application is highly scalable. That is because it can easily scale (up/down)
the metric in proportion to the number of processors for reasonable problem sizes.
Traditionally, the scalability of a parallel algorithm has been associated with a parallel efficiency metric (e.g.,
speedup) [61]. We believe this notion of scalability can be applied to our energy-performance trade-off metrics.
We propose four scalability metrics corresponding to our four energy-performance trade-off metrics for quantitatively
determining the four degrees of scalability of a parallel application (algorithm), each for a different purpose. Here, for
convenience, we frame the scalability metric to be the optimal number of cores required to optimize a metric as a function
of problem size. This definition is the inverse of how the problem size grows with an increasing number of processors to keep the
metric optimized. Thus, a higher value for the scalability metric indicates that the application can make effective use
of more cores, and is more scalable than an application with a small value for the metric.
Conversely, under this inverted definition, the smaller the scalability metric is, the poorer the scalability.
We define the following scalability metrics:
Energy scalability under Iso-performance: Given a parallel algorithm, an architecture model, and a performance
requirement, energy scalability under iso-performance provides the optimal number of cores required to minimize the
energy consumption as a function of the problem size [56, 58].
Energy-bounded Scalability: Given a parallel algorithm, an architecture model, and a fixed energy budget, energy-
bounded scalability provides the optimal number of cores required to maximize performance as a function of the
problem size [57].
Energy Efficient Scalability: Given a parallel algorithm and an architecture model, energy efficient scalability pro-
vides the optimal number of cores required and the frequency at which these cores should operate in order to maximize
energy efficiency (the performance/energy ratio) as a function of the problem size.
Utility based Scalability: Given a parallel algorithm and an architecture model, utility based scalability provides the
optimal number of cores and their frequencies that minimize the cost function C(P,X) (Eq. 1.1) as a function of the
problem size [55].
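To make the first definition concrete, the following minimal sketch computes the number of cores that minimizes energy under a performance requirement for parallel addition. The cost model here is a hypothetical toy and is not the energy model of Section 3.4: it assumes a tree reduction over a power-of-two number of cores, compute energy per operation proportional to the square of the frequency, a fixed energy and time per message, and a constant idle power per core. All function names and constants are illustrative assumptions.

```python
import math


def energy_at(n, p, deadline, e_op=1.0, e_msg=10.0, p_idle=0.01, t_msg=5.0, f_max=1.0):
    """Toy energy of adding n numbers on p cores within `deadline`; None if infeasible.

    All constants are hypothetical placeholders chosen for illustration.
    """
    rounds = int(math.log2(p))              # tree-reduction rounds (p is a power of two)
    critical_cycles = n / p + rounds        # local additions plus one addition per round
    compute_time = deadline - t_msg * rounds
    if compute_time <= 0:
        return None
    freq = critical_cycles / compute_time   # slowest frequency that still meets the deadline
    if freq > f_max:
        return None
    compute = n * e_op * freq ** 2          # about n additions in total, energy per op ~ f^2
    messages = (p - 1) * e_msg              # one message per merged partial sum
    idle = p_idle * p * deadline            # idle/static draw of all cores until the deadline
    return compute + messages + idle


def optimal_cores(n, deadline, max_log_p=10):
    """Power-of-two core count minimizing the toy energy for problem size n."""
    feasible = {}
    for k in range(max_log_p + 1):
        energy = energy_at(n, 2 ** k, deadline)
        if energy is not None:
            feasible[2 ** k] = energy
    return min(feasible, key=feasible.get) if feasible else None
```

Under these assumed constants, evaluating `optimal_cores` for increasing problem sizes at a fixed deadline traces out the scalability metric as a function of problem size; a larger problem generally justifies more cores before message and idle energy dominate.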
We evaluate scalability metrics of parallel algorithms based on two standard models of parallel computation,
namely the shared memory model and the message-passing model. In order to facilitate reasoning about the energy
consumption of parallel algorithms, we associate an energy model (Section 3.4) with these two parallel computation
models. Moreover, we propose methodologies to evaluate the scalability metrics and illustrate them by analyzing different
types of algorithms ranging from embarrassingly parallel to those with a strong sequential component. Specifically,
we analyze a parallel addition algorithm, dense matrix algorithms, sorting algorithms, graph based algorithms,
and fast Fourier transform algorithms for the message passing model; and addition, prefix sums and Cole’s mergesort
for the shared memory model. Furthermore, for each of the parallel algorithms, we analyze how sensitive the scalabil-
ity metrics are to changes in parameters such as the ratio of the energy required for a computational operation versus
the energy required for communicating a unit message.
1.5 Thesis Outline
The rest of the dissertation is organized as follows. In Chapter 2 we briefly review the literature related to the work in
this dissertation. In Chapter 3, so as to facilitate the energy-performance trade-off analysis of parallel algorithms, we
describe parallel computation models along with an on-core energy model for both shared memory architectures and
message passing architectures.
In Chapter 4, we solve the metric optimization problems corresponding to the four energy-performance trade-off
metrics (Section 1.3) for DAG based applications with the help of sophisticated constraint solvers. Furthermore, we
consider sensitivity of the optimal schedules to both architectural parameters and DAG parameters.
In Chapter 5, we explain our methodology to evaluate energy-performance trade-off scalability metrics. Chapter 6
illustrates our methodology for a simple parallel addition algorithm based on a message passing computational model.
In Chapter 7, we evaluate our scalability metrics for several dense matrix algorithms using a message passing
model. Specifically, we analyze parallel algorithms for matrix transposition, matrix-vector multiplication, and matrix-
matrix multiplication. Moreover, for the first two problems, we consider two parallel algorithms which differ in the
way matrices are partitioned on the cores.
In Chapter 8, we analyze both comparison and non-comparison based sorting algorithms for our scalability metrics.
In the comparison based algorithm category, we analyze bitonic sort, odd-even transposition sort and two different
versions of quicksort. Furthermore, we analyze the non-comparison based sample-sort algorithm.
In Chapter 9, we evaluate our scalability metrics for the following graph algorithms: Prim’s minimum spanning
tree, Dijkstra’s single source shortest path and Floyd’s all-pairs shortest paths. Chapter 10 evaluates two well known
parallel algorithms for the Fast Fourier Transform problem using our scalability metrics.
In Chapter 11, we consider parallel algorithms based on our shared memory computational model. Specifically, we
consider parallel addition, prefix sums and Cole’s mergesort algorithms. Finally, we conclude with plans for future
work in Chapter 12.
Chapter 2
Related Work
We briefly describe the related work and compare it to the work presented in this dissertation. The body of related
work may be loosely classified into five broad categories: dynamic power management, energy-performance trade-off
metrics, energy aware DAG scheduling, parallel computation models, and scalability metrics.
2.1 Dynamic Power Management
Previous research has studied software-controlled dynamic power management in multicore processors. Researchers
have taken two approaches for dynamic power management. Specifically, they have used one or both of two control
knobs for runtime power performance adaptation: namely, dynamic concurrency throttling, and dynamic voltage and
frequency scaling [22, 66, 45, 80, 31]. Dynamic voltage and frequency scaling (DVFS) has been proven to be a feasible
solution to reduce processor power consumption [40, 41, 69]. By lowering processor clock frequency and supply
voltage during some time slots, for example, idle or communication phases, large reductions in power consumption
can be achieved with only modest performance losses. Moreover, the DVFS techniques have been applied in the
high performance computing fields, for example, in large data centers, to reduce power consumption and achieve high
reliability and availability [37, 18, 19]. Dynamic concurrency throttling (DCT), whereby the level of concurrency is
adapted at runtime based on execution properties, is a software-controlled mechanism, or knob, for runtime power
performance adaptation on systems with multi-core processors.
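As a rough illustration of why DVFS can trade a modest slowdown for a large reduction in power, the sketch below uses the textbook CMOS approximation that dynamic power scales as C·V²·f and assumes, for simplicity, that supply voltage scales linearly with frequency. The model, the function names, and the constants are assumptions made for illustration; they are not taken from the works cited above.

```python
def dynamic_power(freq, capacitance=1.0, volts_per_hz=1.0):
    """Dynamic power under P = C * V^2 * f, with V assumed proportional to f."""
    voltage = volts_per_hz * freq
    return capacitance * voltage ** 2 * freq


def run_phase(cycles, freq):
    """Return (time, energy) for executing `cycles` cycles at frequency `freq`."""
    time = cycles / freq
    energy = dynamic_power(freq) * time
    return time, energy


# Hypothetical phase of 1000 cycles at full speed and at 80% of full speed.
t_full, e_full = run_phase(1000, 1.0)
t_slow, e_slow = run_phase(1000, 0.8)
slowdown = t_slow / t_full                               # time grows as 1/f
energy_ratio = e_slow / e_full                           # energy shrinks as f^2 in this model
power_ratio = dynamic_power(0.8) / dynamic_power(1.0)    # power shrinks as f^3 in this model
```

In this idealized model a 25 percent longer phase roughly halves dynamic power; static power, which the sketch ignores, reduces the benefit in practice.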
One line of prior research considers CMP designs with different core configurations to more efficiently accommo-
date different application requirements. Kumar et al. [59, 60] propose heterogeneous CMPs in terms of different core
complexities. They show that such an approach improves power consumption and can provide better performance-
area trade-offs. On the other hand, Ghiasi [32] explores heterogeneity with cores operating at different frequencies.
They show that such heterogeneous systems offer improved management of thermal emergencies. These works focus
on heterogeneous CMP designs and leverage OS support to assign applications to better suited cores.
Oliver et al. describe a multiple clock domain, tile based architecture that assigns different tile columns to different
voltage and frequency domains [79]. This work targets parallelized signal processing applications that are decomposed
for execution on different tile columns. Each column can be run at different speeds to meet preset target rates for the
applications with the final goal of reducing power. Juang et al. also look at CMPs with dynamically configured voltage
and frequencies [47]. However, their work adjusts individual core executions to improve power-performance trade-
offs by balancing inter-thread producer-consumer rates. This work employs some predictive strategies to estimate
program behavior at different power modes to help guide an energy-aware scheduler.
Another line of prior work considers dynamic management under power budget constraints. Grochowski et al.
discuss latency and throughput trade-offs under chip power constraints [38]. They survey different adaptation tech-
niques and suggest asymmetric cores with DVFS as the most promising alternative. In a related study, Annavaram et
al. consider a real implementation example with an asymmetric multiprocessor (AMP) [7]. They consider both static
and dynamic AMP configurations and improve performance of multithreaded applications under fixed power budgets.
These works discuss adaptations in response to the available thread parallelism in applications. Li and Martinez in-
vestigate methods to identify the optimal operating point on a CMP in terms of number of active cores and DVFS
settings for parallel applications [67]. They consider dynamic configurations under limited performance and power
constraints and develop analytical models for attainable speedups. They consider the application of chip-wide DVFS
to manage parallel regions of applications and perform a few explorations guided by heuristics to reach an optimal
operating point.
Kadayif et al. propose to shut down idle processors in order to save energy when running nested loops on a
CMP [49]. The authors also study a pre-activation strategy based on compiler analysis to reduce the wake-up overhead
of powered-off processors. Although they address program granularity and power, they do not exploit DVFS in their
solution. In a different work, Kadayif et al. propose using DVFS to slow down lightly loaded threads, to compensate
for load imbalance in a program and save power and energy [48]. They use the compiler to estimate the load imbalance
of array-based loops on single-issue processor cores.
In the context of cache-coherent shared-memory multiprocessors, Moshovos et al. reduce energy consumption by
filtering snoop requests in a bus-based parallel system [75]. Li et al. propose saving energy wasted in barrier spin-waiting
by predicting a processor’s stall time and, if warranted, forcing it into an appropriate ACPI-like low-power
sleep state [68]. This work does not consider the number of processors, and does not address power consumption
during useful activity by the parallel application.
In an environment of loosely-coupled web servers running independent workloads, several studies evaluate dif-
ferent policies to control the number of active servers (and thus their performance level) to preserve power while
maintaining acceptable quality of service [28, 27, 81]. Elnozahy et al. evaluate policies that employ various com-
binations of independent and coordinated dynamic voltage/frequency scaling, and node vary-on/vary-off, to reduce
the aggregate power consumption of a web server cluster during periods of reduced workload [28]. They evaluate
the policies with simulations, and show that the combination of coordinated voltage/frequency scaling and node vary-
on/vary-off obtains the largest power savings. They only consider dynamic power in their simulations.
In the context of micro-architectures, Heo and Asanovic study the effectiveness of pipelining as a power-saving
tool in a uniprocessor [39]. They examine the relationship between the logic depth per stage and the supply voltage
in deep submicron technology under different conditions. This is complementary to our work, since we study energy-
performance issues of using multiple cores on a parallel processor.
The above body of work provides runtime tools for energy management. By contrast, we develop
methods for theoretically analyzing parallel algorithms in order to statically determine the scalability metrics. One
advantage is that one may be able to choose the right algorithm and resources a priori. Another advantage of our
approach is that it can provide greater insight into the design of algorithms for energy conservation.
2.2 Energy-Performance Trade-off Metrics
Bingham and Greenstreet used an ET^α metric to analyze energy-time trade-offs for hardware implementations of
arithmetic and sorting algorithms using a model inspired by properties of CMOS hardware [10]. Prior to this, various
researchers promoted the use of the ET [35] and ET^2 [72] metrics for modeling the trade-offs. These models present
an abstraction of voltage/frequency scaling to the system designer to enable reasoning about the overall energy/time
trade-offs of the computation. However, there is a significant conceptual gap between these hardware inspired models
and the programming paradigms used to code parallel algorithms. Rather than directly modeling the physical
limits of CMOS circuits, our work presents models that reflect current parallel computers and how they are programmed.
Moreover, single-objective metrics such as ET can be useful but often do not explicitly relate to the
desired performance level. For example, the minimum energy-delay product (ET metric) may be achieved by setting the
energy-management knobs such that performance is 20 percent lower than without any energy management, but such
a degradation may be unacceptable.
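The point that a single-objective metric can settle on an unintended performance level can be sketched numerically. The toy model below is an assumption for illustration only: execution time is inversely proportional to frequency, dynamic energy per cycle grows with the square of the frequency, and a constant static power (the hypothetical value 0.25) is drawn for the whole run.

```python
def energy_time(freq, cycles=1.0, static_power=0.25):
    """Toy model: dynamic energy per cycle ~ f^2, plus static power drawn for the whole run."""
    time = cycles / freq
    energy = cycles * freq ** 2 + static_power * time
    return energy, time


def et_alpha(freq, alpha):
    """Evaluate the E * T^alpha metric at one normalized operating frequency."""
    energy, time = energy_time(freq)
    return energy * time ** alpha


# Candidate normalized frequencies 0.10, 0.15, ..., 1.00.
freqs = [round(0.10 + 0.05 * i, 2) for i in range(19)]
best_et = min(freqs, key=lambda f: et_alpha(f, 1))    # operating point minimizing E * T
best_et2 = min(freqs, key=lambda f: et_alpha(f, 2))   # operating point minimizing E * T^2
```

With these assumed constants, the ET optimum falls at 0.8 of the maximum frequency, while ET^2 favors the maximum frequency; neither metric takes the performance level actually required as an input.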
Li and Martinez develop an analytical model relating the power consumption and performance of a parallel application
running on a multicore processor [65]. This model considers parallel efficiency, granularity of parallelism,
and voltage/frequency scaling in relating power consumption and performance. However, the model does not consider
total energy consumed by an entire parallel application, or even the structure of the parallel algorithm. Instead, it is
assumed that the algorithmic structure (communication and computation) of a parallel algorithm can be represented
by a parallel efficiency metric. A generic analysis based on this metric is then used–irrespective of the algorithmic
structure.
Wang and Ziavras have analyzed performance-energy trade-offs for matrix multiplication on an FPGA based
mixed-mode chip multiprocessor [93]. Their analysis is based on a specific parallel application executed on a specific
multiprocessor architecture. In contrast, our general methodology of evaluating energy scalability can be used for a
broad range of parallel applications and multicore architectures.
Cho and Melhem studied the interaction between parallelization and energy consumption in a parallelizable application
[17]. Given the ratio of the serial and parallel portions of an application and the number of processors, they derive
the optimal frequencies allocated to the serial and parallel regions in the application to minimize the total energy
consumption, while the execution time is preserved. This analysis is less detailed compared to our energy scalabil-
ity analysis in the sense that they divide the whole parallel application execution into serial and parallel regions and
express total energy as a function of the length of these regions. In other words, they do not consider the structure
(shared memory synchronization and computation) and problem size of the parallel application.
Some researchers have considered the metric αE + T (termed ’flow plus energy’) [15, 94, 5] to model the
energy-performance trade-offs. Note that this metric is a specific instance of our metric, utility based scalability. However,
these researchers have studied a different problem: given a set of independent processes, with no communication, and
a set of processors, how can we develop a policy to assign processes to processors and scale the speed of the processors
so that the assignment will satisfy the objective of optimizing ’flow plus energy’? In this dissertation, we analyze a
single parallel algorithm and explicitly consider its potentially complex messaging structure to compute the ’flow plus
energy’ of interacting processes.
2.3 Energy aware DAG Scheduling
The problem of scheduling DAG based applications on both homogeneous and heterogeneous computing systems
has been studied extensively over the past few decades [89, 95, 23]. However, most efforts in task scheduling
have focused on minimization of application completion time. It is only recently that much attention has been paid to
energy consumption in scheduling, particularly for high-performance computing systems. DAG scheduling algorithms
are typically classified into two subcategories: static scheduling algorithms and dynamic scheduling algorithms. In
static task scheduling algorithms, the task assignment to resources is determined before applications are executed.
Information about task execution cost and communication time is assumed to be known at compilation time. Static
task scheduling algorithms normally are non-preemptive - a task is always running on the resource to which it is
assigned [86]. Dynamic task scheduling algorithms assign tasks to resources during execution to achieve load balance
among processors [25].
Many heuristic based solutions exist for the NP-hard static scheduling problem for parallel processors. An efficient
static DAG scheduling method that uses Min-Min and Max-Min selective scheduling and meta-heuristics for independent
tasks, and that groups tasks based on their dependency relationships, can improve the speedup, make-span and efficiency of a
system [14]. The list scheduling algorithm is the most popular algorithm for static scheduling [76]. List based
scheduling algorithms assign priorities to tasks and sort tasks into a list ordered in decreasing priority. Then tasks are
scheduled based on the priorities.
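A minimal sketch of list based scheduling is given below. It makes several assumptions that the text above does not fix: priorities are the "bottom level" of each task (the longest cost path from the task to an exit task), communication costs are ignored, processors are identical, and all task costs are positive. The function name and the example graph are hypothetical.

```python
def list_schedule(cost, succ, num_procs):
    """Greedy list scheduling; returns {task: (processor, start, finish)}."""
    # Priority = bottom level: longest cost path from the task to any exit task.
    level = {}

    def bottom_level(t):
        if t not in level:
            level[t] = cost[t] + max((bottom_level(s) for s in succ.get(t, [])), default=0)
        return level[t]

    preds = {t: [] for t in cost}
    for t, children in succ.items():
        for c in children:
            preds[c].append(t)

    # With positive costs, decreasing bottom level is also a valid topological order.
    order = sorted(cost, key=bottom_level, reverse=True)

    proc_free = [0.0] * num_procs
    schedule = {}
    for t in order:
        # A task may start only after all of its predecessors have finished.
        ready = max((schedule[p][2] for p in preds[t]), default=0.0)
        # Pick the processor on which the task can start earliest.
        proc = min(range(num_procs), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[proc], ready)
        schedule[t] = (proc, start, start + cost[t])
        proc_free[proc] = start + cost[t]
    return schedule


# Hypothetical diamond-shaped DAG: a -> {b, c} -> d.
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
schedule = list_schedule(cost, succ, num_procs=2)
makespan = max(finish for _, _, finish in schedule.values())
```

For this example the make-span equals the length of the critical path a, b, d, since the two independent tasks b and c run on different processors.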
Energy aware DAG scheduling is determined by the task assignment (mapping tasks to processors), and the fre-
quency selection (which cycles of each task use which frequency level). Present approaches [54, 50, 51, 84, 82, 64, 92]
solve the energy/make-span minimization problem for DAGs on DVFS enabled processors in a two step process. The
first step maps tasks to processors based on the computation time at the maximum voltage with the goal of minimizing
the make-span. The schedule generated from this process is not complete because there may be slack until the deadline.
The second step allocates slack to each task so that the total energy consumption is minimized while the deadlines are
met. However, task assignment without consideration of frequency selection may not present the best energy saving
potential. Furthermore, the approach, where task assignment is computed beforehand, evaluates eligible tasks one
by one and fails to recognize the joint effect of slowing down certain tasks simultaneously. The joint effect is critical
in finding globally optimal frequency settings. To solve this problem, we use an SMT solver based framework to
encode both task assignment and frequency selection for minimizing energy consumption or the make-span, thereby
generating globally optimal schedules.
2.4 Parallel Computation Models
Following their historical development, parallel computational models can be classified into two types based on
the memory model of the underlying parallel computer. The first type is the shared memory parallel computational model, such
as the PRAM model [29]; the second type is the distributed memory parallel computational model, such as the BSP
model [90] and the LogP model [21].
2.4.1 Shared Memory Parallel Computation Models
The PRAM model consists of a collection of RAM processors that load and store data through one common large
global memory. Concurrent access to the common memory is allowed and takes only one unit of time.
The RAM processors execute instructions concurrently in unit time and in lock-step, with automatic and free
synchronizations. There are several variations of the PRAM model which make different assumptions on concurrent
memory access arbitration. The least restrictive variation is the CRCW PRAM model, which allows concurrent read
and concurrent write to the same memory cell with a mechanism for arbitrating (Priority, Common and Arbitrary)
simultaneous writes to the same cell. Although the PRAM model has the advantages that the maximum parallelism
inherent in an algorithm can be fully exploited and the logical structure of a parallel algorithm can be easily expressed,
it has been widely criticized for ignoring real computer features and the low efficiency of mapping PRAM algorithms
onto real parallel computers. Much research has been carried out to incorporate various features of real parallel
computers into the PRAM model to enhance its realism.
The standard PRAM model assumes free, automatic synchronization among parallel processors at every step. This is
not realistic since there is no need to synchronize at each step and no synchronization is cost-free in real parallel
computers. APRAM [20] and the Asynchronous PRAM [33] relax this by allowing asynchronous execution between
synchronization points. In contrast, the XPRAM [91] only allows periodic synchronizations between asynchronous
executions. Although these models charge no cost for synchronization, they still provide the incentives to synchronize
only when necessary.
The concurrent reads and writes in the PRAM model can cause memory contention in real computers. Such
situations must be accounted for to avoid designing parallel algorithms with a large amount of memory contention. The
Module Parallel Computer [73] model addresses this problem by dividing the global common memory into
m modules, where each module allows only one memory access per time step. This model resolves memory
contention at the module level, but does not address the issues of bandwidth and network capacity. Another, more
natural extension to the PRAM model is the QRQW PRAM [34] model, which uses a queue to arbitrate and manage
simultaneous memory accesses to one memory cell; the cost of a memory access to a cell is a function of the queue length.
As it became increasingly clear that memory access to non-local data severely affects performance,
PRAM variants were proposed to remedy this by charging a cost for non-local memory access. The LPRAM [2]
model charges a cost of l units of time for non-local data access. A more elaborate variant of LPRAM is BPRAM [1],
which charges l units of time for the first data item and b units of time for each following contiguous data item (b is far less than
l). Under the BPRAM model, algorithm designs with reference locality and block message transfer are encouraged.
PRAM(m) [71] incorporates bandwidth limitations by restricting the size of global shared memory to m memory
locations. This model is based on the CRCW PRAM model except that in any given step only m accesses can be
serviced.
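The LPRAM and BPRAM charging rules described above can be compared with a small sketch. It assumes that LPRAM charges l for every non-local item individually and that a BPRAM block of k contiguous items costs l + (k - 1)·b; the function names and the sample values of l and b are hypothetical.

```python
def lpram_cost(num_items, l):
    """LPRAM (as read here): every non-local item costs l time units."""
    return num_items * l


def bpram_cost(num_items, l, b):
    """BPRAM: l for the first item of a contiguous block, b for each following item."""
    if num_items == 0:
        return 0
    return l + (num_items - 1) * b


# Hypothetical parameters: fetching 64 contiguous non-local items with l = 100, b = 1.
lpram_total = lpram_cost(64, l=100)
bpram_total = bpram_cost(64, l=100, b=1)
```

Because b is far less than l, the block transfer is much cheaper than 64 separate accesses, which is why BPRAM rewards algorithms with reference locality.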
The PRAM model contains no notion of a memory hierarchy; i.e., PRAM does not model the differences in the access
speeds between the private cache on a core and the shared memory that is addressable by all cores. Thus, PRAM
cannot accurately model the actual execution time of algorithms on modern multicore architectures. More recently,
several models emphasizing memory hierarchies have been proposed [11, 9, 8]. In particular, the Parallel External
Memory (PEM) model is an extension of the PRAM model which includes a single level of memory hierarchy [8]. A
more general model is the Multicore model [11] which models multiple levels of the memory hierarchy. In this thesis,
we choose to use the PEM model. Our choice is motivated by the fact that the PEM model is simpler, and we believe
it is sufficient to illustrate the trade-offs that we are interested in analyzing.
2.4.2 Distributed Memory Parallel Computational Models
There are several parallel computational models that assume a distributed memory paradigm in which processors communicate
through message passing. Here, we review the BSP and LogP models.
The BSP model is known for its explicit requirement of synchronization in each superstep, hence the name bulk
synchronous parallel. A BSP parallel computer requires p processor/memory components, a router that delivers
point-to-point messages, and facilities for synchronization. A program executed on the BSP model
consists of a sequence of supersteps with periodicity L. In each superstep, each processor/memory component can
perform a combination of local computation on locally available data and message passing. After each period of L time
units, a global check is performed to ensure all components have finished the superstep; otherwise, another superstep is
allocated for the unfinished superstep. The BSP model imposes a bandwidth limitation on the algorithm by limiting
the maximum number of messages that can be sent/received in each superstep to h = L/g, where g is the minimum time gap
between messages. Synchronization is charged at most L time units. To hide the latency of message transmission,
sufficient parallel slackness is required; thus, the BSP model places a strong requirement on the capability of latency hiding
in the design and implementation of a parallel algorithm.
The LogP model is characterized by four parameters, represented by the four characters in its name: L (the latency of message
passing); o (the overhead incurred by a processor in preparing and processing a message); g (the minimum time interval
between successive messages, whose inverse is essentially the communication bandwidth); and P (the number of
processors in the model). Thus, it differs from the BSP model in its additional parameter o and in the meaning of
the parameters L and g. The parameter L in the BSP model represents the synchronization cost and also implies the message
passing latency, since a message can only take effect in the next superstep; it is used to enforce bulk synchrony
in BSP. In contrast, L in LogP measures the actual latency of message passing. Thus, BSP requires extensive
message latency hiding, while LogP reflects the nature of asynchronous execution. The parameter g (not explicit
in the BSP model) plays almost the same role in the two models, namely the gap between successive messages, although
it is defined differently. Usually, with these four parameters to account for, it is not easy to design algorithms on the LogP model. LogP
is a generalization of BSP, providing asynchronous communication. Like BSP, LogP has a two-level memory model
providing scope for strict locality, and accounting for the global latency and bandwidth costs of the implementation.
The main drawback of both LogP and BSP models is that they are architecture (interconnect) independent and assume
the communication cost grows more slowly than may be considered realistic.
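The parameters of the two models can be exercised with a short sketch. The BSP bound h = L/g is taken from the description above (floored here to a whole number of messages, which is an assumption). The LogP timing formulas follow the usual reading of the parameters, with one send overhead, one network latency and one receive overhead per message, and successive sends spaced by max(g, o); they are not stated in this chapter and should be read as assumptions, as should the function names and sample values.

```python
def bsp_max_messages(L, g):
    """BSP: messages a component may send/receive per superstep, h = L / g (floored)."""
    return L // g


def logp_single_message(L, o):
    """LogP: send overhead + network latency + receive overhead."""
    return o + L + o


def logp_stream(k, L, o, g):
    """LogP: time until the last of k back-to-back messages from one sender is received.

    Successive sends are assumed to be spaced by max(g, o).
    """
    return (k - 1) * max(g, o) + logp_single_message(L, o)


# Hypothetical parameter values, in arbitrary time units.
h = bsp_max_messages(L=100, g=4)
one_msg = logp_single_message(L=6, o=2)
three_msgs = logp_stream(3, L=6, o=2, g=4)
```

The comparison shows the different roles of L: in BSP it bounds the work and traffic of a superstep, whereas in LogP it is only one term of the end-to-end message time.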
In this thesis, we employ an alternate approach to parallel algorithm design [61], in which parallel algorithms are
designed in terms of basic data communication operations. In this approach, only the implementations of these operations
are optimized for different parallel computers. In practice, a relatively small set of communication operations
forms the core of many parallel algorithms. More details about this model are provided in Section 3.2.
2.5 Scalability Metrics
Our scalability metrics are in some ways analogous to performance scalability under iso-efficiency as defined by
Kumar et al. [61]; the latter is a measure of an algorithm’s ability to effectively utilize an increasing number of
processors in a multicomputer architecture. Recall that efficiency measures the ratio of the speed-up obtained by an
algorithm to the number of processors used. Kumar et al. measure scalability by observing how much the problem size has to
grow as a function of the number of processors used in order to maintain constant efficiency.
Sun and Rover define a scalability metric called the isospeed measure, which is the factor by which the problem
size has to be increased so that the average unit speed of computation remains constant if the number of processors is
raised from p to p′ [88]. The average unit speed of a parallel computer is defined as its achieved speed (W/T_p) divided
by the number of processors. Thus, if the number of processors is increased from p to p′, then isospeed(p, p′) =
(p′W)/(pW′). The problem size W′ required for p′ processors is determined by the isospeed. For a perfectly parallelizable
algorithm with no communication, isospeed(p, p′) = 1 and W′ = p′W/p. For more realistic parallel systems,
isospeed(p, p′) < 1 and W′ > p′W/p.
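The isospeed definition above translates directly into a few lines of code. The function and variable names below are illustrative choices, and the sample problem sizes are hypothetical.

```python
def average_unit_speed(W, T_p, p):
    """Achieved speed W / T_p divided by the number of processors p."""
    return W / T_p / p


def isospeed(p, W, p_new, W_new):
    """isospeed(p, p') = (p' * W) / (p * W')."""
    return (p_new * W) / (p * W_new)


# Perfectly parallelizable case: doubling processors needs exactly double the work.
ideal = isospeed(p=4, W=1000, p_new=8, W_new=2000)

# A more realistic case (hypothetical numbers): the work must grow by more than p'/p.
realistic = isospeed(p=4, W=1000, p_new=8, W_new=2500)
```

A value of 1 means the problem size grows only in proportion to the processor count; values below 1 indicate that extra work is needed to keep the average unit speed constant.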
Nussbaum and Agarwal defined scalability of a parallel architecture for a given algorithm as the ratio of the
algorithm’s asymptotic speedup when run on the architecture in question to its corresponding asymptotic speedup when
run on an EREW PRAM [78]. The asymptotic speedup is the maximum obtainable speedup for a fixed problem
size given an unlimited number of processors. This metric captures the communication overheads of an architecture
by comparing the performance of a given algorithm on it to the performance of the same algorithm on an ideal
communication free architecture. Informally, it reflects the fraction of the parallelism inherent in a given algorithm
that can be exploited by any machine of that architecture as a function of the problem size.
Jogalekar and Woodside proposed a strategy-based scalability metric for general distributed systems [46]. The
scalability is based on productivity, which is defined as the value delivered by the system divided by its cost (monetary
charge) per unit time. A system is scalable if productivity keeps pace with cost. Their scalability metric measures the
worthiness of renting a service. However, commercial charge varies from customer to customer, based on business
considerations, and does not necessarily reflect the inherent scalability of the underlying computing system.
Chapter 3
Models
Estimating energy/performance cost of parallel computers requires a computational model and an energy model.
Two types of computation models (and their corresponding architectures) have been proposed: shared memory [53]
and message passing [21]. A model of energy cost can be associated with any parallel computational model. By
considering a model of each type, we illustrate how this can be done. Prior to that, we provide a broad classification
of the parallel computers along various dimensions.
3.1 Taxonomy of Parallel Architectures
Parallel computers differ along various dimensions such as control mechanism, address-space organization and
interconnection network.
3.1.1 Control Mechanism
Processing units in parallel computers either operate under the centralized control of a single control unit or work
independently. In architectures referred to as single instruction stream, multiple data stream (SIMD), a single control
unit dispatches instructions to each processing unit. In an SIMD parallel computer, the same instruction is executed
synchronously by all processing units. Computers in which each processor is capable of executing a different program
independent of the other processors are called multiple instruction stream, multiple data stream (MIMD) computers.
SIMD computers require less hardware than MIMD computers because they have only one global control unit. Fur-
thermore, SIMD computers require less memory because only one copy of the program needs to be stored. In contrast,
MIMD computers store the program and operating system at each processor.
Individual processors in an MIMD computer are more complex, because each processor has its own control unit.
It may seem that the cost of each processor must be higher than the cost of a SIMD processor. However, it is possible
to use general-purpose microprocessors as processing units in MIMD computers. In contrast, the CPU used in SIMD
computers has to be specially designed. Hence, due to the economy of scale, processors in MIMD computers may be
both cheaper and more powerful than processors in SIMD computers. In the rest of the thesis we consider only MIMD
computers for analyzing parallel algorithms.
3.1.2 Address-Space Organization
Solving a problem on an ensemble of processors requires interaction among processors. The message-passing and
shared-address-space architectures provide two different means of processor interaction.
Message-Passing Architecture
In a message-passing architecture, processors are connected using a message-passing interconnection network. Each
processor is connected to the network. Each processor has its own memory called the local memory, which is acces-
sible only to that processor. Processors can interact only by passing messages. This architecture is also referred to as
a distributed-memory architecture. MIMD message-passing computers are commonly referred to as multicomputers.
Shared-Address-Space Architecture
The shared-address-space architecture provides hardware support for read and write access by all processors to a
shared address space. Processors interact by modifying data objects in the shared address space. MIMD shared-
address-space computers are often referred to as multiprocessors. Most shared-address-space computers also have a
local cache at each processor to increase their effective processor-memory bandwidth. As in sequential computers, a
cache provides faster access to the data contained in the local memory. The cache can also be used to provide fast
access to remotely-located shared data. Whenever a processor needs data that is located in a non-local memory, it is
copied into the local cache. Subsequent access to this data is very fast.
3.1.3 Interconnection Networks
Shared-address-space computers and message-passing computers can be constructed by connecting processors and memory
units using a variety of interconnection networks. Interconnection networks can be classified as static or dynamic.
Static networks consist of point-to-point communication links among processors and are also referred to as direct
networks. Static networks are typically used to construct message-passing computers. Dynamic networks are built
using switches and communication links. Communication links are connected to one another dynamically by the
switching elements to establish paths among processors and memory banks. Dynamic networks are referred to as
indirect networks and are normally used to construct shared-address-space computers.
3.2 Message-Passing Computation Model
For message-passing parallel computers [63], we consider a model where each core possesses a fast local cache
and cores communicate through an interconnection network, as shown in Fig. 3.2. In this section, we discuss some
important interconnection (static) networks. Next, we provide a model for communication costs (both time and energy)
in these networks. Finally, in order to facilitate our energy-performance trade-off analysis of parallel algorithms in
coming chapters, we provide energy and communication costs of basic data communication patterns, which form the
core of many parallel algorithms.
3.2.1 Static Interconnection Networks
Completely-Connected Network
In a completely connected network, each processor has a direct communication link to every other processor in the
network. This network is ideal in the sense that a processor can send a message to another processor in a single step,
since a communication link exists between them.
Star-Connected Network
In a star-connected network, one processor acts as the central processor. Every other processor has a communication
link connecting it to this processor. The central processor is the bottleneck in the star topology.
Linear Array and Ring
In a linear array network, each processor (except the processors at the ends) has a direct communication link to two
other processors. A wraparound connection is often provided between the processors at the ends. A linear array with
a wraparound connection is referred to as a ring. One way of communicating a message between processors is by
repeatedly passing it to the processor immediately to the right (or left, depending on which direction yields a shorter
path) until it reaches its destination.
Hypercube Network
A hypercube is a multidimensional mesh of processors with exactly two processors in each dimension. A d-dimensional
hypercube consists of $p = 2^d$ processors. A hypercube can be recursively constructed as follows: a zero-dimensional
hypercube is a single processor; a one-dimensional hypercube is constructed by connecting two zero-dimensional
hypercubes; in general, a (d + 1)-dimensional hypercube is constructed by connecting the corresponding processors
of two d-dimensional hypercubes. In a d-dimensional hypercube, each processor is directly connected to d other
processors.
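The recursive construction can be illustrated with a small sketch. It assumes a binary labeling of the processors (0 to 2^d - 1), under which two processors are linked exactly when their labels differ in one bit; the labeling and the function names are illustrative assumptions, not part of the model above.

```python
def hypercube_neighbors(node, d):
    """Return the d neighbors of `node` in a d-dimensional hypercube.

    Assumes processors are labeled 0 .. 2**d - 1 so that linked
    processors differ in exactly one bit of their labels.
    """
    if not 0 <= node < 2 ** d:
        raise ValueError("node label out of range")
    # Flipping bit k crosses the link added when dimension k was joined.
    return [node ^ (1 << k) for k in range(d)]


def hypercube_distance(a, b):
    """Links on a shortest path: the Hamming distance of the two labels."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(5, 3))   # neighbors of processor 101
    print(hypercube_distance(0, 7))    # opposite corners of a 3-cube
```

Under this labeling every processor has exactly d neighbors, which matches the degree stated above.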
Mesh Network
The two-dimensional mesh is an extension of the linear array to two dimensions. In a two-dimensional mesh, each
processor has a direct communication link connecting it to four other processors. If both dimensions of the mesh
contain an equal number of processors, then it is called a square mesh; otherwise it is called a rectangular mesh.
Often, the processors at the periphery are connected by wraparound connections. Such a mesh is called a wraparound
mesh or a torus. A message from one processor to another can be routed in the mesh by first sending it along one
dimension and then along the other dimension until it reaches its destination.
In this thesis, we focus on the mesh network because of the following characteristics. First, meshes containing a
large number of processors can be constructed relatively inexpensively. Second, many applications map naturally
onto a mesh network. Third, many algorithms ported directly from networks with a higher degree of connectivity
(such as hypercubes) yield similar performance on mesh architectures. Finally, several commercially-available parallel
computers are based on the mesh network.
3.2.2 Communication Costs in Static Interconnection Networks
The time and energy spent communicating information from one processor to another are often a major source of
overhead when executing programs on a parallel computer. The time taken to communicate a message between
two processors in the network is called communication latency. Communication latency is the sum of the time to
prepare a message for transmission and the time taken by the message to traverse the network to its destination.
Similarly, communication energy is the sum of the energy spent to prepare a message for transmission and the energy
utilized for transferring the message across the network to its destination. The principal parameters that determine the
communication latency and communication energy are as follows.
Startup time (ts) : The startup time is the time required to handle a message at the sending processor. This includes
the time to prepare the message (adding header, trailer, and error correction information), the time to execute
the routing algorithm, and the time to establish an interface between the local processor and the router. This
delay is incurred only once for a single message transfer.
Per-hop time (th) : The time taken by the header of a message to travel between two directly-connected processors
in the network is called per-hop time. It is also known as node latency.
Per-word transfer time (tw) : If the channel bandwidth is r words per second, then each word takes time tw = 1/r
to traverse the link. This time is called the per-word transfer time.
Startup energy (es) : The startup energy is the energy required to handle a message at the sending processor. In other
words, it is the energy spent during the startup time.
Per-word-hop energy (ewh) : The energy required to transfer a single word between two directly-connected proces-
sors in the network is called per-word-hop energy.
Communication Latency
Many factors influence the communication latency of a network, such as the topology of the network and the switching
techniques. We now provide two switching techniques that are frequently used in parallel computers. The routing
techniques that use them are called store-and-forward routing and cut-through routing.
In store-and-forward routing, when a message is traversing a path with multiple links, each intermediate processor
on the path forwards the message to the next processor after it has received and stored the entire message. Suppose
that a message of size m is being transmitted through such a network. Assume that it traverses l links. At each link,
the message incurs a cost $t_h$ for the header and $m t_w$ for the rest of the message to traverse the link. Since there are l
such links, the total time is $(t_h + m t_w) l$. Therefore, for store-and-forward routing, the total communication cost for a
message of size m words to traverse l communication links is
$t_{comm} = t_s + (m t_w + t_h) l$.
In current parallel computers, the per-hop time $t_h$ is quite small. For most parallel algorithms, it is less than $m t_w$ even
for small values of m and thus can be ignored. Hence the above expression can be simplified to
$t_{comm} = t_s + m t_w l$.
Store-and-forward routing makes poor use of communication resources. A message is sent from one processor to
the next only after the entire message has been received. In contrast, cut-through routing advances a message from the
incoming link to the outgoing link as soon as the message arrives. Therefore, cut-through routing uses less memory
and memory bandwidth at intermediate processors and is faster. Consider a message that is traversing such a network.
If the message traverses l links and $t_h$ is the per-hop time, then the header of the message takes $l t_h$ time to reach the
destination. If the message is m words long, then the entire message will arrive in time $m t_w$ after the arrival of the
header of the message. Therefore the total communication time for cut-through routing is given by
$t_{comm} = t_s + l t_h + m t_w$.
This time is an improvement over store-and-forward routing. Note that if the communication is between nearest
neighbors (that is l = 1), or if the message size is small, then the communication time is similar for both routing
schemes.
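The two latency expressions can be compared numerically with a minimal sketch. The function names and the parameter values (startup time 50, per-word time 2, per-hop time 1, in arbitrary time units) are illustrative assumptions, not measurements.

```python
def store_and_forward_time(m, l, ts, tw, th):
    """Latency of an m-word message over l links: ts + (m*tw + th) * l."""
    return ts + (m * tw + th) * l


def cut_through_time(m, l, ts, tw, th):
    """Latency of an m-word message over l links: ts + l*th + m*tw."""
    return ts + l * th + m * tw


if __name__ == "__main__":
    # Hypothetical parameters in arbitrary time units.
    ts, tw, th = 50, 2, 1
    for l in (1, 8):
        sf = store_and_forward_time(100, l, ts, tw, th)
        ct = cut_through_time(100, l, ts, tw, th)
        print(f"l={l}: store-and-forward={sf}, cut-through={ct}")
```

With these values the two schemes coincide for l = 1 and diverge as the path grows, since only store-and-forward pays the m t_w term on every link.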
Communication Energy
Consider a message traversing a network irrespective of the routing mechanism. If the message, which is m words
long, traverses l links and $e_{wh}$ is the per-word-hop energy, it consumes $m l e_{wh}$ units of energy to reach the destination.
Therefore the total communication energy for either routing mechanism is given by
$e_{comm} = e_s + m l e_{wh}$.
In this thesis, we assume that the startup energy $e_s$ is much smaller than $e_{wh}$. Hence, the above expression can be
simplified to $e_{comm} = m l e_{wh}$.
3.2.3 Basic Communication Patterns
In most parallel algorithms, processors need to exchange data. This exchange of data significantly affects the effi-
ciency of parallel programs by introducing communication delays during the execution. There are a few common
basic patterns of interprocessor communication that are frequently used as building blocks in a variety of parallel
algorithms. Proper implementation of these basic communication operations on various parallel architectures is a key
to the efficient execution of the parallel algorithms that use them.
In this section, we present efficient algorithms for some basic communication operations on the ring and two-
dimensional mesh. Although it is unlikely that large scale parallel computers will be based on the ring topology,
it is helpful to understand various communication operations in the context of rings because the rows and columns
of wraparound meshes are rings. We provide the procedures to implement the basic communication operations for
the cut-through (CT) routing scheme. We assume that the communication links are bidirectional; that is, two directly-
connected processors can send messages of size m to each other simultaneously in time $t_s + t_w m$. We also assume
that a processor can send a message on only one of its links at a time. Similarly, it can receive a message on only one
link at a time. However, a processor can receive a message while sending another message at the same time on the
same or a different link.
In the following sections, we describe various communication operations and derive expressions for their time and
energy complexity on ring and mesh parallel architectures.
Simple Message Transfer between Two Processors
Sending a message from one processor to another is the most basic communication operation. Recall from the previous
section that with CT routing, sending a single message containing m words takes $t_s + m t_w + l t_h$ time, where l is the
number of links traversed by the message. On an ensemble of p processors, l is at most $\lfloor p/2 \rfloor$ for a ring and $2\lfloor \sqrt{p}/2 \rfloor$
for a wraparound square mesh. Thus with CT routing, the time for a single message transfer on ring and mesh has
an upper bound of $t_s + t_w m + t_h \lfloor p/2 \rfloor$ and $t_s + t_w m + 2 t_h \lfloor \sqrt{p}/2 \rfloor$ respectively. Similarly, since $e_{comm} = m l e_{wh}$, the energy for a single
message transfer on ring and mesh has an upper bound of $e_{wh} m \lfloor p/2 \rfloor$ and $2 e_{wh} m \lfloor \sqrt{p}/2 \rfloor$ respectively.
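The bounds on the number of links can be checked by enumeration. The sketch below assumes processors on the ring are numbered 0 to p - 1 and processors on the wraparound mesh are (row, column) pairs, with messages routed along one dimension and then the other; the names and numbering are illustrative assumptions.

```python
def ring_hops(src, dst, p):
    """Links on the shorter way around a p-processor ring."""
    d = abs(src - dst) % p
    return min(d, p - d)


def torus_hops(src, dst, side):
    """Links used on a side x side wraparound mesh when routing along
    one dimension and then the other; processors are (row, col) pairs."""
    return ring_hops(src[0], dst[0], side) + ring_hops(src[1], dst[1], side)


def transfer_time(m, l, ts, tw, th):
    """Cut-through latency for m words over l links."""
    return ts + tw * m + th * l


def transfer_energy(m, l, ewh):
    """Energy for m words over l links, neglecting the startup energy."""
    return m * l * ewh


if __name__ == "__main__":
    p, side = 8, 4
    worst_ring = max(ring_hops(0, d, p) for d in range(p))
    worst_mesh = max(torus_hops((0, 0), (r, c), side)
                     for r in range(side) for c in range(side))
    print(worst_ring, worst_mesh)
```

By symmetry it suffices to enumerate destinations from a single source, which is what the usage example does.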
One-to-All Broadcast
Parallel algorithms often require a single processor to send identical data to all other processors or to a subset of them.
This operation is known as one-to-all broadcast. Initially, only the source processor has the data of size m that needs to
be broadcast. At the termination of the procedure, there are p copies of the initial data, one residing at each processor.
A parallel algorithm may require a single processor to accumulate information from every other processor. This
operation is known as single-node accumulation, and is the dual of one-to-all broadcast. In single-node accumulation,
every processor initially has a message containing m words. The data from all processors are combined through an
associative operator, and accumulated at a single destination processor. The total size of the accumulated data remains
m after the operation. Thus, single-node accumulation can be used to find the sum, product, maximum, or minimum
of a set of numbers, or perform any associative bitwise operation on elements of the set.
Next, we consider the implementation of one-to-all broadcast in detail on ring and mesh architectures using CT
routing. On a p-processor ring, the source processor first sends the data to a processor at a distance p/2. In the second
step, both processors that have the data transmit it to processors at a distance p/4 in the same direction. Assuming
that p is a power of 2, in the ith step, each processor that has the data sends it to a processor at a distance $p/2^i$. All
messages flow in the same direction. The algorithm concludes after log p steps. In the ith step, $2^{i-1}$ processors each send m words over $p/2^i$ links, so the communication time and energy
in the ith step are $t_s + t_w m + t_h p/2^i$ and $e_{wh} m p/2$ respectively. Hence the total time and energy for the broadcast with
CT routing on a ring of p processors are $t_s \log p + t_w m \log p + t_h (p-1)$ and $(1/2) e_{wh} m p \log p$ respectively.
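The step-by-step costs can be accumulated in a short simulation and compared with the closed forms. This is a sketch under the stated assumptions (p a power of 2, source at processor 0, startup energy neglected); the function name and parameter values are hypothetical.

```python
def ring_broadcast(p, m, ts, tw, th, ewh):
    """Simulate one-to-all broadcast on a p-processor ring (CT routing).

    Returns (time, energy). Assumes p is a power of two and that the
    source is processor 0.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    steps = p.bit_length() - 1          # log2(p)
    holders = {0}                       # processors that already hold the data
    time = energy = 0
    for i in range(1, steps + 1):
        dist = p >> i                   # distance p / 2**i in step i
        time += ts + tw * m + th * dist
        # every current holder sends m words across `dist` links
        energy += len(holders) * m * dist * ewh
        holders |= {(h + dist) % p for h in holders}
    assert len(holders) == p            # every processor was reached
    return time, energy


if __name__ == "__main__":
    print(ring_broadcast(16, 10, 50, 2, 1, 3))
```

For p = 16 the simulated totals agree with $t_s \log p + t_w m \log p + t_h(p-1)$ and $(1/2) e_{wh} m p \log p$ evaluated at the same parameters.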
On a two-dimensional square mesh with CT routing, one-to-all broadcast is performed in two phases. In each
phase the ring procedure is applied in a different dimension of the mesh. First, a one-to-all broadcast is initiated by the
source processor among the $\sqrt{p}$ processors in its row (call it the source row). Second, a one-to-all broadcast is initiated
in each column by its processor in the source row. Each of the two phases takes $(t_s + t_w m) \log \sqrt{p} + t_h(\sqrt{p} - 1)$
time. However, the first and second phases consume different amounts of energy, given by $(1/2) e_{wh} m \sqrt{p} \log \sqrt{p}$ and
$(1/2) e_{wh} m p \log \sqrt{p}$ respectively. Thus, the time and energy for the entire broadcast are $(t_s + t_w m) \log p + 2 t_h(\sqrt{p} - 1)$
and $(1/4) e_{wh} m \sqrt{p}(\sqrt{p} + 1) \log p$ respectively.
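The two-phase composition can be written out directly from the ring result. The sketch below uses the ring closed forms for a row or column of $\sqrt{p}$ processors and assumes $\sqrt{p}$ is a power of 2; the function names and parameter values are illustrative assumptions.

```python
import math


def ring_broadcast_cost(q, m, ts, tw, th, ewh):
    """Closed-form (time, energy) of one-to-all broadcast on a q-processor ring."""
    logq = q.bit_length() - 1                 # log2(q), q a power of two
    time = (ts + tw * m) * logq + th * (q - 1)
    energy = ewh * m * q * logq // 2          # q * logq is even for q >= 2
    return time, energy


def mesh_broadcast_cost(p, m, ts, tw, th, ewh):
    """(time, energy) of one-to-all broadcast on a square p-processor mesh."""
    side = math.isqrt(p)
    if side * side != p or side & (side - 1):
        raise ValueError("sqrt(p) must be an integer power of two")
    t_ring, e_ring = ring_broadcast_cost(side, m, ts, tw, th, ewh)
    # Phase 1: one row broadcast. Phase 2: `side` column broadcasts in parallel,
    # so time adds once per phase while energy adds once per active ring.
    return 2 * t_ring, e_ring + side * e_ring


if __name__ == "__main__":
    print(mesh_broadcast_cost(16, 10, 50, 2, 1, 3))
```

The asymmetry between the phases is visible in the return statement: both phases cost the same time, but the second phase involves $\sqrt{p}$ rings and therefore $\sqrt{p}$ times the energy.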
All-to-All Broadcast
All-to-all broadcast is a generalization of one-to-all broadcast in which all p processors simultaneously initiate a
broadcast. A processor sends the same m-word message to every other processor, but different processors may broad-
cast different messages. The dual of all-to-all broadcast is multinode accumulation, in which every processor is the
destination of a single-node accumulation. We now describe all-to-all broadcast on ring and mesh with CT routing.
On a ring, for all-to-all broadcast, all channels are kept busy simultaneously because each processor always has
some information that it can pass along to its neighbor. Each processor first sends to one of its neighbors the data it
needs to broadcast. In subsequent steps, it forwards the data received from one of its neighbors to its other neighbor.
In all-to-all broadcast, p different messages circulate in the p-processor ensemble. If communication is performed
circularly in a single direction, then each processor receives all (p − 1) pieces of information from all other processors
in (p − 1) steps. The time taken and energy consumed by the entire operation are $(t_s + t_w m)(p-1)$ and $e_{wh} m p (p-1)$
respectively.
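The circulation of messages around the ring can be simulated as follows. This sketch assumes every processor forwards to its right neighbor, counts only nearest-neighbor transfers of $t_s + t_w m$ per step as in the expression above, and uses hypothetical names and parameter values.

```python
def ring_all_to_all_broadcast(p, m, ts, tw, ewh):
    """Simulate all-to-all broadcast on a p-processor ring.

    Returns (time, energy, known), where known[i] is the set of message
    ids held by processor i at the end.
    """
    known = [{i} for i in range(p)]      # each processor starts with its own message
    outgoing = list(range(p))            # message each processor sends next
    time = energy = 0
    for _ in range(p - 1):
        # processor i receives what its left neighbor sent in this step
        incoming = [outgoing[(i - 1) % p] for i in range(p)]
        for i in range(p):
            known[i].add(incoming[i])
        outgoing = incoming              # forward the newly received message next
        time += ts + tw * m
        energy += p * m * ewh            # p simultaneous one-link transfers
    return time, energy, known


if __name__ == "__main__":
    t, e, known = ring_all_to_all_broadcast(6, 10, 50, 2, 3)
    print(t, e, all(k == set(range(6)) for k in known))
```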
Just like one-to-all broadcast, the all-to-all broadcast algorithm for the 2-D mesh is based on the ring algorithm,
treating rows and columns of the mesh as rings. Once again communication takes place in two phases. In the first
phase, each row of the mesh performs an all-to-all broadcast using the procedure for the ring. In this phase, all processors
collect $\sqrt{p}$ messages corresponding to the $\sqrt{p}$ processors of their respective rows. Each processor consolidates
this information into a single message of size $m\sqrt{p}$, and proceeds to the second phase of the algorithm. The second
communication phase is a columnwise all-to-all broadcast of the consolidated messages. By the end of this phase, each
processor obtains all p pieces of m-word data that originally resided on different processors. The first phase of $\sqrt{p}$
simultaneous all-to-all broadcasts (each among $\sqrt{p}$ processors) concludes in time $(t_s + t_w m)(\sqrt{p} - 1)$ and consumes
$e_{wh} m p (\sqrt{p} - 1)$ units of energy. The number of processors participating in each all-to-all broadcast in the second
phase is also $\sqrt{p}$, but the size of each message is now $m\sqrt{p}$. Therefore, this phase takes $(t_s + t_w m \sqrt{p})(\sqrt{p} - 1)$
time and $e_{wh} m p \sqrt{p}(\sqrt{p} - 1)$ energy units. The time and energy for the entire all-to-all broadcast on a p-processor
two-dimensional square mesh evaluate to $2 t_s(\sqrt{p} - 1) + t_w m (p-1)$ and $e_{wh} m p (p-1)$ respectively.
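The per-phase expressions can be summed programmatically to confirm the totals. The following is a sketch that simply evaluates the phase formulas for an integer $\sqrt{p}$; the function name and parameter values are illustrative assumptions.

```python
import math


def mesh_all_to_all_broadcast_cost(p, m, ts, tw, ewh):
    """(time, energy) of all-to-all broadcast on a square p-processor mesh,
    obtained by adding the row phase and the column phase."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    # Phase 1: `side` row rings, messages of m words.
    t1 = (ts + tw * m) * (side - 1)
    e1 = side * (ewh * m * side * (side - 1))
    # Phase 2: `side` column rings, consolidated messages of m * side words.
    t2 = (ts + tw * m * side) * (side - 1)
    e2 = side * (ewh * (m * side) * side * (side - 1))
    return t1 + t2, e1 + e2


if __name__ == "__main__":
    print(mesh_all_to_all_broadcast_cost(16, 10, 50, 2, 3))
```

Note that the total energy equals that of the ring algorithm on p processors, while the startup term drops from $t_s(p-1)$ to $2 t_s(\sqrt{p}-1)$.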
One-to-All Personalized Communication
In one-to-all personalized communication, a single processor sends a unique message of size m to every other pro-
cessor. This operation is also known as single-node scatter. The dual of one-to-all personalized communication is
single-node gather, in which a single processor collects a unique message from each other processor. The procedure
of single-node gather can be derived for any interconnection topology simply by reversing the direction and sequence
of messages in the corresponding one-to-all personalized communication algorithm. Note that, a gather operation is
different from an accumulation operation in that it does not involve any combination or reduction of data.
We now describe one-to-all personalized communication on ring and mesh with CT routing. On a p-processor ring, the
source processor initially holds all p messages. In the first communication step, the source processor sends the message
corresponding to the last processor to its neighbor. Then, in the second communication step, the source processor sends
the message corresponding to the penultimate processor to its neighbor. Meanwhile, the neighbor processor
sends the message that it received from the source in the first step to its own neighbor. In the ith communication step, the source
processor sends the message corresponding to the (p − i)th processor to its neighbor, and the processors that received
a message in the previous step relay it to their neighbors. The algorithm concludes after (p − 1) steps. Since i processors each send m words over one link in the ith step, the
communication time and energy in the ith step are $t_s + t_w m$ and $e_{wh} m i$ respectively. Hence the total time and energy for
the operation with CT routing on a ring of p processors are $(t_s + t_w m)(p-1)$ and $(1/2) e_{wh} m p (p-1)$ respectively.
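The pipelined behaviour of the ring scatter can be made concrete with a small simulation. It assumes the source is processor 0, that the message for processor d travels d links to the right, and that startup energy is neglected; names and parameter values are hypothetical.

```python
def ring_scatter(p, m, ts, tw, ewh):
    """Simulate one-to-all personalized communication on a p-processor ring.

    Returns (time, energy). Processor 0 is assumed to be the source.
    """
    position = {}                # destination -> processor currently holding its message
    time = energy = 0
    for step in range(1, p):
        position[p - step] = 0   # source injects the message for processor p - step
        in_flight = [d for d, pos in position.items() if pos != d]
        for d in in_flight:
            position[d] += 1     # each undelivered message advances one link
        time += ts + tw * m
        energy += len(in_flight) * m * ewh
    # every message has reached its destination after p - 1 steps
    assert all(pos == d for d, pos in position.items())
    return time, energy


if __name__ == "__main__":
    print(ring_scatter(8, 10, 50, 2, 3))
```

In step i exactly i messages are in flight, which is where the $e_{wh} m i$ per-step energy and the $(1/2) e_{wh} m p (p-1)$ total come from.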
For the mesh, the algorithm proceeds in two phases as usual and starts with the source distributing pieces of
$m\sqrt{p}$ words among the $\sqrt{p}$ processors in its row such that each of these processors receives the data meant for the
$\sqrt{p}$ processors in its column. In the next phase, one-to-all personalized communication is initiated in each column
by each processor in the source row distributing $\sqrt{p}$ messages of m words each among the $\sqrt{p}$ processors of its column. Therefore the total time for the
entire one-to-all personalized communication on a p-processor two-dimensional mesh is $(t_s + t_w m \sqrt{p})(\sqrt{p} - 1) + (t_s + t_w m)(\sqrt{p} - 1)$,
which evaluates to $2 t_s(\sqrt{p} - 1) + t_w m (p-1)$. Furthermore, the total energy for the entire operation is
$(1/2)(e_{wh} m \sqrt{p})\sqrt{p}(\sqrt{p} - 1) + (1/2)\sqrt{p}(e_{wh} m)\sqrt{p}(\sqrt{p} - 1)$, which evaluates to $e_{wh} m p (\sqrt{p} - 1)$.
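As a quick check of the algebra, the two phase costs can be evaluated and added. This sketch reuses the ring scatter closed forms with $\sqrt{p}$ processors per ring; the helper names and parameter values are illustrative assumptions.

```python
import math


def ring_scatter_cost(q, words, ts, tw, ewh):
    """Closed-form (time, energy) of a scatter of `words`-word pieces on a q-processor ring."""
    time = (ts + tw * words) * (q - 1)
    energy = ewh * words * (q * (q - 1) // 2)   # q * (q - 1) is always even
    return time, energy


def mesh_scatter_cost(p, m, ts, tw, ewh):
    """(time, energy) of one-to-all personalized communication on a square mesh."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    # Phase 1: one scatter along the source row with pieces of m * side words.
    t1, e1 = ring_scatter_cost(side, m * side, ts, tw, ewh)
    # Phase 2: `side` column scatters in parallel with pieces of m words.
    t2, e2 = ring_scatter_cost(side, m, ts, tw, ewh)
    return t1 + t2, e1 + side * e2


if __name__ == "__main__":
    print(mesh_scatter_cost(16, 10, 50, 2, 3))
```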
All-to-All Personalized Communication
In all-to-all personalized communication, also known as total exchange, each processor sends a distinct message of
size m to every other processor. Each processor sends different messages to different processors. We now discuss the
implementation of all-to-all personalized communication on parallel computers with ring and mesh interconnection
networks. The communication patterns of all-to-all personalized communication are identical to those of all-to-all
broadcast on both architectures. Only the size and contents of messages are different.
On a ring, to perform all-to-all personalized communication, every processor sends (p − 1) pieces of data, each
of size m. First, each processor sends all pieces of data as one consolidated message of size m(p − 1) to one of its
neighbors. Of the m(p − 1) words of data received by a processor in this step, one m-word packet belongs to it.
Therefore, each processor extracts the information meant for it from the data received, and forwards the remainder
(p − 2 pieces of size m each) to the next processor. This process continues for p − 1 steps. The size of the message
being transferred between processors decreases by m words in each successive step. In every step, each processor
adds to its collection one m-word packet originating from a different processor. Hence, in p−1 steps, every processor
receives the information from all other processors in the ensemble. Since the size of messages transferred in the
ith step is m(p − i) on a ring of p processors, the total time taken and energy spent by this operation evaluate to
$(t_s + (1/2) t_w m p)(p-1)$ and $(1/2) e_{wh} m p^2 (p-1)$ respectively.
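The totals follow from summing the shrinking message sizes over the p − 1 steps, which the sketch below does explicitly. Function name and parameter values are hypothetical; startup energy and per-hop time are neglected as in the expressions above.

```python
def ring_total_exchange(p, m, ts, tw, ewh):
    """Accumulate (time, energy) of all-to-all personalized communication
    on a p-processor ring, step by step."""
    time = energy = 0
    for i in range(1, p):
        words = m * (p - i)             # size of each message sent in step i
        time += ts + tw * words
        energy += p * words * ewh       # all p processors send over one link
    return time, energy


if __name__ == "__main__":
    print(ring_total_exchange(8, 10, 50, 2, 3))
```

Since the message sizes sum to $m p (p-1)/2$, the loop reproduces $(t_s + (1/2) t_w m p)(p-1)$ and $(1/2) e_{wh} m p^2 (p-1)$.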
In all-to-all personalized communication on a p-processor mesh, each processor first groups its p messages according
to the columns of their destination processors. After the messages are grouped, all-to-all personalized communication
is performed independently in each row with clustered messages of size $m\sqrt{p}$. One cluster contains the
information for all $\sqrt{p}$ processors of a particular column. Assuming a square mesh, we can compute the time and energy spent
in this phase by substituting $\sqrt{p}$ for the number of processors and $m\sqrt{p}$ for the message size in the time and energy
equations obtained for the ring interconnection. For time, the result of this substitution is $(t_s + t_w m p/2)(\sqrt{p} - 1)$. Since all
rows are active during the first phase, the total energy spent in this phase evaluates to $(1/2) e_{wh} m p^2 (\sqrt{p} - 1)$. Before the
second communication phase, the messages in each processor are sorted again, this time according to the rows of their
destination processors; then communication similar to the first phase takes place in all the columns of the mesh. By
the end of this phase, each processor receives a message from every other processor. The time and energy spent in this
phase are the same as in the first phase. Therefore, the total time and energy for all-to-all personalized communication for
messages of size m on a p-processor two-dimensional square mesh are $(2 t_s + t_w m p)(\sqrt{p} - 1)$ and $e_{wh} m p^2 (\sqrt{p} - 1)$
respectively.
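The substitution described above can be carried out mechanically. The sketch below evaluates the ring expressions with $\sqrt{p}$ processors and messages of $m\sqrt{p}$ words, then accounts for the $\sqrt{p}$ concurrently active rings and the two phases; names and values are illustrative assumptions.

```python
import math


def ring_total_exchange_cost(q, words, ts, tw, ewh):
    """Closed-form (time, energy) of total exchange on a q-processor ring
    with `words`-word messages."""
    half = q * (q - 1) // 2                     # q * (q - 1) is always even
    time = ts * (q - 1) + tw * words * half
    energy = ewh * words * q * half
    return time, energy


def mesh_total_exchange_cost(p, m, ts, tw, ewh):
    """(time, energy) of all-to-all personalized communication on a square mesh."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    t_ring, e_ring = ring_total_exchange_cost(side, m * side, ts, tw, ewh)
    # Two identical phases (rows, then columns); `side` rings are active in each.
    return 2 * t_ring, 2 * side * e_ring


if __name__ == "__main__":
    print(mesh_total_exchange_cost(16, 10, 50, 2, 3))
```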
3.3 Shared-Memory Computation Model
Consider the current generation of shared memory multicore architectures [6]. Such multicore architectures use a hi-
erarchical shared memory. Although the Parallel Random Access Machine (PRAM) models shared memory architec-
tures [53], the PRAM model contains no notion of the memory hierarchy. More recently, several models emphasizing
memory hierarchies have been proposed [11, 9, 8]. In particular, the Parallel External Memory (PEM) model is an
extension of the PRAM model which includes a single level of memory hierarchy [8]. A more general model is the
so-called Multicore model [11]; in this model, multiple levels of the memory hierarchy are represented. For simplicity,
we consider the PEM model; adding energy cost to this model is sufficient to illustrate performance-energy trade-offs.
The PEM model [8] assumes there are P cores and a two-level memory hierarchy consisting of an external memory
(main memory) shared by all the cores, and P internal memories (caches) local to each core. All caches are of a fixed
size M , partitioned in blocks of size B, and owned by a single core (only the owner can access a cache). To perform
an operation on data, a core must have the data in its own cache. Data is transferred between the main memory and
the cache in blocks of size B (see Fig 3.1).
Figure 3.1: The PEM model Figure 3.2: Message passing model
Multiple cores can access distinct blocks of the shared memory concurrently. There are three variants of the PEM
model (as is the case with the PRAM model); these variants determine how the same block of shared memory may be
accessed by different cores.
• Concurrent Read, Concurrent Write (CRCW): multiple cores can read and write the same block in the main
memory concurrently.
• Concurrent Read, Exclusive Write (CREW): multiple cores can read the same block concurrently, but
cannot write to it concurrently.
• Exclusive Read, Exclusive Write (EREW): there is no simultaneous access of any kind to the same block of the
main memory by multiple cores.
In this thesis we assume that memory accesses (both reads and writes) take a constant amount of time and consume
some constant amount of energy, thus abstracting away the structure of the dynamic interconnection network. In other
words, we do not model memory contention. However, we do not believe it is necessarily useful to consider more
complicated models for shared memory architectures, given that shared memory architectures are themselves not
scalable [77].
3.4 On-Core Time-Energy Model
The computation time Tbusy on a given core is proportional to the number of cycles including cache accesses µ
executed on the core. Let X be the frequency of a core, then:
$T_{busy} = (\text{number of cycles}) \times \frac{1}{X}$ (3.1)
We denote the time for which a given core is active (not idle) as Tactive.
The following equation approximates the power consumption in a CMOS circuit:
$P = C_L V^2 f + I_L V$ (3.2)
where $C_L$ is the load capacitance, $V$ is the supply voltage, $I_L$ is the leakage current, and $f$ is the operational frequency.
The first term corresponds to the dynamic power-consumption component of the total power consumption, while the
second term corresponds to the leakage power consumption.
Recall that a linear increase in the voltage supply leads to a linear increase in the frequency of the core. However,
a linear increase in the voltage supply also leads to a nonlinear (typically cubic) increase in power consumption. Thus,
for simplicity, we model the dynamic and leakage energies consumed by a core, E, to be the result of the above
mentioned critical factors:
$E_{dynamic} = E_d \cdot T_{busy} \cdot X^3$ (3.3)
$E_{leakage} = E_l \cdot T_{active} \cdot X$ (3.4)
where Ed and El are some hardware constants [16].
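The effect of frequency scaling under Eqs. 3.1, 3.3 and 3.4 can be illustrated with a minimal sketch. The constants passed in (cycle count, $E_d$, $E_l$, frequencies) are hypothetical values chosen only for illustration, and the function names are not taken from the thesis.

```python
def busy_time(cycles, X):
    """Eq. 3.1: computation time for `cycles` cycles at frequency X."""
    return cycles / X


def dynamic_energy(Ed, cycles, X):
    """Eq. 3.3: E_dynamic = Ed * T_busy * X**3 (equals Ed * cycles * X**2)."""
    return Ed * busy_time(cycles, X) * X ** 3


def leakage_energy(El, t_active, X):
    """Eq. 3.4: E_leakage = El * T_active * X."""
    return El * t_active * X


if __name__ == "__main__":
    cycles, Ed, El = 1000, 1.0, 0.1          # hypothetical constants
    for X in (2.0, 1.0):
        t = busy_time(cycles, X)
        # Assume here that the core is active only while it is busy.
        print(X, t, dynamic_energy(Ed, cycles, X), leakage_energy(El, t, X))
```

In this example, halving the frequency doubles the busy time and cuts the dynamic energy by a factor of four, while the leakage energy is unchanged as long as the core is active only while busy. This is the basic trade-off that the later analysis exploits.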
In this thesis, we assume that all cores are homogeneous, i.e., they operate at the same frequency. We also assume that
the frequency of the cores can be varied using a frequency (voltage) probe, and the computation time of the cores can be
scaled (by scaling the frequency of the cores). Because recent processors have introduced efficient support for low
power modes that can reduce the power consumption to near zero, it is reasonable to assume that the energy consumed
by idle cores is zero.
The following parameters and constants are used in the rest of the thesis. $F$: maximum frequency of a single core;
$K_{c0}$: number of cycles executed at maximum frequency in the message startup time ($K_{c0} = t_s F$); $K_{c1}$: number of
cycles executed at maximum frequency in the per-word transfer time ($K_{c1} = t_w F$); $K_{c2}$: number of cycles executed at
maximum frequency in the per-hop time ($K_{c2} = t_h F$).
Given these models, which specify performance and energy consumption, we can compute the various scalability
metrics discussed in section 1. This involves analyzing an algorithm to deduce an equation (called the PE equation)
which relates the algorithm’s performance, energy costs, number and frequency of cores, to the input size. Once this
equation is determined, it can be analyzed to compute various optima.
Chapter 4
Directed Acyclic Graphs
Because precedence-constrained parallel applications in scientific and engineering fields are the most typical applica-
tion model, the problem of scheduling these applications (task scheduling) both on homogeneous and heterogeneous
computing systems has been studied extensively over the past few decades [89, 95, 23]. However, most efforts in task
scheduling have focused on two issues, the minimization of application completion time (makespan/schedule length)
and time complexity. It is only recently that much attention has been paid to energy consumption in scheduling, partic-
ularly on high-performance computing systems (HPCSs) [85, 70, 83]. The energy consumption issue in these HPCSs
raises various monetary, environmental and system performance concerns.
In this chapter, we optimize various energy-performance trade-off metrics for DAG based parallel applications.
Specifically, we consider the following set of problems.
• Determine the optimal configuration (schedule: optimal number of cores and frequencies of tasks) required for
minimum energy consumption under a performance bound; for simplicity, we fix the performance bound to be
the time taken by the parallel application on a sequential machine running at maximum frequency.
• Determine the optimal schedule of a parallel application (DAG) that maximizes performance (minimize time
taken) under an energy bound; for simplicity, we fix the energy bound to be the energy consumed by running
the parallel application on a sequential processor at maximum frequency.
• Determine the optimal schedule of a parallel application which minimizes the cost (a linear combination of time
and energy; see Eq. 5.5) associated with running the application.
For each of the above problems, we also analyze the sensitivity of the optimal schedule to the parameters k and Kc which
are respectively, the ratio of the energy consumed by a single message transfer to the energy consumed by a single
computation cycle and the ratio of the communication time for single message transfer to the computation time of
single cycle at maximum frequency. Furthermore, we analyze the sensitivity of these optimal schedules to various
DAG structure parameters. For this purpose, we use random DAG structures with appropriate characteristics.
Many scheduling problems are NP-complete. Typically, efficient heuristic algorithms are devised to obtain near-
optimal solutions. Researchers have also used various techniques to obtain exact solutions to these scheduling prob-
lems, including Integer Linear Programming (ILP) solvers, Constraint Programming (CP) solvers, Binary Decision
Diagram (BDD) packages, Satisfiability (SAT) solvers, model checkers, etc. In particular, SAT is a well-known NP-
complete problem of assigning values to a set of boolean variables to make a propositional logic formula true. The
formula is typically written in Conjunctive Normal Form (CNF) consisting of a conjunction of boolean disjunctions.
SAT solvers have become very fast in recent years, and a good SAT solver can routinely handle formulas with
10000 variables or more. SAT can encode bounded integers with bit vectors, but it cannot encode unbounded types
such as real variables, or infinite structures, such as queues or linked lists. Even for bounded variables, the number
of variables can be very large, and SAT solving can be very slow if there is a large number of variables. Because the
number of boolean variables needed to encode integer variables grows large quickly for large integer values, SAT is
not very suitable for optimization problems involving large integer values.
Satisfiability Modulo Theories (SMT) refers to extensions of SAT that add the ability to handle arithmetic and other
decidable theories, e.g., equality with uninterpreted function symbols, linear integer arithmetic, linear real arithmetic,
integer difference logic and real difference logic. In our problem formulation, we use the Linear Real Arithmetic
(LRA) theory, which can express linear inequalities involving real variables and constants. (If we adopt a continuous
view of time, then we can declare all time-related variables as real numbers instead of integers in the SMT solver,
which can handle real arithmetic as well as integer arithmetic.)
Early attempts at solving SMT instances involved translating them to Boolean SAT instances (e.g., a 32-bit integer
variable would be encoded by 32 boolean variables, and word-level operations such as addition would be replaced by
lower-level boolean operations) and then passing this formula to a Boolean SAT solver. This allows us to use existing SAT
solvers and leverage their performance and capacity improvements over time. On the other hand, the loss of high-
level semantics means that the SAT solver has to work much harder than necessary to discover obvious facts such as
x+ y = y + x for integer addition. This observation led to the development of a number of SMT solvers that tightly
integrate the boolean reasoning of a DPLL-style search with theory solvers that handle conjunctions of predicates
from a given theory. This architecture, called DPLL(T), gives the responsibility of boolean reasoning to the DPLL-
based SAT solver which, in turn, interacts with a solver for theory T through a well-defined interface. The theory
solver checks the feasibility of conjunctions of theory predicates passed on to it from the SAT solver as it explores the
boolean search space of the formula. Different SMT solvers may use different theory solvers and different techniques
of integrating them within the DPLL(T) framework. In this chapter, we use the SMT solver Z3 [24] from Microsoft
Research to optimize various energy-performance trade-off metrics of parallel applications, as an alternative to more
conventional optimization techniques such as mixed-integer linear arithmetic. Z3 handles linear arithmetic constraints
by using a Simplex-based linear arithmetic solver that is integrated efficiently in the DPLL(T) framework. The solver
gains efficiency with a number of features such as fast backtracking, a priori simplification to reduce the problem size,
and an efficient form of theory propagation. We refer the interested reader to [24] for details of the algorithm.
4.1 DAG Model and Characteristics
A parallel application can be modeled as a Directed Acyclic Graph (DAG), G = (V, E), where V = {ni | i = 1, ..., N}
is a set of nodes representing tasks and E ⊆ {ei,j | (i, j) ∈ {1, ..., N} × {1, ..., N}} is a set of directed edges between
nodes, representing communication between tasks. Each task ni has a weight attribute which denotes the number of
computation cycles (or flops) of the task. Moreover, each edge ei,j has a weight, which is the amount of data (in bytes)
that task ni must send to task nj (we call nj a successor of ni and ni a predecessor of nj). We assume that the
redistribution cost between subsequent tasks ni and nj is zero when these tasks are executed on the same set of
processors. The overall execution time of G, or makespan, is defined as the time between the beginning of G's entry
task and the completion of G's exit task.
Because irregular DAGs capture the heterogeneous and unpredictable aspects of scientific work flows, in this
chapter, we consider randomly generated irregular application DAGs. As opposed to layered DAGs where all the
tasks in a given level have the same cost (all the transfers between the same two levels share the same communication
cost), irregular DAGs might contain tasks at the same level that have different costs. We use three popular parameters
to define the shape of each DAG: width, regularity, and density. The width determines the maximum parallelism in a
DAG, that is the number of tasks in the largest level. A small value leads to chain graphs and a large value leads to
fork-join graphs. The regularity denotes the uniformity of the number of tasks in each level. A low value means that
levels contain very dissimilar numbers of tasks, while a high value means that all levels contain similar numbers of
tasks. The density denotes the number of edges between two levels of a DAG, with a low value leading to few edges
and a large value leading to many edges.
In addition to these randomly generated DAGs, we also consider DAGs of two High Performance Computing
kernels: the Fast Fourier Transform (FFT) and Strassen's matrix multiplication algorithm. For these two application graphs,
the shape is fixed by the algorithms but the costs associated with computation and transfer nodes are generated following
the same generation approach as for the random graphs. The FFT task graph can be divided in two parts corresponding
respectively to the recursive calls and the butterfly operations of the algorithm. For d data points, there are 2d − 1
recursive call tasks and (d/2) log d butterfly operation tasks. The main feature of the FFT task graph is that every
path from the start node to any of the exit tasks is a critical path, i.e., computation or communication tasks in a given
level have the same cost. In the FFT-related experiments, we used d, the number of data points as a parameter of our
simulations (2, 4, 8, and 16), to generate FFT-shaped DAGs with different number of tasks (5, 15, 39 and 95) using
the random DAG generation tool [42]. As with the FFT application graph, all the entry tasks of Strassen's matrix
multiplication algorithm are on a critical path, and computation or communication tasks in a given level have the same
cost. A Strassen DAG comprises 25 tasks.
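The FFT task counts quoted above can be checked with a short calculation. The sketch below counts 2d − 1 recursive-call tasks plus d log d butterfly tasks; it assumes that d is a power of two and that the logarithm is base 2. The function name is illustrative and is not part of the DAG generation tool [42].

```python
import math

def fft_task_count(d: int) -> int:
    """Number of tasks in an FFT-shaped DAG for d data points (d a power of two)."""
    recursive_calls = 2 * d - 1            # recursive-call tasks
    butterflies = d * int(math.log2(d))    # butterfly tasks, assuming log base 2
    return recursive_calls + butterflies

counts = [fft_task_count(d) for d in (2, 4, 8, 16)]
print(counts)  # [5, 15, 39, 95]
```

The result agrees with the task counts (5, 15, 39 and 95) used in the FFT-related experiments.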
In this chapter we follow the convention of using lower-case letters to denote variables, and upper-case letters to
denote constants. The following constants are used in the rest of the chapter.
• Ecomm : Energy consumed for single byte transfer between any two processors.
• Tcomm : Per-word transfer time (inverse of bandwidth; Note that we ignore the startup time)
• F : Maximum frequency of a core.
Furthermore, for each task i, 1 ≤ i ≤ N , we define the following constants and variables.
• Wi : Number of flops associated with task i.
• si : Start time of task i.
• ai : Processor to which task i is assigned (processor count starts from 0).
• fi : Frequency at which task i is executed.
Similarly, for each edge ei,j in E, we represent its weight using the constant Ci,j.
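As an illustration of the notation, the following minimal sketch stores the constants (Wi, Ci,j) and the decision variables (si, ai, fi) in plain Python containers. The class names, method names and the example weights are assumptions made for this sketch only; they are not taken from the code generator used in this chapter.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TaskGraph:
    """G = (V, E): W[i] is the flop count of task i, C[(i, j)] the bytes sent from i to j."""
    W: Dict[int, float]
    C: Dict[Tuple[int, int], float]

    def successors(self, i: int) -> List[int]:
        return sorted(j for (h, j) in self.C if h == i)

    def predecessors(self, j: int) -> List[int]:
        return sorted(i for (i, k) in self.C if k == j)

@dataclass
class Schedule:
    """Decision variables: start time s[i], processor a[i], frequency f[i]."""
    s: Dict[int, float]
    a: Dict[int, int]
    f: Dict[int, float]

    def finish_time(self, g: TaskGraph, i: int) -> float:
        # Run time of task i is W_i / f_i, as in the constraints of Section 4.2.1.
        return self.s[i] + g.W[i] / self.f[i]

# A four-task fork-join graph with made-up weights.
g = TaskGraph(W={1: 100, 2: 200, 3: 200, 4: 100},
              C={(1, 2): 8, (1, 3): 8, (2, 4): 8, (3, 4): 8})
sched = Schedule(s={1: 0, 2: 50, 3: 60, 4: 170},
                 a={1: 0, 2: 0, 3: 1, 4: 0},
                 f={1: 2.0, 2: 2.0, 3: 2.0, 4: 2.0})
print(g.successors(1), sched.finish_time(g, 2))
```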
4.2 Minimizing Energy under Iso-performance
In this section, we are concerned with determining the optimal schedule (task and frequency assignment) of a DAG
application required for minimum energy consumption under a performance bound, given an arbitrary number of
processors and a discrete set of frequencies. Without any loss of generality, we fix the performance bound to be the
time taken by the parallel application on a sequential machine running at maximum frequency.
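The sequential baseline used as the bound can be computed directly from the task weights. The sketch below assumes that a single processor incurs no communication cost (consistent with the zero redistribution cost assumed for tasks on the same processor) and that a task of W cycles at frequency f consumes Ed · W · f² of energy, which is the computation term used in the energy constraint of this chapter. The function name and the numeric values are illustrative.

```python
from typing import Dict, Tuple

def sequential_baseline(W: Dict[int, float], f_max: float, e_d: float) -> Tuple[float, float]:
    """Time and energy of running every task back to back on one core at f_max."""
    total_cycles = sum(W.values())
    time_bound = total_cycles / f_max                  # candidate value for Perf
    energy_bound = e_d * total_cycles * f_max ** 2     # energy of the same run
    return time_bound, energy_bound

W = {1: 100, 2: 300}   # made-up flop counts
perf, ener = sequential_baseline(W, f_max=2.0, e_d=1.0)
print(perf, ener)  # 200.0 1600.0
```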
4.2.1 SMT Formulation
Consider a DAG G = (V, E) with N tasks (vertices). Let Perf and Ebound denote the performance budget and energy
upper bound, respectively. We define the following constraints.
Structural Constraint: For each edge in the DAG G, if its two vertices (first and second) are allocated on the
same processor, then the start time of the second task should be at least the sum of the start time of the first
task and the run time of the first task. If the vertices of the edge are allocated on different processors, then
the start time of the second task should be at least the sum of the start time of the first task, the run time of
the first task, and the communication delay for transferring the message from the processor of the first task to
the processor of the second task. More formally,
\[
\forall e_{i,j} \in E.\ \text{IF-THEN-ELSE}\left( a_i = a_j,\ \ s_i + \frac{W_i}{f_i} \le s_j,\ \ s_i + \frac{W_i}{f_i} + C_{i,j} T_{comm} \le s_j \right) \tag{4.1}
\]
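To make the structural constraint concrete, the following sketch checks it for a given candidate schedule instead of solving for one. The dictionaries W, C, s, a and f mirror the symbols above; the function name and the sample values are assumptions for illustration.

```python
from typing import Dict, Tuple

def structural_ok(W: Dict[int, float], C: Dict[Tuple[int, int], float],
                  s: Dict[int, float], a: Dict[int, int], f: Dict[int, float],
                  t_comm: float) -> bool:
    """Return True if every edge (i, j) satisfies the structural constraint."""
    for (i, j), data in C.items():
        finish_i = s[i] + W[i] / f[i]
        # Communication delay applies only when the two tasks are on different processors.
        delay = 0.0 if a[i] == a[j] else data * t_comm
        if finish_i + delay > s[j]:
            return False
    return True

W = {1: 100, 2: 50}
C = {(1, 2): 10}
f = {1: 2.0, 2: 2.0}
same_core = structural_ok(W, C, s={1: 0, 2: 50}, a={1: 0, 2: 0}, f=f, t_comm=1.0)
other_core = structural_ok(W, C, s={1: 0, 2: 50}, a={1: 0, 2: 1}, f=f, t_comm=1.0)
delayed = structural_ok(W, C, s={1: 0, 2: 60}, a={1: 0, 2: 1}, f=f, t_comm=1.0)
print(same_core, other_core, delayed)  # True False True
```

In the example, task 2 may start at time 50 when it shares a processor with task 1, but must wait until time 60 when the 10-byte message has to cross processors.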
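Resource Constraint: Any two distinct tasks assigned to the same processor should be ordered properly. In other
words, the two tasks cannot overlap. More formally,
\[
\forall i, j \in V,\ i \neq j.\ \left( a_i = a_j \Rightarrow \left( s_i + \frac{W_i}{f_i} \le s_j \right) \vee \left( s_j + \frac{W_j}{f_j} \le s_i \right) \right) \tag{4.2}
\]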
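The resource constraint can likewise be checked on a candidate schedule. This is a minimal sketch with hypothetical names; it considers each unordered pair of distinct tasks once.

```python
from itertools import combinations
from typing import Dict

def resource_ok(W: Dict[int, float], s: Dict[int, float],
                a: Dict[int, int], f: Dict[int, float]) -> bool:
    """Return True if no two tasks on the same processor overlap in time."""
    for i, j in combinations(sorted(W), 2):
        if a[i] != a[j]:
            continue  # tasks on different processors never conflict
        end_i = s[i] + W[i] / f[i]
        end_j = s[j] + W[j] / f[j]
        if not (end_i <= s[j] or end_j <= s[i]):
            return False
    return True

W = {1: 100, 2: 100}
f = {1: 2.0, 2: 2.0}
back_to_back = resource_ok(W, s={1: 0, 2: 50}, a={1: 0, 2: 0}, f=f)
overlapping = resource_ok(W, s={1: 0, 2: 25}, a={1: 0, 2: 0}, f=f)
separate_cores = resource_ok(W, s={1: 0, 2: 25}, a={1: 0, 2: 1}, f=f)
print(back_to_back, overlapping, separate_cores)  # True False True
```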
Performance Budget Constraint: All tasks should be finished before the performance budget Perf. More formally,
\[
\forall i \in V.\ \left( s_i + \frac{W_i}{f_i} \le Perf \right) \tag{4.3}
\]
Maximum Energy Constraint: Total energy consumed by the parallel application should be within the energy bound
Ebound. More formally,
\[
\left( \sum_{i \in V} E_d \cdot W_i \cdot f_i^2 \right) + \left( \sum_{e_{i,j} \in E \,\wedge\, a_i \neq a_j} E_{comm} \cdot C_{i,j} \right) \le E_{bound} \tag{4.4}
\]
Equations 4.1 to 4.4 form a constraint set that can be fed into an SMT solver to determine feasibility.
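The left-hand side of Eq. 4.4 is straightforward to evaluate for a fixed assignment. The sketch below does so; the function name and the numbers are illustrative assumptions, and the parameters e_d and e_comm stand for Ed and Ecomm.

```python
from typing import Dict, Tuple

def schedule_energy(W: Dict[int, float], C: Dict[Tuple[int, int], float],
                    a: Dict[int, int], f: Dict[int, float],
                    e_d: float, e_comm: float) -> float:
    """Computation energy plus communication energy for edges that cross processors."""
    comp = sum(e_d * W[i] * f[i] ** 2 for i in W)
    comm = sum(e_comm * data for (i, j), data in C.items() if a[i] != a[j])
    return comp + comm

W = {1: 10, 2: 20}
C = {(1, 2): 5}
f = {1: 2.0, 2: 1.0}
split = schedule_energy(W, C, a={1: 0, 2: 1}, f=f, e_d=1.0, e_comm=3.0)
together = schedule_energy(W, C, a={1: 0, 2: 0}, f=f, e_d=1.0, e_comm=3.0)
print(split, together)  # 75.0 60.0
```

Placing both tasks on one processor removes the communication term, which is why the second value is lower.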
Algorithm 1 Top-level binary search algorithm when using SMT.
  l := LB
  u := UB
  while l < u − 1 do
    Ebound := (l + u)/2
    hasSolution := InvokeSMTSolver(GenSMTModel(Ebound))
    if hasSolution then
      u := Ebound and record the task schedule
    else
      l := Ebound
    end if
  end while
  return u as the minimum energy, along with the corresponding task schedule.
One drawback of SMT compared to ILP is that it does not support optimization directly, but only provides a yes/no
answer to the feasibility of a given constraint set, so we need to use a binary search algorithm at the top level to search
for the minimum energy bound Ebound, as shown in Algorithm 1. The SMT solver is invoked as a subroutine to
check the feasibility of each candidate energy bound Ebound. LB and UB denote the minimum and maximum possible
values of the energy bound, respectively. One should ensure that LB and UB are safe lower and upper bounds,
respectively. GenSMTModel refers to our automatic code generator that takes as input the energy bound Ebound
currently being checked for feasibility, and generates the SMT problem instance as an input to Z3.
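The top-level search can be sketched in a few lines. In the sketch below, the callable solve is a stand-in for InvokeSMTSolver(GenSMTModel(bound)) and is purely hypothetical: it returns a schedule when the constraint set is satisfiable and None otherwise. The sketch further assumes integer-valued bounds, an infeasible LB and a feasible UB; the initial feasibility check of UB is an addition that does not appear in Algorithm 1.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

S = TypeVar("S")

def minimize_bound(lb: int, ub: int,
                   solve: Callable[[int], Optional[S]]) -> Tuple[int, S]:
    """Binary search for the smallest feasible bound, in the style of Algorithm 1."""
    best = solve(ub)                 # extra check, not part of Algorithm 1
    if best is None:
        raise ValueError("upper bound is not feasible")
    l, u = lb, ub
    while l < u - 1:
        bound = (l + u) // 2         # integer midpoint (an assumption of this sketch)
        schedule = solve(bound)
        if schedule is not None:
            u, best = bound, schedule    # feasible: tighten the upper end
        else:
            l = bound                    # infeasible: raise the lower end
    return u, best

# Hypothetical oracle: any bound of at least 37 energy units is feasible.
calls: List[int] = []
def toy_solver(bound: int) -> Optional[str]:
    calls.append(bound)
    return f"schedule with energy <= {bound}" if bound >= 37 else None

min_energy, schedule = minimize_bound(0, 100, toy_solver)
print(min_energy, schedule)
```

The number of solver invocations grows logarithmically with UB − LB, so tight safe bounds directly reduce the number of SMT instances that have to be solved.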
4.2.2 Experimental Evaluation
In this section, we analyze both real world application and random DAGs for energy-performance trade-offs under a
performance bound. Specifically, we perform the following set of experiments. First, we analyze how the minimum
energy (under a performance bound) consumed by these DAG applications changes with the number of cores. To
obtain this result, we add further constraints to make sure that all cores are being utilized, for each specific core
count. Second, we analyze the sensitivity of the optimal number of cores for minimum energy consumption under
a performance bound to two crucial parameters, namely the ratio of single message transfer time (between two adjacent
cores) to computation time of one flop at maximum frequency (Kc), and the ratio of energy consumed for a single message
transfer to energy consumed for a single flop at maximum frequency (k). Finally, we analyze the random DAG applications,
which are generated with appropriate configuration parameters, to see how the optimal number of cores for minimum
energy consumption under a performance bound depends on various DAG configuration parameters.
Figure 4.1: Energy Minimization under Iso-performance graph of FFT with energy on the Y axis (0 to 9E+12) and number
of cores on the X axis (1 to 6), with k = 4000, Kc = 10000, d = 4.
Fig. 4.1 plots energy E as a function of the number of cores p for the FFT application DAG. We can see that energy
initially decreases with increasing p and later increases with increasing p. As explained earlier, this behavior can be
understood by the fact that the energy for computation decreases with an increase in the number of cores running at
reduced frequencies, and the energy for message transfer increases with an increasing number of cores.
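The interplay of the two effects can be illustrated with a deliberately simplified toy model; it is not the FFT DAG and does not reproduce the data in Fig. 4.1. The assumptions are: the work divides perfectly over p cores, so each core can run at frequency F/p and still meet the sequential time bound; computation energy follows Ed · W · f²; and communication energy grows linearly with the number of additional cores. All numeric values are invented for the illustration.

```python
def toy_energy(p: int, work: float = 1000.0, f_max: float = 1.0,
               e_d: float = 1.0, comm_energy_per_core: float = 20.0) -> float:
    """Total energy of a perfectly divisible workload on p cores (toy model)."""
    f = f_max / p                              # frequency that keeps run time at work / f_max
    e_comp = e_d * work * f ** 2               # shrinks quadratically with p
    e_comm = comm_energy_per_core * (p - 1)    # grows linearly with p
    return e_comp + e_comm

energies = {p: toy_energy(p) for p in range(1, 7)}
best_p = min(energies, key=energies.get)
print(best_p)  # 5
```

Under these assumptions the energy falls steeply at first, reaches its minimum at an intermediate core count, and then rises as the communication term takes over, which is the qualitative shape described above.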
Sensitivity Analysis: Communication-Computation Energy-Time
We now analyze the sensitivity of the optimal number of cores for minimum energy consumption under a performance
bound to two crucial parameters namely, ratio of single message transfer time (between two adjacent cores) to com-
putation time of one flop at maximum frequency and ratio of energy consumed for single message transfer to energy
consumed for single flop at max frequency.
Figure 4.2: Sensitivity analysis: optimal number of cores on the Y axis, and Kc (the ratio of a single message transfer
time to the computation time of one flop at maximum frequency, varied from 10000 to 100000) on the X axis with
k = 4000, d = 4.
Fig. 4.2 plots the optimal number of cores required for minimum energy consumption as a function of Kc, the ratio of
the transfer time for a single message to the computation time of one flop at maximum frequency. The results show that
the optimal number of cores required for minimum energy consumption decreases as this ratio increases. As
communication time increases, the scaling of the cores is restricted to a very small range. As a result, the energy
increase for communication begins to dominate the energy decrease due to running more cores at reduced frequency at
a relatively small number of processors. Hence the observation.
Figure 4.3: Sensitivity analysis: optimal number of cores on the Y axis, and k (the ratio of the energy consumed
for single message transfer to the energy consumed for a single flop at max frequency) on the X axis for FFT with
Kc = 10000, d = 4.
Fig. 4.3 plots the optimal number of cores required for minimum energy consumption by varying k. The results
show that the optimal number of cores required for minimum energy consumption decreases with increasing k.
Sensitivity Analysis: DAG Structure
In our experiments we use values 0.2 and 0.8 for density and regularity and 0.2, 0.5 and 0.8 for width. We refer the
reader to the DAG generation program and its documentation for more details [42]. We now analyze the random DAG
applications, which are generated with the above-mentioned configuration parameters, to see how the optimal number
of cores for minimum energy consumption under a performance bound depends on these parameters.
Here, we fix the ratios k and Kc to be 4000 and 10000, respectively. We now consider random DAG structures
with varying widths while fixing both density and regularity to be 0.2. We observe that the optimal number of cores
for widths 0.2, 0.5 and 0.8 is 2, 4 and 6, respectively, for a DAG with 20 tasks (N = 20). In other
words, an increase in width leads to an increase in the optimal number of cores. This observation is intuitive: with
greater width, the energy increase for communication begins to dominate the energy decrease due to running
more cores at reduced frequency only at a relatively large number of processors.
We next consider random DAG structures with varying density while fixing width and regularity to be 0.5 and 0.2,
respectively. We observe that the optimal number of cores decreases with an increase in density ((0.2, 0.8) → (4, 2)).
Since higher density corresponds to more communication, an increase in density causes the energy increase for
communication to dominate the energy decrease due to running more cores at a reduced frequency at a
relatively small number of processors.
We now consider random DAG structures with varying regularity while fixing width and density to be respectively
0.5 and 0.2. We observe that the optimal number of cores increases with an increase in regularity ((0.2, 0.8)→ (4, 6)).
Note that regularity denotes the uniformity of the number of tasks in each level. A low value means that levels contain
very dissimilar numbers of tasks, while a high value means that all levels contain similar numbers of tasks. This
explains our observation.
4.3 Maximizing Performance under An Energy Budget
In this section, we are concerned with evaluating the optimal schedule (task and frequency assignment) of a DAG
application required for maximum performance (minimum time taken) under an energy bound, given an arbitrary number
of processors and a discrete set of frequencies. Without loss of generality, we fix the energy bound to be the energy
consumed by the parallel application on a sequential machine running at maximum frequency.
4.3.1 SMT Formulation
Consider a DAG G = (V, E) with N tasks (vertices). Let Ener and Tbound denote the energy budget and the upper bound
on time taken, respectively. We define the following constraints.
Structural Constraint and Resource Constraint : Same as in earlier section.
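Energy Budget Constraint: Total energy consumed by the parallel application should not exceed the energy budget
Ener. More formally,
\[
\left( \sum_{i \in V} E_d \cdot W_i \cdot f_i^2 \right) + \left( \sum_{e_{i,j} \in E \,\wedge\, a_i \neq a_j} E_{comm} \cdot C_{i,j} \right) \le Ener \tag{4.5}
\]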
Maximum Time Constraint: All tasks should be finished before the time taken upper bound Tbound. More formally,
\[
\forall i \in V.\ \left( s_i + \frac{W_i}{f_i} \le T_{bound} \right) \tag{4.6}
\]
The above equations form a constraint set that can be fed into an SMT solver to determine feasibility. As in
Section 4.2, we use a binary search algorithm at the top level to search for the minimum time bound
Tbound, as shown in Algorithm 1 (Ebound replaced with Tbound). The SMT solver is invoked as a subroutine to check
the feasibility of each candidate time bound Tbound. LB and UB denote the minimum and maximum possible values of
the time bound, respectively. GenSMTModel refers to our automatic code generator that takes as input the time bound
Tbound currently being checked for feasibility, and generates the SMT problem instance as input to Z3.
4.3.2 Experimental Evaluation
In this section, we analyze both real world application and random DAGs for energy-performance trade-offs under an
energy budget. Specifically, we perform the following set of experiments. First, we analyze how the minimum time
taken (under an energy bound) by these DAG applications changes with the number of cores. To obtain this result,
we add further constraints to make sure that all cores are being utilized, for each specific core count. Second, we
analyze the sensitivity of the optimal number of cores for minimum time taken under an energy bound to two crucial
parameters namely, ratio of single message transfer time (between two adjacent cores) to computation time of one flop
at maximum frequency (Kc) and ratio of energy consumed for single message transfer to energy consumed for single
flop at max frequency (k). Finally, we analyze the random DAG applications, which are generated with appropriate
configuration parameters, to see how the optimal number of cores for minimum time taken under an energy budget
depends on various DAG configuration parameters.
Figure 4.4: Iso-energy graph of FFT with time taken (s) on the Y axis (0 to 30000) and number of cores on the X axis
(1 to 6), with k = 4000, Kc = 10000, d = 4.
Fig. 4.4 plots time taken as a function of the number of cores p for the FFT DAG. We can see that time initially
decreases with increasing p and later increases with increasing p. As explained earlier, this behavior can be
understood by the fact that the time taken decreases with an increase in the number of cores, while the energy remaining
for computation (the difference between the energy budget and the energy used for message transfers) decreases with
increasing cores (making the cores run slower).
Sensitivity Analysis: Communication-Computation Energy-Time
We now analyze the sensitivity of the optimal number of cores for minimum time taken given a fixed energy budget
to two crucial parameters namely, the ratio of the transfer time for a single message (between two adjacent cores) to
the computation time of one flop at the maximum clock frequency and the ratio of the energy consumed for a single
message transfer to the energy consumed for a single flop at maximum frequency.
Figure 4.5: Sensitivity analysis: optimal number of cores on the Y axis, and Kc (the ratio of a single message transfer
time to the computation time of one flop at maximum frequency, varied from 10000 to 100000) on the X axis with
k = 4000, d = 4.
Fig. 4.5 plots the optimal number of cores required to minimize the time taken when varying the ratio of a single
message transfer time to the computation time of one flop at maximum frequency. The results show that the optimal
number of cores required for minimum time taken decreases with increasing ratio.
Fig. 4.6 plots the optimal number of cores required for minimum time taken by varying k. The results
show that the optimal number of cores required for minimum time taken decreases with increasing k.
Figure 4.6: Sensitivity analysis: optimal number of cores on the Y axis, and k (the ratio of energy consumed for single
message transfer to the energy consumed for single flop at max frequency) on the X axis with Kc = 10000, d = 4.
Sensitivity Analysis: DAG Structure
We now analyze the random DAG applications, which are generated with the same configuration parameters as in
Section 4.2, to see how the optimal number of cores for minimum time taken under an energy budget
depends on these parameters.
Here, we fix the ratios k and Kc to be 4000 and 10000, respectively. We now consider random DAG structures
with varying widths while fixing both density and regularity to be 0.2. We observe that the optimal number of cores
for widths 0.2, 0.5 and 0.8 is 3, 4 and 6, respectively. In other words, an increase in width leads to an
increase in the optimal number of cores. We next consider random DAG structures with varying density while fixing
width and regularity to be 0.5 and 0.2, respectively. We observe that the optimal number of cores decreases with an
increase in density ((0.2, 0.8) → (4, 3)). Since higher density corresponds to more communication, an increase in
density causes the energy increase for communication to dominate the energy decrease due to running
more cores at reduced frequency at a relatively small number of processors.
We now consider random DAG structures with varying regularity while fixing width and density to be respectively
0.5 and 0.2. We observe that the optimal number of cores increases with an increase in regularity ((0.2, 0.8)→ (4, 6)).
Note that regularity denotes the uniformity of the number of tasks in each level. A low value means that the levels
contain very dissimilar numbers of tasks, while a high value means that all levels contain similar numbers of tasks.
This explains our observation.
4.4 Maximizing Utility
In this section, we are concerned with evaluating the optimal schedule (task and frequency assignment) of a DAG
application required to minimize the cost function (Equation 5.5), given an arbitrary number of processors and a
discrete set of frequencies.
4.4.1 SMT Formulation
Consider a DAG G = (V, E) with N tasks (vertices). Let Cbound denote the cost upper bound. We define the following
constraints.
Structural Constraint and Resource Constraint : Same as in earlier section.
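Cost Budget Constraint: The total cost of the parallel application should not exceed the cost upper bound Cbound.
More formally,
\[
\forall i \in V.\ \ \alpha \cdot \left( s_i + \frac{W_i}{f_i} \right) + \sum_{j \in V} E_d \cdot W_j \cdot f_j^2 + \left( \sum_{e_{x,y} \in E \,\wedge\, a_x \neq a_y} E_{comm} \cdot C_{x,y} \right) \le C_{bound} \tag{4.7}
\]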
These equations form a constraint set that can be fed into an SMT solver to determine feasibility. As done in earlier
sections, we use a binary search algorithm at the top level to search for the minimum cost bound Cbound, as shown in
Algorithm 1 (Ebound replaced with Cbound). The SMT solver is invoked as a subroutine to check the feasibility of
each candidate cost bound Cbound. GenSMTModel refers to our automatic code generator that takes as input the cost
bound Cbound currently being checked for feasibility and generates the SMT problem instance as input to Z3.
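Because the cost budget constraint must hold for every task i, and assuming α ≥ 0, it is equivalent to bounding α times the makespan plus the total energy. The sketch below evaluates that quantity for a fixed schedule. The function name and sample values are illustrative assumptions, and the sketch does not check that the schedule satisfies the structural and resource constraints.

```python
from typing import Dict, Tuple

def schedule_cost(W: Dict[int, float], C: Dict[Tuple[int, int], float],
                  s: Dict[int, float], a: Dict[int, int], f: Dict[int, float],
                  alpha: float, e_d: float, e_comm: float) -> float:
    """alpha * makespan + computation energy + cross-processor communication energy."""
    makespan = max(s[i] + W[i] / f[i] for i in W)
    e_comp = sum(e_d * W[i] * f[i] ** 2 for i in W)
    e_msgs = sum(e_comm * data for (i, j), data in C.items() if a[i] != a[j])
    return alpha * makespan + e_comp + e_msgs

W = {1: 100, 2: 100}
C = {(1, 2): 10}
cost = schedule_cost(W, C, s={1: 0, 2: 50}, a={1: 0, 2: 0},
                     f={1: 2.0, 2: 2.0}, alpha=2.0, e_d=1.0, e_comm=3.0)
print(cost)  # 1000.0
```

Here the makespan is 100 time units and the computation energy is 800, so with α = 2 the cost is 2 · 100 + 800 = 1000.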
4.4.2 Experimental Evaluation
In this section, we analyze both real-world application DAGs and random DAGs for energy-performance trade-offs. Specif-
ically, we perform the following set of experiments. First, we analyze how the minimum cost of these DAG
applications changes with the number of cores. To obtain this result, we add further constraints to make sure that
all cores are being utilized, for each specific core count. Second, we analyze the sensitivity of the optimal number
of cores for minimum cost to two crucial parameters namely, the ratio of a single message transfer time (between
two adjacent cores) to the computation time of one flop at maximum clock frequency (Kc) and the ratio of energy
consumed for a single message transfer to the energy consumed for a single flop at maximum frequency (k). Finally,
we analyze the random DAG applications, which are generated with appropriate configuration parameters, to see how
the optimal number of cores for minimum cost depends on various DAG configuration parameters.
Figure 4.7: Cost graph of FFT with cost on the Y axis (0 to 3E+13) and number of cores on the X axis (1 to 6), with
k = 4000, Kc = 10000, α = 1000000000 and d = 4.
Fig. 4.7 plots cost as a function of the number of cores for the FFT application DAG. We can see that, initially
cost decreases with increasing p and later on increases with increasing p. This is due to the fact that the total energy
consumed increases and the time taken decreases as either the number of cores or the frequency of the cores increases.
Sensitivity Analysis: Communication-Computation Energy-Time
We now analyze the sensitivity of the optimal number of cores for minimum cost to two crucial parameters namely,
ratio of a single message transfer time (between two adjacent cores) to the computation time of one flop at maximum
clock frequency and the ratio of energy consumed for a single message transfer to the energy consumed for a single
flop at maximum frequency.
Fig. 4.8 plots the optimal number of cores required for minimum cost by varying the ratio of a single message
transfer time to the computation time of one flop at maximum clock frequency. The results show that the optimal
number of cores required for minimum cost decreases with increasing ratio.
Figure 4.8: Sensitivity analysis: optimal number of cores on the Y axis, and Kc (the ratio of a single message transfer
time to the computation time of one flop at maximum frequency, varied from 10000 to 100000) on the X axis with
k = 4000, α = 1000000000, d = 4.
Figure 4.9: Sensitivity analysis: optimal number of cores on the Y axis, and k (the ratio of the energy consumed for
a single message transfer to the energy consumed for one flop at max frequency) on the X axis with Kc = 10000,
α = 1000000000.
Fig. 4.9 plots the optimal number of cores required for minimum cost versus k. The result shows that the optimal
number of cores required for minimum cost remains constant over a wide range of k.
Sensitivity Analysis: DAG Structure
We now analyze the random DAG applications, which are generated with the earlier mentioned configuration parame-
ters, to see how the optimal number of cores for minimum cost depends on these parameters. Here, we fix k,
Kc and α to be 4000, 10000, and 1000000000, respectively. We now consider random DAG structures with varying
widths while fixing both density and regularity to be 0.2. We observe that the optimal number of cores for widths
0.2, 0.5 and 0.8 is 2, 4 and 6, respectively. In other words, an increase in width leads to an increase
in the optimal number of cores. We next consider random DAG structures with varying density while fixing width and
regularity to be 0.5 and 0.2, respectively. We observe that the optimal number of cores decreases with an increase in
density ((0.2, 0.8) → (4, 2)). Since higher density corresponds to more communication, an increase in density causes
the energy increase for communication to dominate the energy decrease due to running more cores at
reduced frequency at a relatively small number of processors. We now consider random DAG structures
with varying regularity while fixing width and density to be 0.5 and 0.2, respectively. We observe that the optimal
number of cores increases with an increase in regularity ((0.2, 0.8) → (4, 6)).
Chapter 5
Scalability Metrics: Methodology
In this chapter, we present our methodology to evaluate scalability metrics namely, energy scalability under iso-
performance, energy bounded scalability and utility based scalability for a given parallel algorithm (as described
in Section 1.4). We illustrate our methodology for a simple parallel addition algorithm in the next chapter. In the
following chapters, we evaluate these scalability metrics for various genres of parallel algorithms.
5.1 Energy Scalability under Iso-performance
Parallel algorithms are parameterized by the number of cores on which they may be executed (usually from a single
core to some large number). We define the performance of a parallel algorithm as the time required for the completion
of a problem instance. Given a parallel algorithm, an architecture model, and a performance requirement, energy
scalability under iso-performance provides the optimal number of cores required to minimize the energy consumption
as a function of the problem size [56, 58].
We now present our methodology to evaluate energy scalability under iso-performance for a given parallel algo-
rithm A on a given parallel computation model.
Step 1 Find the critical path πA in the execution of A. Note that the critical path is the longest path through the task
dependency graph (where edges represent task serialization) of the parallel algorithm. The length of the critical path can
be determined by measuring the execution of the longest thread. Note that the critical path length gives a lower bound
on the execution time of a parallel algorithm.
Step 2 Partition πA into memory accesses (reads and writes), message transfers, synchronization breaks and compu-
tation steps.
Step 3 Evaluate the sum total of computation cycles at all cores (Scomp).
Step 4 Evaluate the energy complexity Ecomm of memory accesses (for shared memory model) and message transfers
(for message passing model) of A.
Step 5 Scale the computation steps of $\pi_A$ so that the parallel performance of A matches the specified performance
requirement. We do this by scaling the computation time of $\pi_A$ to the difference between (a) the required performance
and (b) the time taken for memory accesses, message transfers and synchronization breaks in the critical path. Thus,
the new reduced frequency at which all p cores should run is given by:
\[ X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}} \]
where $\pi_{ncomp}$ denotes the number of computation cycles in the critical path and $\pi_{tcomm}$ denotes the communication
time of the critical path.
Step 6 Evaluate the total active time at all the cores assuming the frequency obtained in Step 5. Observe that scaling
$\pi_A$ may lead to an increase in active time in other paths (at other cores). The total active time at all the cores at the new
frequency is given by
\[ T_{active} = p\left(\pi_{ncomp} \cdot \frac{1}{X} + \pi_{tcomm}\right) = p \cdot \text{Performance Target} \]
Step 7 Frame an expression for the energy consumption of the parallel algorithm using the energy model. The energy
expression is the sum of the energy consumed by 1) computation, $E_{comp}$, 2) communication, $E_{comm}$, and 3) leakage,
$E_{leak}$:
\[ E_{comp} = E_d \cdot S_{comp} \cdot X^2 \tag{5.1} \]
\[ E_{leak} = E_l \cdot T_{active} \cdot X \tag{5.2} \]
Note that Ecomp is lower if the cores run at a lower frequency, while Eleak may decrease as the active cores take
longer to finish. Ecomm may increase as more cores are used since the computation is more distributed.
Step 8 Analyze the equation to obtain the number of cores required for minimum energy consumption as a function
of problem size. In particular, we compute the appropriate number of cores that are required to guarantee a required
level of performance.
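Steps 5 to 7 can be sketched in code. The following Python function is a minimal illustration, assuming that the critical-path quantities, $S_{comp}$ and $E_{comm}$ have already been obtained in Steps 1 to 4 for a particular p. The function name and argument names are hypothetical and are not part of the original methodology.

```python
def iso_performance_energy(pi_ncomp, pi_tcomm, s_comp, e_comm, p, target, e_d, e_l):
    """Return (X, T_active, E) for p cores under a fixed performance target.

    All argument names are illustrative; they mirror the symbols of Steps 1-7.
    """
    # Time left on the critical path for computation once communication is paid for.
    slack = target - pi_tcomm
    if slack <= 0:
        raise ValueError("communication alone exceeds the performance target")
    x = pi_ncomp / slack              # Step 5: reduced frequency
    t_active = p * target             # Step 6: p * (pi_ncomp / x + pi_tcomm)
    e_comp = e_d * s_comp * x ** 2    # Eq. 5.1
    e_leak = e_l * t_active * x       # Eq. 5.2
    return x, t_active, e_comp + e_comm + e_leak
```

Step 8 then amounts to evaluating this expression for each candidate p (with the p-dependent inputs recomputed) and keeping the p with the smallest energy.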
5.2 Energy Bounded Scalability
Given a parallel algorithm, an architecture model, and a fixed energy budget, energy-bounded scalability provides the
optimal number of cores required to maximize performance as a function of the problem size [57]. Energy-bounded
scalability may be used in mobile devices which have strong energy constraints.
Note that this problem is a dual characterization to the problem of energy scalability under iso-performance.
Obviously, for embarrassingly parallel algorithms, whether one wants to optimize the number of cores for a given
energy budget, or the energy for a given performance requirement, it is best to use a maximal number of cores.
However, in general, the two analyses give different results. This can be understood by appreciating the following
difference. Energy scalability under iso-performance minimizes total energy consumed by an algorithm. The total
energy consumed is a sum of energy consumed in all paths executed by the parallel algorithm. On the other hand,
energy bounded scalability analysis optimizes performance: performance is measured by considering the length of the
longest path in the execution of a parallel algorithm.
We now present our methodology to evaluate energy-bounded scalability for a given parallel algorithm A on a
given parallel computation model. Steps 1 − 4 of the methodology are the same as those of energy scalability under
iso-performance.
Step 5 Evaluate the total active time ($T_{active}$) at all the cores as a function of the frequency of the cores.
\[ T_{active} = p\left(\pi_{ncomp} \cdot \frac{1}{X} + \pi_{tcomm}\right) \]
Step 6 Frame an expression for the energy consumed by the parallel algorithm as a function of the frequency of the
cores, using the energy model. The energy expression is the sum of the energy consumed by 1) computation, $E_{comp}$,
2) communication, $E_{comm}$, and 3) leakage, $E_{leak}$:
\[ E_{comp} = E_d \cdot S_{comp} \cdot X^2 \]
\[ E_{leak} = E_l \cdot T_{active} \cdot X = E_l \cdot p(\pi_{ncomp} + \pi_{tcomm} \cdot X) \]
Step 7 Given an energy budget E, evaluate the frequency X with which the cores should run, as a function of
E. Note that both Ecomp and Eleak depend on the frequency of the cores. In particular, solve the equation E =
Ecomp + Ecomm + Eleak; the solution X is given by
\[ X = \frac{-E_l p \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp}(E - E_{comm} - E_l p \pi_{ncomp})}}{2 E_d S_{comp}} \]
Step 8 Express the time taken (inverse of performance) by the parallel algorithm as a function of the frequency
obtained in Step 7:
\[ \text{Time Taken} = \pi_{tcomm} + \pi_{ncomp} \cdot \frac{1}{X} \tag{5.3} \]
Step 9 Analyze the equation to obtain the number of cores required for maximum performance as a function of prob-
lem size. In particular, compute the appropriate number of cores that are required to maximize the performance under
the energy budget constraint.
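Steps 7 and 8 can be illustrated with a short Python sketch. It assumes the same inputs as Steps 1 to 4 for a fixed p and simply evaluates the positive root of the quadratic from Step 7, followed by Eq. 5.3. The function and argument names are hypothetical.

```python
import math


def energy_bounded_time(pi_ncomp, pi_tcomm, s_comp, e_comm, p, budget, e_d, e_l):
    """Return (X, time taken) for p cores under a fixed energy budget.

    Solves E_d*S_comp*X^2 + E_l*p*pi_tcomm*X - (budget - E_comm - E_l*p*pi_ncomp) = 0.
    """
    a = e_d * s_comp
    b = e_l * p * pi_tcomm
    c = budget - e_comm - e_l * p * pi_ncomp   # energy left for frequency-dependent terms
    if c <= 0:
        raise ValueError("energy budget is too small for this number of cores")
    x = (-b + math.sqrt(b * b + 4 * a * c)) / (2 * a)   # Step 7: positive root
    time_taken = pi_tcomm + pi_ncomp / x                # Step 8: Eq. 5.3
    return x, time_taken
```

Step 9 corresponds to repeating this evaluation over the candidate values of p and selecting the one with the smallest time taken.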
5.3 Utility Based Scalability
In utility theory, the utility function of an agent is a function that ranks all pairs of consumption bundles by order of
preference (completeness) such that any set of three or more bundles forms a transitive relation. This means that for
each bundle (x, y) there is a unique relation, U(x, y), representing the utility (satisfaction) relation associated with the
bundle (x, y). The relation (x, y) → U(x, y) is called the utility function. The range of the function is a set of real
numbers. The actual values of the function have no importance. Only the ranking of those values is significant in the
theory.
We use a utility function based on the cost associated with running a parallel algorithm. In our case, the consump-
tion bundle is the pair consisting of the number of cores used, p, and the frequency at which these cores operate, X .
Formally, for a parallel algorithm, a utility function Ucost(p,X) is defined as negative cost, and cost C is defined as
follows:
\[ C(p, X) = a \cdot E(p, X) + b \cdot T(p, X) \tag{5.4} \]
where E(p,X) and T (p,X) represent, respectively, the energy consumed by the parallel algorithm and the time taken
by it, a denotes the cost associated with consuming a single unit of energy and b denotes the cost associated with using
(e.g., with renting) a parallel computer for a single unit of time. We are interested in finding the appropriate bundle,
(p,X), which maximizes the utility function (i.e., minimizes the cost). Without any loss of generality, we consider
the following simplified cost function C(p,X):
\[ C(p, X) = \alpha \cdot E(p, X) + T(p, X) \tag{5.5} \]
where $\alpha = a/b$.
We are interested in the following question: given a parallel algorithm, an architecture model, and the ratio α, what
are the optimal number of cores and their frequencies that minimize the cost function C(p,X) as a function of problem
size? In other words, we are interested in finding the appropriate configuration of the parallel computer (number of
cores and their frequencies) such that the cost of executing the parallel algorithm on the parallel system is minimized.
Note that using more than the optimal number of cores will lead to an increase in the amount of energy consumed
without a corresponding advantage of decreasing the overall cost. In other words, using more cores than the optimal
number will lead to an energy waste (as measured by the utility function).
We now describe our methodology to determine for a given parallel algorithm as a function of the problem size,
the optimal number of cores and the frequencies at which they should operate. By optimal we mean that the cost
function C(p,X) associated with executing the algorithm is minimized at the particular values of p and X .
As an initial step, we evaluate energy consumed E(p,X) by the execution of the parallel algorithm, and the (total)
time taken T (p,X) in the execution. We do this by the following series of steps: Steps 1-5 are the same as in the
methodology for evaluating energy bounded scalability.
Step 6. Using the energy model, frame an expression for the energy consumed by the parallel algorithm as a function
of the frequency of the cores E(p,X). The energy expression E is the sum of the energy consumed by: 1) the
computation carried out by the algorithm, Ecomp, 2) the communication required by the algorithm, Ecomm, and 3)
leakage when cores are idle, Eleak. Energy consumption for each of these components is given by the following
equations:
\[ E_{comp} = E_d \cdot S_{comp} \cdot X^2 \]
\[ E_{leak} = E_l \cdot T_{active} \cdot X = E_l \cdot p(\pi_{ncomp} + \pi_{tcomm} \cdot X) \]
Step 7. Compute the execution time of the parallel algorithm, T (p,X):
\[ T(p, X) = \pi_{tcomm} + \pi_{ncomp} \cdot \frac{1}{X} \tag{5.6} \]
Note that the execution time of an algorithm corresponds to the inverse of its performance.
After obtaining expressions for energy consumption and execution time, we now frame an expression for the cost
function C(p,X) of the parallel algorithm using Eq. 5.5:
\[ C(p, X) = \alpha \cdot (E_{comp} + E_{comm} + E_{leak}) + T(p, X) \tag{5.7} \]
Finally, we analyze the cost expression to obtain the number of cores and the frequencies at which they should
operate in order to minimize the cost as a function of the problem size.
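The cost expression of Eq. 5.7 can be evaluated and minimized numerically. The sketch below is one possible illustration for a fixed p: it evaluates C(p,X) on a uniform grid of frequencies and returns the cheapest one. The grid search, the default maximum frequency of 1.0 and all names are assumptions made for this example.

```python
def cost(alpha, x, p, pi_ncomp, pi_tcomm, s_comp, e_comm, e_d, e_l):
    """Eq. 5.7: alpha * (E_comp + E_comm + E_leak) + T(p, X)."""
    energy = e_d * s_comp * x ** 2 + e_comm + e_l * p * (pi_ncomp + pi_tcomm * x)
    time_taken = pi_tcomm + pi_ncomp / x   # Eq. 5.6
    return alpha * energy + time_taken


def best_frequency(alpha, p, pi_ncomp, pi_tcomm, s_comp, e_comm, e_d, e_l,
                   f_max=1.0, steps=1000):
    """Grid search over X in (0, f_max] for the frequency with minimum cost."""
    candidates = [f_max * i / steps for i in range(1, steps + 1)]
    return min(candidates,
               key=lambda x: cost(alpha, x, p, pi_ncomp, pi_tcomm,
                                  s_comp, e_comm, e_d, e_l))
```

To search over bundles (p, X), the same function would be called once per candidate p, with the p-dependent inputs recomputed each time.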
Chapter 6
A Simple Case Study: Parallel Addition Algorithm
In this chapter, we illustrate our methodology for evaluating energy-performance trade-off scalability metrics using a
simple parallel addition algorithm on a 2-dimensional square mesh. Note that the parallel addition algorithm falls into
the category of almost embarrassingly parallel algorithms because of its low (sub-linear) communication overhead.
The sequential addition algorithm is straightforward: it takes n − 1 additions to compute the sum of n numbers. The
running time and energy of the sequential algorithm are given by $T_{seq} = \beta \cdot (n-1) \cdot (1/F)$ and $E_{seq} = E_d \beta (n-1) F^2$,
respectively. Note that the problem size (W) of the addition algorithm is the same as the input size (n). The parallel addition
algorithm adds n numbers using p cores. Initially all n numbers are assumed to be distributed equally among the p
cores; at the end of the computation, one of the cores stores their sum. Without loss of generality, we assume that the
number of cores available is some power of two. The algorithm runs in log p steps. In the first step, half of the cores
send the sum they compute to the other half so that no core receives a sum from more than one core. The receiving
cores then add the received number to the local sum they have computed. We perform the same step recursively until
there is only one core left. At the end of the computation, a single core will store the sum of all n numbers. Note that the
above communication pattern corresponds to single-node accumulation.
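The algorithm described above can be simulated sequentially to count additions. The following Python sketch is only an illustration: it assumes p is a power of two that divides n, models each core by one entry of a list, and ignores communication cost. The function name is hypothetical.

```python
def parallel_add(values, p):
    """Simulate the log p step reduction.

    Returns (sum, additions on the critical path, additions at all cores).
    """
    n = len(values)
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    assert n >= p and n % p == 0, "n must be a positive multiple of p"
    chunk = n // p
    # Local phase: every core adds its own n/p numbers (n/p - 1 additions each).
    partial = [sum(values[i * chunk:(i + 1) * chunk]) for i in range(p)]
    critical = chunk - 1
    total = p * (chunk - 1)
    active = p
    while active > 1:
        half = active // 2
        for i in range(half):
            # Core i + half sends its sum; core i adds it to its local sum.
            partial[i] += partial[i + half]
        critical += 1      # the surviving core performs one addition per step
        total += half      # one addition per receiving core
        active = half
    return partial[0], critical, total
```

For n = 16 and p = 4 the simulation performs n/p − 1 + log p = 5 additions on the critical path and n − 1 = 15 additions in total, which agrees with the counts used in the analysis that follows.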
In the above algorithm, the critical path is the execution path of the core that has the sum of all numbers at the end.
As described earlier in Section 3.2.3, the time taken and energy spent for communication in single-node accumulation on
a 2-dimensional mesh are $(t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)$ and $(1/4)e_{wh}\sqrt{p}(\sqrt{p} + 1)\log p$, respectively. Furthermore,
the total number of computation steps at all cores during the single-node accumulation communication phase is p − 1.
Therefore, the total number of computation steps (additions) in the critical path and the total number of computation
steps at all cores are n/p − 1 + log p and (n/p − 1)p + (p − 1) = n − 1, respectively. In summary,
\[ \pi_{ncomp} = (n/p - 1 + \log p)\beta \]
\[ \pi_{tcomm} = (t_s + t_w)\log p + 2t_h(\sqrt{p} - 1) \]
\[ S_{comp} = (n - 1)\beta \]
\[ E_{comm} = \frac{1}{4} e_{wh}\sqrt{p}(\sqrt{p} + 1)\log p \]
where β denotes the number of cycles required for a single addition operation.
6.1 Energy Scalability under Iso-performance
In order to evaluate the energy scalability under iso-performance, we first scale the computation steps of critical path
so that the parallel performance of the parallel addition matches the specified performance requirement. Assuming
the performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at
maximum frequency F , the new reduced frequency at which all p cores should run is given by:
\[ X = \frac{(n/p - 1 + \log p)\beta}{(n-1)\frac{\beta}{F} - \left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)} = F \cdot \frac{(n/p - 1 + \log p)\beta}{(n-1)\beta - \left((K_{c0} + K_{c1})\log p + 2K_{c2}(\sqrt{p} - 1)\right)} \]
The total active time at all the cores at the new frequency is given by $p(n-1)(\beta/F)$. Therefore, the expression for the energy
consumption of the parallel addition algorithm as per equation 5.1 is given by
\[ E = E_{comp} + E_{comm} + E_{leak} = E_d(n-1)\beta X^2 + \frac{1}{4} e_{wh}\sqrt{p}(\sqrt{p} + 1)\log p + E_l\, p(n-1)\frac{\beta}{F} X \]
We now analyze the energy expression obtained above for the addition algorithm to evaluate energy scalability
under iso-performance. While we could differentiate the function with respect to the number of cores to compute the
minimum, this results in a rather cumbersome expression. Instead, we analyze the graphs expressing energy scalability
under iso-performance. We later provide an asymptotic analysis for the same.
Note that the energy expression is dependent on many variables. We can simplify this expression without loss
of generality. We set leakage energy constant El to 1. We express all energy values with respect to this normalized
energy value. In order to graph the differential, we make some assumptions about the other parameters. While these
assumptions compromise generality, we can consider the sensitivity of the analysis to a range of values for these
parameters. One parameter is the energy consumed for a single cycle at maximum frequency as a multiple of the
leakage energy constant. We assume this ratio to be 10, i.e., that $E_d \cdot F^2 = 10 \cdot E_l$. It turns out that this parameter is
not very significant for our analysis; in fact, large variations in the parameter do not significantly affect the shapes of
the graphs for the parallel algorithms we have studied.
Another parameter, k, represents the ratio of the energy consumed for single message transfer per hop, $e_{wh}$, and
the energy consumed for executing a single instruction at the maximum frequency. Thus, $e_{wh} = k \cdot E_d \cdot F^2$.
Figure 6.1: Addition: Energy curve with energy on the Z axis, number of cores on the X axis and input size on the Y axis with $k = 500$, $\beta = 1$, $K_{c2} = 5$. The black curve on the XY plane is the plot of the optimal number of cores required for minimum energy consumption with varying input size.
Figure 6.2: Sensitivity analysis: input size on the Z axis, optimal number of cores on the X axis, and k (the ratio of energy consumed for a single message communication to the energy consumed for executing single instruction at maximum frequency) on the Y axis.
We analyze the sensitivity of our results to a range of values of k. Fig. 6.1 plots the energy E of parallel addition as a
function of input size and the number of cores. We can see that for any input size n, initially the energy decreases
with increasing p and later on increases with increasing p. As explained earlier, this behavior can be understood by
the fact that energy for computation decreases with an increase in the number of cores running at reduced frequencies,
and energy for communication increases with increasing cores. Furthermore, we can see that increasing the problem
size leads to an increase in the optimal number of cores required for minimum energy consumption.
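The trend described for Fig. 6.1 can be reproduced with a small numerical experiment. The Python sketch below evaluates the energy expression of this section for powers of two p and returns the p with minimum energy. Several choices are assumptions made only for this illustration: F is normalized to 1, logarithms are taken base 2, and $K_{c0} = K_{c1} = K_{c2} = 5$ (the caption of Fig. 6.1 states only the value of $K_{c2}$). The function names are hypothetical.

```python
import math


def addition_energy(n, p, e_d=10.0, e_l=1.0, f=1.0, k=500.0, beta=1.0, kc=5.0):
    """Energy of parallel addition under iso-performance; None if p is infeasible."""
    logp = math.log2(p)
    pi_ncomp = (n / p - 1 + logp) * beta
    # (Kc0 + Kc1) log p + 2 Kc2 (sqrt(p) - 1), with all three constants set to kc.
    comm_cycles = 2 * kc * logp + 2 * kc * (math.sqrt(p) - 1)
    denom = (n - 1) * beta - comm_cycles
    if denom <= 0:
        return None                      # communication alone misses the target
    x = f * pi_ncomp / denom             # reduced frequency
    if x > f:
        return None                      # would need more than the maximum frequency
    e_wh = k * e_d * f ** 2
    e_comp = e_d * (n - 1) * beta * x ** 2
    e_comm = 0.25 * e_wh * math.sqrt(p) * (math.sqrt(p) + 1) * logp
    e_leak = e_l * p * (n - 1) * (beta / f) * x
    return e_comp + e_comm + e_leak


def optimal_cores(n, max_exp=20):
    """Scan p = 1, 2, 4, ... and return the p with the smallest energy."""
    best_p, best_e = None, None
    for exp in range(max_exp + 1):
        p = 2 ** exp
        if p > n:
            break
        energy = addition_energy(n, p)
        if energy is not None and (best_e is None or energy < best_e):
            best_p, best_e = p, energy
    return best_p
```

Under these assumed parameters the scan shows the same qualitative behavior as the figure: energy first falls and then rises with p, and the minimizing p is larger for larger n.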
We now consider the sensitivity of this analysis with respect to the ratio k. Fig. 6.2 plots the optimal number of
cores required for minimum energy consumption by varying the problem size and k. It shows that for a fixed problem
size, the optimal number of cores required for minimum energy consumption decreases with increasing k. Moreover,
it is evident that with increasing problem size, the trend remains the same.
The above graph analysis depicts the exact behavior of the optimal number of cores as a function of problem size
for the given input range and appears to generalize to larger problem sizes. We now provide an analytic expression
for the asymptotic behavior of the optimal number of cores as a function of problem size. Note that, for the addition
algorithm, problem size (W) is the same as the input size.
Asymptotic Analysis: Note that if $n/p \gg \log p$, i.e., $n \gg p \cdot \log p$, then $X \approx F/p$. Thus, the energy consumed by
the parallel addition algorithm running on p cores at frequency X is given by:
\[ E = E_d \beta (n-1)\frac{F^2}{p^2} + \frac{1}{4} e_{wh}\sqrt{p}(\sqrt{p} + 1)\log p + E_l(n-1)\beta \tag{6.1} \]
The optimal number of cores required for minimum energy consumption is given by
\[ p_{opt} = \left( \frac{\frac{24 E_d F^2}{e_{wh}}\, n}{\log\left(\frac{8 E_d F^2}{e_{wh}}\, n\right)} \right)^{1/3} = O\left(\left(\frac{n}{\log n}\right)^{1/3}\right) = O\left(\left(\frac{W}{\log W}\right)^{1/3}\right) \]
Thus, the asymptotic energy scalability under iso-performance of the parallel addition algorithm is $O((W/\log W)^{1/3})$.
Note that n should be greater than $p \log p$ for this asymptotic result to apply.
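The cube-root growth can be seen from a leading-order stationarity argument. The following is only a rough sketch, assuming $X \approx F/p$, $n - 1 \approx n$ and large p, so that only the dominant part of the derivative of the communication term is kept. It is meant to show the order of growth and does not attempt to reproduce the constants of the expression above.

```latex
\frac{\partial E}{\partial p}
  \approx -\frac{2 E_d \beta n F^2}{p^3} + \frac{1}{4} e_{wh} \log p = 0
\quad\Longrightarrow\quad
p^3 \log p \approx \frac{8 E_d \beta F^2}{e_{wh}}\, n
\quad\Longrightarrow\quad
p = O\!\left(\left(\frac{n}{\log n}\right)^{1/3}\right)
```

The last step uses the fact that $\log p$ is proportional to $\log n$ when p grows polynomially in n.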
6.2 Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel addition algorithm on a 2-dimensional square mesh.
The total active time (Tactive) at all the cores as a function of the frequency of the cores is given by
\[ T_{active} = p\left((n/p - 1 + \log p)\beta\frac{1}{X} + (t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) \]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model is given by
\[ E_{comp} = E_d \cdot S_{comp} \cdot X^2 \]
\[ E_{leak} = E_l \cdot T_{active} \cdot X = E_l \cdot p\left((n/p - 1 + \log p)\beta + \left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) \cdot X\right) \]
Here, we fix the energy budget E to be that of the energy required for the sequential algorithm, running on a single
core at maximum frequency F (Eseq). Given the energy budget Eseq , the frequency X with which the cores should
run is given by
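\[ X = \frac{-E_l p \pi_{tcomm}}{2E_d(n-1)\beta} + \frac{\sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4E_d(n-1)\beta\left(E_d\beta(n-1)F^2 - \frac{1}{4}e_{wh}\sqrt{p}(\sqrt{p} + 1)\log p - E_l p(n/p - 1 + \log p)\beta\right)}}{2E_d(n-1)\beta} \]
where $\pi_{tcomm} = (t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)$.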
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[ \text{Time Taken} = (t_s + t_w)\log p + 2t_h(\sqrt{p} - 1) + (n/p - 1 + \log p)\beta\frac{1}{X} \tag{6.2} \]
We now analyze the time taken expression obtained above for the addition algorithm to evaluate its energy bounded
scalability. First, we analyze the graphs expressing energy bounded scalability. Later, we provide an asymptotic analysis
for the same. We use the same assumptions mentioned earlier for the energy scalability under iso-performance analysis of
the parallel addition algorithm. We also analyze the sensitivity of our results to a range of values of k, the ratio of the
energy consumed for a single message transfer, $e_{wh}$, to the energy consumed for executing a single instruction at the
maximum frequency.
Figure 6.3: Addition: performance curve with time taken on the Z axis, number of cores on the X axis and problem size on the Y axis with $k = 10$, $\beta = 1$, $K_c = 5$. Time taken is plotted in units of 1/F where F is the maximum frequency. Number of cores is plotted in units of $10^5$. Problem size is plotted from $6 \times 10^7$ to $10^8$ in units of $10^6$. The black curve on the XY plane is the plot of the optimal number of cores required for maximum performance with varying problem size. For any input size, the strict upper bound on the number of cores is depicted by the distorted portion at the end of the curve.
Figure 6.4: Sensitivity analysis: optimal number of cores on the Y axis, and k (ratio of the energy consumed for single message communication to the energy consumed for executing a single instruction at maximum frequency) on the X axis with input size $n = 10^7$.
Fig. 6.3 plots the time taken (inverse of performance) by the parallel addition algorithm as a function of problem size and
number of cores. We can see that for any input size n, initially the time taken by the algorithm decreases with increas-
ing p and later on increases with increasing p. This behavior can be understood by the fact that performance increases
with an increase in the number of cores, and the energy remaining for computation decreases with increasing cores
(the difference between the energy budget and the energy used for communication). However, the behavior shows that
the optimal number of cores required for maximum performance is in the order of problem size. Furthermore, we can
see that increasing the problem size leads to an increase in the optimal number of cores.
We now consider the sensitivity of this analysis with respect to the ratio k. Fig. 6.4 plots the optimal number
of cores required for maximum performance by fixing the problem size and varying k. The plot shows that for a
fixed problem size, the optimal number of cores required for maximum performance decreases with increasing k,
approximating a c/k curve where c is some constant.
The above graph analysis depicts the exact behavior of the optimal number of cores as a function of problem size
for the given input range and appears to generalize to larger problem sizes. We now provide an analytic expression for
the asymptotic behavior of the optimal number of cores as a function of problem size.
Asymptotic Analysis: Note that if $n \gg p \cdot \log p$, then $X \approx F$ because most of the energy is consumed in the first
phase of the algorithm when all the processors are active. Thus, the time taken by the parallel addition algorithm
running on p cores at frequency F is given by:
\[ \text{Time Taken} = (t_s + t_w)\log p + 2t_h(\sqrt{p} - 1) + (n/p - 1 + \log p)\beta\frac{1}{F} = O(\sqrt{p} + n/p) \]
The optimal number of cores required for maximum performance under the energy budget is given by
\[ p_{opt} = O\left(n^{2/3}\right) = O\left(W^{2/3}\right) \]
Thus, the asymptotic energy bounded scalability of the parallel addition algorithm is $O(W^{2/3})$. Note that n should
be greater than $p \log p$ for this asymptotic result to apply.
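The exponent 2/3 follows from balancing the two dominant terms of the time expression. As a sketch, assuming $X \approx F$ and keeping only the $\sqrt{p}$ and $n/p$ terms:

```latex
T(p) \approx 2 t_h \sqrt{p} + \frac{n\beta}{pF},
\qquad
\frac{dT}{dp} = \frac{t_h}{\sqrt{p}} - \frac{n\beta}{p^2 F} = 0
\quad\Longrightarrow\quad
p^{3/2} = \frac{n\beta}{F\, t_h}
\quad\Longrightarrow\quad
p_{opt} = \left(\frac{n\beta}{F\, t_h}\right)^{2/3} = O\!\left(n^{2/3}\right)
```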
6.3 Utility Based Scalability
We now evaluate the utility based scalability of the parallel addition algorithm on a 2-dimensional square mesh. The
expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the
energy model is given by
\[ E_{comp} = E_d \cdot S_{comp} \cdot X^2 \]
\[ E_{leak} = E_l \cdot T_{active} \cdot X = E_l \cdot p\left((n/p - 1 + \log p)\beta + \left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) \cdot X\right) \]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[ T(p, X) = \left((K_{c0} + K_{c1})\log p + 2K_{c2}(\sqrt{p} - 1)\right)\frac{1}{F} + (n/p - 1 + \log p)\beta\frac{1}{X} \tag{6.3} \]
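We now frame an expression for the cost function C(p,X) of the parallel addition algorithm using Eq. 5.5:
\begin{align*}
C(p, X) &= \alpha(E_{comp} + E_{comm} + E_{leak}) + T(p, X) \\
&= \alpha\left(E_d \cdot S_{comp} \cdot X^2 + \frac{1}{4} e_{wh}\sqrt{p}(\sqrt{p} + 1)\log p\right) \\
&\quad + \alpha\left(E_l \cdot p\left((n/p - 1 + \log p)\beta + \left((K_{c0} + K_{c1})\log p + 2K_{c2}(\sqrt{p} - 1)\right)\frac{X}{F}\right)\right) \\
&\quad + \left((K_{c0} + K_{c1})\log p + 2K_{c2}(\sqrt{p} - 1)\right)\frac{1}{F} + (n/p - 1 + \log p)\beta\frac{1}{X}
\end{align*}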
As noted earlier, E increases and T decreases as either p or X increases. Thus, there exists an optimal configuration
at which the cost is minimized. The cost surface shown in Fig 6.5 depicts the energy-time trade-off assuming a
(reasonable) set of values for the various parameters. In the next section, we analyze the cost expression obtained
above in order to understand the properties of optimal configurations.
We now analyze the cost expression obtained above for the addition algorithm in order to determine the number
of cores and their frequencies which minimize cost as a function of the problem size.
We use the same assumptions mentioned earlier for the energy scalability under iso-performance analysis of the
parallel addition algorithm. We also analyze the sensitivity of our results to a range of values of k and α. For
convenience, we denote X = γ ·F such that 0 < γ < 1 (since 0 < X < F ). Note that, with the above simplification,
the cost expression will be free of F and the range of X is mapped correspondingly to the range of the parameter γ.
We use a brute force search technique to evaluate the optimal pair of number of cores and frequency of the cores
required to minimize cost. Fig. 6.6 and Fig. 6.7 plot, respectively, the number of cores and the frequency of these
cores (when active) which minimize the cost as a function of problem size. We see that the optimal number of
cores required increases with increasing problem size. However, the frequency of cores required to minimize the cost
decreases with problem size.
Figure 6.5: Addition: Cost curve with C(p,X) on the Z axis, number of cores on the X axis and frequency on the Y axis with $k = 500$, $\beta = 1$, $K_c = 500$ and $\alpha = 0.1$. Frequency is plotted in units of F/100 where F is the maximum frequency. The number of cores is plotted in units of $10^4$.
Figure 6.6: Addition: optimal number of cores on the Y axis and input size on the X axis with $k = 500$, $\beta = 1$, $K_c = 500$ and $\alpha = 0.1$.
Figure 6.7: Addition: Optimal frequency on the Y axis and input size on the X axis with $k = 500$, $\beta = 1$, $K_c = 500$ and $\alpha = 0.1$. Frequency on the Y axis is plotted in units of F/100000 where F is the maximum frequency.
We now consider the sensitivity of this analysis with respect to k: recall that k is the ratio of energy used per
message over the energy used per instruction. Fig. 6.8 and Fig. 6.9 plot the number of cores and the frequency of the
cores required to minimize cost by fixing the problem size and varying k respectively. We see that optimal number
of cores decreases with increasing k. Furthermore, the frequency of cores required to minimize cost increases with
increasing k. We observe that this trend remains the same for the range of input values we studied ($10^8$ to $10^{10}$).
Figure 6.8: Sensitivity analysis: optimal number of cores on the Y axis and k (ratio of the energy consumed for single message transfer and the energy consumed for executing a single instruction at the maximum frequency) on the X axis with $n = 10^8$, $\beta = 1$, $K_c = 500$ and $\alpha = 0.1$.
Figure 6.9: Sensitivity analysis: Optimal frequency on the Y axis and k (ratio of the energy consumed for single message transfer and the energy consumed for executing a single instruction at the maximum frequency) on the X axis with $n = 10^8$, $\beta = 1$, $K_c = 500$ and $\alpha = 0.1$. Frequency on the Y axis is plotted in units of F/100000.
We now consider the sensitivity of this analysis with respect to the ratio α, the cost of unit energy. Fig. 6.10 and
Fig. 6.11 plot, respectively, the number of cores and their frequency (when active) that are required to minimize cost.
We do this by fixing the problem size and varying α. We note that both the optimal number of cores and the frequency
of the cores required to minimize cost decrease with increasing α. We observe that this trend also remains the same
for the entire range of the input values considered ($10^8$ to $10^{10}$).
Figure 6.10: Sensitivity analysis: optimal number of cores on the Y axis and cost of unit energy (α) (measured relative to the cost of running the parallel system for unit time) on the X axis with $n = 10^8$, $\beta = 1$, $K_c = 500$ and $k = 500$.
Figure 6.11: Sensitivity analysis: Optimal frequency on the Y axis and cost of unit energy (α) (measured relative to the cost of running the parallel system for unit time) on the X axis with $n = 10^8$, $\beta = 1$, $K_c = 500$ and $k = 500$. Frequency on the Y axis is plotted in units of F/100000.
The above graphs depict the exact behavior of the optimal number of cores and their frequencies as a function of
problem size over a specific input range. We now provide an analytic expression for the asymptotic behavior of the
optimal number of cores and frequency as a function of problem size. We use a polynomial multivariate optimization
technique for this purpose.
Asymptotic analysis
Note that if $n \gg p \cdot \log p$, the cost expression of the parallel addition algorithm running on p cores at frequency
X can be approximated as
\[ C(p, X) \approx \alpha\left(E_d n X^2 + \frac{1}{4} e_{wh}\, p \log p\right) + 2K_{c2}\sqrt{p}\,\frac{1}{F} + \frac{n}{pX} \]
The optimal number of cores and frequency required for minimum cost vary with problem size (W) as follows:
\[ p_{opt} = O\left(\frac{n^{4/5}}{(\log n)^{3/5}}\right) = O\left(\frac{W^{4/5}}{(\log W)^{3/5}}\right) \]
\[ F_{opt} = O\left(\frac{(\log n)^{1/5}}{n^{4/15}}\right) = O\left(\frac{(\log W)^{1/5}}{W^{4/15}}\right) \]
Note that n should be greater than $p \log p$ for these asymptotic results to apply.
Chapter 7
Dense Matrix Algorithms
Algorithms involving matrices and vectors are applied in several numerical and non-numerical contexts. We can
classify matrices into two broad categories according to the kind of algorithms that are appropriate for them. The first
category is dense or full matrices with few or no zero entries. The second category is sparse matrices, in which a
majority of the elements are zero. In this section, we evaluate scalability metrics for some key algorithms of dense
matrices on mesh interconnection.
In order to process a matrix in parallel, we must partition it so that the partitions can be assigned to different
processors. We briefly discuss some common ways to partition matrices among processors. We will refer to these as
partitioning schemes or mapping schemes. In the striped partitioning of a matrix, the matrix is divided into groups of
complete rows or columns, and each processor is assigned one such group. The partitioning is uniform if each group
contains an equal number of rows or columns. The partitioning is called block-striped if each processor is assigned
contiguous rows or columns. In checkerboard partitioning, the matrix is divided into smaller square or rectangular blocks
or submatrices that are distributed among processors. In a uniform checkerboard partitioning, all submatrices are of
the same size. A checkerboard partitioning splits both the rows and the columns of the matrix, so no processor is as-
signed any complete row or column. A checkerboard-partitioned square matrix maps naturally onto a two-dimensional
square mesh of processors. Therefore, for a checkerboard mapping, it is often convenient to visualize the ensemble of
processors as a logical two-dimensional mesh.
7.1 Matrix Transposition
The transpose of an n × n matrix A is a matrix $A^T$ of the same size, such that $A^T[i, j] = A[j, i]$ for $0 \le i, j < n$. In
the process of transposing a matrix, all elements below the principal diagonal move to positions above the principal
diagonal and vice versa. If we assume that it takes unit time to exchange a pair of matrix elements, then the sequential
runtime of transposing an n × n matrix is $(n^2 - n)/2$, which can be approximated by $n^2/2$. Thus, the problem
size of matrix transposition is $O(n^2)$. The following sections evaluate scalability metrics of parallel algorithms for
transposing square matrices using different partitioning schemes.
7.1.1 Checkerboard Partitioning
In this section, we consider an n × n matrix mapped onto a logical square mesh of processors by using checkerboarding.
Assume that an n × n matrix is stored in an n × n mesh of processors so that each processor holds a single element of
the matrix. To obtain the transpose, the matrix elements located below the principal diagonal must move to the corresponding
diametrically opposite locations above the diagonal, and vice versa. An element located below the diagonal first
moves up to the diagonal, and then right to its destination processor. Similarly, an element above the diagonal
moves down to the diagonal and then left to its destination processor. Now consider the case in which the number of
processors p is less than $n^2$, and the matrix is distributed among the processors by using a uniform block-checkerboard
partitioning. The transpose of the entire matrix can be computed in two phases. In the first phase, the square matrix
blocks are treated as indivisible units, and the two-dimensional array of blocks is transposed. In the second phase, all
blocks are transposed locally within their respective processors.
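The two phases can be illustrated with a sequential Python sketch. It assumes a square matrix stored as a list of rows and a logical q × q mesh (so $p = q^2$) with q dividing n. Processors are modeled simply as entries of a nested list, and the function name is hypothetical.

```python
def block_transpose(matrix, q):
    """Transpose an n x n matrix held as q x q blocks (p = q * q processors)."""
    n = len(matrix)
    assert n % q == 0, "q must divide n"
    b = n // q
    # blocks[r][c] is the b x b submatrix owned by processor (r, c).
    blocks = [[[row[c * b:(c + 1) * b] for row in matrix[r * b:(r + 1) * b]]
               for c in range(q)] for r in range(q)]
    # Phase 1: blocks move as indivisible units; (r, c) receives the block of (c, r).
    moved = [[blocks[c][r] for c in range(q)] for r in range(q)]
    # Phase 2: every processor transposes its own block locally.
    local = [[[list(col) for col in zip(*moved[r][c])] for c in range(q)]
             for r in range(q)]
    # Reassemble the distributed blocks into one matrix for inspection.
    result = []
    for r in range(q):
        for i in range(b):
            result.append([x for c in range(q) for x in local[r][c][i]])
    return result
```

Only Phase 1 involves communication between processors, which is why the communication cost analyzed next depends on the distances that blocks travel in the mesh.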
During the communication phase, the matrix blocks initially residing on the bottom-left and top-right processors
cover the longest distances to swap their locations. These paths, covering approximately $2\sqrt{p}$ links each, determine
the total time spent in the communication phase. Since a block containing $n^2/p$ elements takes $t_s + t_w n^2/p$ time to
move across a single link, it takes a total of $2(t_s + t_w n^2/p)\sqrt{p}$ time for all the blocks to move to their final destinations
in the mesh of processors. Total energy spent during the communication phase is given by
\[ E_{comm} = \sum_{i=1}^{\sqrt{p}-1} 4 e_{wh}\left(\frac{n^2}{p}\right)(\sqrt{p} - i)\, i = 4 e_{wh}\frac{n^2}{p} \cdot \frac{\sqrt{p}(p-1)}{6} \approx \frac{2}{3} e_{wh} n^2 \sqrt{p} \]
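The closed form of the sum can be checked with the standard formulas for $\sum i$ and $\sum i^2$. As a short worked step, writing $m = \sqrt{p}$:

```latex
\sum_{i=1}^{m-1} i\,(m - i)
  = m \cdot \frac{(m-1)m}{2} - \frac{(m-1)m(2m-1)}{6}
  = \frac{m(m-1)(m+1)}{6}
  = \frac{\sqrt{p}\,(p-1)}{6}
```

Multiplying by $4 e_{wh} n^2/p$ gives $\frac{2}{3} e_{wh} n^2 (p-1)/\sqrt{p}$, which is approximately $\frac{2}{3} e_{wh} n^2 \sqrt{p}$ when p is large.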
In the computation phase, each processor performs approximately $n^2/(2p)$ exchanges in transposing its
$(n/\sqrt{p}) \times (n/\sqrt{p})$ local submatrix. Thus, the total number of exchanges at all processors during this phase is
approximately $n^2/2$. In summary,
\[ \pi_{ncomp} = \frac{n^2}{2p}\beta \]
\[ \pi_{tcomm} = 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} \]
\[ S_{comp} = \frac{n^2}{2}\beta \]
\[ E_{comm} = \frac{2}{3} e_{wh} n^2 \sqrt{p} \]
where β is the number of cycles required for a local exchange.
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel matrix transposition algorithm on a 2-dimensional
mesh interconnect with checkerboard partitioning. We first scale the computation steps of the critical path so that the
performance of parallel matrix transposition matches the specified performance requirement. Assuming the
performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at
maximum frequency F, the new reduced frequency at which all p cores should run is given by:
\[
\begin{aligned}
X &= \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}} \\
  &= \frac{\left(\frac{n^2}{2p}\right)\beta}{\frac{n^2\beta}{2F} - 2\left(t_s + t_w \frac{n^2}{p}\right)\sqrt{p}} \\
  &= F\, \frac{\left(\frac{n^2}{2p}\right)\beta}{\frac{n^2\beta}{2} - 2\left(K_{c0} + K_{c1}\frac{n^2}{p}\right)\sqrt{p}}
\end{aligned}
\]
The total active time at all the cores at the new frequency is given by $p n^2 \beta/(2F)$. Therefore, the expression for the energy
consumption of the parallel transposition algorithm with checkerboard partitioning as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d \frac{n^2}{2}\beta X^2 + \frac{2}{3} e_{wh} n^2 \sqrt{p} + E_l \frac{p n^2 \beta}{2F} X
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F/p$. Thus, the energy consumed by the parallel matrix
transposition algorithm on a 2-dimensional mesh interconnect (with checkerboard partitioning) with p cores
running at frequency X is given by:
\[
E = E_d \frac{n^2}{2}\beta \frac{F^2}{p^2} + \frac{2}{3} e_{wh} n^2 \sqrt{p} + E_l \frac{n^2 \beta}{2}
\tag{7.1}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = \left(\frac{3 E_d \beta F^2}{e_{wh}}\right)^{2/5} = O(1)
\]
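The stationary point can be cross-checked against a brute-force search over integer core counts. The sketch below evaluates Equation 7.1 directly; all parameter values in the usage note are hypothetical and chosen only to make the check concrete.

```python
def energy_iso_perf(p, n, E_d, E_l, e_wh, beta, F):
    """Equation 7.1: energy with all p cores running at X = F / p."""
    return (E_d * (n * n / 2) * beta * F * F / (p * p)
            + (2 / 3) * e_wh * n * n * p ** 0.5
            + E_l * n * n * beta / 2)

def p_opt_closed(E_d, e_wh, beta, F):
    """Stationary point of Equation 7.1 with respect to p."""
    return (3 * E_d * beta * F * F / e_wh) ** (2 / 5)

def p_opt_search(n, E_d, E_l, e_wh, beta, F, p_max=4096):
    """Integer core count in 1..p_max with the lowest energy."""
    return min(range(1, p_max + 1),
               key=lambda p: energy_iso_perf(p, n, E_d, E_l, e_wh, beta, F))
```

For example, with the illustrative values E_d = 1, E_l = 0.1, e_wh = 0.01, β = 2 and F = 10, the search returns the same core count for n = 10^3 and n = 10^5, within one core of the closed form. This is the sense in which the optimum is independent of problem size.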
Thus, the asymptotic energy scalability under iso-performance of the parallel matrix transposition algorithm on a
2-dimensional mesh interconnect with checkerboard partitioning is O(1). In other words, the optimal number of
cores for minimal energy under the performance budget (corresponding to the best sequential algorithm) is constant
irrespective of the problem size (not scalable). Note that n should be greater than p for this asymptotic result to apply.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel matrix transposition algorithm on a 2-dimensional mesh
interconnect with checkerboard partitioning. The total active time ($T_{active}$) at all the cores as a function of the
frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\pi_{ncomp}\frac{1}{X} + \pi_{tcomm}\right) \\
           &= p\left(\frac{n^2}{2p}\beta\frac{1}{X} + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\right)
\end{aligned}
\]
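The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d \cdot \frac{n^2}{2}\beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^2}{2p}\beta + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\,X\right)
\end{aligned}
\]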
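Here, we fix the energy budget E to be the energy required for the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l\, p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp}\left(E - E_{comm} - E_l\, p\, \pi_{ncomp}\right)}}{2 E_d S_{comp}} \\
  &= \frac{-2 E_l\, p\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \sqrt{4 E_l^2 p^3\left(t_s + t_w\frac{n^2}{p}\right)^2 + 2 E_d n^2 \beta\left(\frac{1}{2}E_d n^2 \beta F^2 - \frac{2}{3} e_{wh} n^2 \sqrt{p} - E_l\frac{n^2}{2}\beta\right)}}{E_d n^2 \beta}
\end{aligned}
\]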
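The general form of this frequency is the positive root of a quadratic in X, obtained by setting $E_d S_{comp} X^2 + E_{comm} + E_l\, p\,(\pi_{ncomp} + \pi_{tcomm} X)$ equal to the budget E. The following is a minimal sketch of that calculation; the function names are ours, and the error raised for an infeasible budget is our own convention rather than part of the model.

```python
from math import sqrt

def frequency_for_budget(E, E_comm, E_d, E_l, p, S_comp, pi_ncomp, pi_tcomm):
    """Larger root X of E_d*S_comp*X^2 + E_l*p*pi_tcomm*X + (E_comm + E_l*p*pi_ncomp - E) = 0."""
    b = E_l * p * pi_tcomm
    disc = b * b + 4 * E_d * S_comp * (E - E_comm - E_l * p * pi_ncomp)
    if disc < 0:
        raise ValueError("energy budget too small for this configuration")
    return (-b + sqrt(disc)) / (2 * E_d * S_comp)

def total_energy(X, E_comm, E_d, E_l, p, S_comp, pi_ncomp, pi_tcomm):
    """E_comp + E_comm + E_leak at frequency X, following the model above."""
    return E_d * S_comp * X * X + E_comm + E_l * p * (pi_ncomp + pi_tcomm * X)
```

Substituting the returned X back into `total_energy` reproduces the budget E, which is a quick way to validate a hand-expanded instance of the formula. The root is positive only when $E - E_{comm} - E_l\, p\, \pi_{ncomp} > 0$.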
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \frac{n^2}{2p}\beta\frac{1}{X}
\tag{7.2}
\]
Asymptotic Analysis: For X to have a valid solution, $E - E_{comm}$ must be greater than zero, which
simplifies to
\[
p < \left(\frac{3 E_d \beta F^2}{4 e_{wh}}\right)^2
\]
If $n \gg p$, the time taken by the parallel algorithm decreases with p. Thus the optimal number of cores required
for maximum performance under the energy budget is given by
\[
p_{opt} = \left(\frac{3 E_d \beta F^2}{4 e_{wh}}\right)^2 = O(1)
\]
Thus, the asymptotic energy bounded scalability of this parallel algorithm is O(1). In other words, the optimal number of
cores for maximum performance under the energy budget (corresponding to the best sequential algorithm) is constant
irrespective of the problem size. Thus, the parallel algorithm is not energy bounded scalable.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel matrix transposition algorithm on a 2-dimensional mesh
interconnect with checkerboard partitioning. The expression for the energy consumed by the parallel algorithm as a
function of the frequency of the cores, using the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d \cdot \frac{n^2}{2}\beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^2}{2p}\beta + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\,X\right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p, X) = 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \frac{n^2}{2p}\beta\frac{1}{X}
\tag{7.3}
\]
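We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p, X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X) \\
        &= \alpha\left(E_d \cdot \frac{n^2}{2}\beta X^2 + \frac{2}{3} e_{wh} n^2 \sqrt{p}\right) \\
        &\quad + \alpha\, E_l\, p\left(\frac{n^2}{2p}\beta + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\,X\right) \\
        &\quad + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \frac{n^2}{2p}\beta\frac{1}{X}
\end{aligned}
\]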
Asymptotic Analysis: Note that if $n \gg p$, then the cost expression of the parallel matrix transposition algorithm on a
2-dimensional mesh interconnect with checkerboard partitioning running on p cores at frequency X can be approximated
as
\[
C(p, X) \approx O\left(n^2 X^2 + n^2\sqrt{p} + \frac{n^2}{\sqrt{p}} + \frac{n^2}{pX}\right)
\]
Since each term in the cost expression includes the square of the input size, the optimal number of cores and frequency
required for minimum cost do not vary with problem size.
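Because $n^2$ factors out of every term, the minimizing pair (p, X) is the same for every n. The sketch below illustrates this with a brute-force grid search over the approximated cost, with all hidden constants set to 1. That choice, the function names and the search grid are our own assumptions made purely for illustration.

```python
def asymptotic_cost(p, X, n):
    """n^2 X^2 + n^2 sqrt(p) + n^2 / sqrt(p) + n^2 / (p X), constants dropped."""
    return n * n * (X * X + p ** 0.5 + p ** -0.5 + 1.0 / (p * X))

def argmin_cost(n, p_values, x_values):
    """Brute-force search for the (p, X) pair with the lowest cost."""
    return min(((p, X) for p in p_values for X in x_values),
               key=lambda pX: asymptotic_cost(pX[0], pX[1], n))
```

Calling `argmin_cost` with the same grid for two very different values of n returns the same pair, which is the O(1) utility based scalability stated above.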
7.1.2 Striped Partitioning
Consider an n × n matrix mapped onto n processors such that each processor contains one full row of the matrix. With
this mapping, $P_i$ initially contains the elements of the matrix with indices [i, 0], [i, 1], ..., [i, n−1]. Element [i, j] initially
resides on $P_i$, but moves to $P_j$ during the transposition. Note that in the data transfer pattern of this procedure every
processor sends a distinct part of the matrix to every other processor. This is an example of all-to-all personalized
communication (Section 3.2.3).
In general, if we use p processors such that p ≤ n, then each processor initially stores n/p rows ($n^2/p$ elements)
of the matrix. Performing the transposition now involves an all-to-all personalized communication of matrix blocks
of size (n/p) × (n/p), instead of individual elements. At the end of the communication phase, each processor performs an
internal transposition of these blocks. One such block can be transposed using $n^2/(2p^2)$ pairwise exchanges. Since
each processor has p such blocks, it requires $n^2/(2p)$ pairwise exchanges to transpose them. As derived earlier in Section 3.2.3,
the total time and energy spent during the all-to-all personalized communication phase with message size $m = n^2/p^2$
are $(2t_s + t_w n^2/p)(\sqrt{p} - 1)$ and $e_{wh} n^2(\sqrt{p} - 1)$ respectively. Moreover, the total number of pairwise exchanges at all processors comes
to $n^2/2$. In summary,
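\[
\begin{aligned}
\pi_{ncomp} &= \frac{n^2}{2p}\,\beta \\
\pi_{tcomm} &= \left(2t_s + t_w\frac{n^2}{p}\right)\left(\sqrt{p} - 1\right) \\
S_{comp} &= \frac{n^2}{2}\,\beta \\
E_{comm} &= e_{wh} n^2\left(\sqrt{p} - 1\right)
\end{aligned}
\]
where $\beta$ is the number of cycles required for a local pairwise exchange.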
Note that, this algorithm has very similar characteristics compared to the checkerboard partitioning algorithm on
the mesh interconnect. Thus the scalability results would remain the same as for checkerboard partitioning.
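The data movement of the striped formulation can be sketched as a sequential simulation. In the following, the p "processors" are plain Python lists, the exchange is modelled by copying blocks between them, and we assume p divides n; it illustrates the block exchange followed by local transposition, not a parallel implementation.

```python
def striped_transpose(matrix, p):
    """Transpose an n x n matrix held as p row stripes (p must divide n)."""
    n = len(matrix)
    b = n // p  # rows per processor, and side of each exchanged block
    # stripes[i] holds the b rows owned by processor i.
    stripes = [matrix[i * b:(i + 1) * b] for i in range(p)]
    # All-to-all personalized exchange: processor i sends its j-th
    # (b x b) block to processor j.
    received = [[None] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            received[j][i] = [row[j * b:(j + 1) * b] for row in stripes[i]]
    # Local phase: each processor transposes every block it received and
    # lays the blocks side by side to form its rows of the result.
    result = []
    for j in range(p):
        local = [list(zip(*blk)) for blk in received[j]]
        for r in range(b):
            row = []
            for blk in local:
                row.extend(blk[r])
            result.append(row)
    return result
```

Each processor sends p blocks of $n^2/p^2$ elements, which matches the message size used in the communication cost above.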
7.2 Matrix-Vector Multiplication
In this section, we consider the problem of multiplying a dense n × n matrix A with an n × 1 vector x to yield the
n × 1 result vector y. The sequential algorithm requires $n^2$ multiplications and additions. Thus, the problem size
$W = O(n^2)$. At least two distinct formulations of matrix-vector multiplication are possible, depending on whether
rowwise striping or checkerboard partitioning is used.
7.2.1 Rowwise Striping
We first describe the case in which the n × n matrix is striped among n processors so that each processor stores one
complete row of the matrix. The n × 1 vector x is distributed such that each processor stores one of its elements.
Processor $P_i$ initially stores x[i] and A[i, 0], A[i, 1], ..., A[i, n − 1] and is responsible for computing y[i]. Vector x is
multiplied with each row of the matrix; hence every processor needs the entire vector. Since each processor starts
with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processors. After
the vector x is distributed among the processors, processor $P_i$ computes $y[i] = \sum_{j=0}^{n-1} (A[i, j] \times x[j])$. The result vector
y is stored exactly the way the starting vector x was stored.
Consider the case in which p processors are used such that p < n, and the matrix is partitioned among the processors
by using block striping. Each processor initially stores n/p complete rows of the matrix and a portion of the
vector x of size n/p. Since the vector x has to be multiplied with each row of the matrix, every processor needs the
entire vector. This again requires an all-to-all broadcast. The all-to-all broadcast takes place among the p processors
and involves messages of size n/p. As derived earlier in Section 3.2.3, the total time and energy spent during this phase
with message size m = n/p are $2t_s(\sqrt{p} - 1) + t_w n$ and $e_{wh} n(p - 1)$ respectively. After this communication step,
each processor multiplies its n/p rows with the vector x to produce n/p elements of the result vector. Each processor
performs $n^2/p$ multiplications and additions during the computation phase. In summary,
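\[
\begin{aligned}
\pi_{ncomp} &= \frac{n^2}{p}\,\beta \\
\pi_{tcomm} &= 2t_s\left(\sqrt{p} - 1\right) + t_w n \\
S_{comp} &= n^2 \beta \\
E_{comm} &= e_{wh}\, n\,(p - 1)
\end{aligned}
\]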
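The two phases of the rowwise formulation (all-to-all broadcast of x, then local dot products) can be sketched as a sequential simulation. The code below models the p processors as list slices and assumes that p divides n; it is an illustration of the data layout only.

```python
def rowwise_matvec(A, x, p):
    """Multiply A (n x n) by x with a simulated row-striped layout (p divides n)."""
    n = len(A)
    b = n // p  # rows of A, and entries of x, per processor
    local_rows = [A[i * b:(i + 1) * b] for i in range(p)]
    local_x = [x[i * b:(i + 1) * b] for i in range(p)]
    # All-to-all broadcast: every processor gathers all p pieces of x.
    full_x = [[v for piece in local_x for v in piece] for _ in range(p)]
    # Local computation: n/p dot products of length n on each processor.
    y = []
    for i in range(p):
        for row in local_rows[i]:
            y.append(sum(a * v for a, v in zip(row, full_x[i])))
    return y
```

Each simulated processor performs (n/p) · n multiply-add pairs, in line with $\pi_{ncomp} = (n^2/p)\beta$.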
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel matrix-vector multiplication algorithm on a
2-dimensional mesh interconnect with striped partitioning. We first scale the computation steps of the critical path so that the
performance of the parallel algorithm matches the specified performance requirement. Assuming the performance target
to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F,
the new reduced frequency at which all p cores should run is given by:
\[
\begin{aligned}
X &= \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}} \\
  &= \frac{\frac{n^2}{p}\beta}{n^2\beta\frac{1}{F} - \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right)}
\end{aligned}
\]
The total active time at all the cores at the new frequency is given by $p n^2 \beta/F$. Therefore, the expression for the energy
consumption of the parallel matrix-vector multiplication algorithm with striped partitioning as per equation 5.1 is
given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d n^2 \beta X^2 + e_{wh}\, n\,(p - 1) + E_l \frac{p n^2 \beta}{F} X
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F/p$. Thus, the energy consumed by the parallel matrix-vector
multiplication algorithm on a 2-dimensional mesh interconnect (with striped partitioning) with p cores running
at frequency X is given by:
\[
E = E_d n^2 \beta \frac{F^2}{p^2} + e_{wh}\, n\,(p - 1) + E_l n^2 \beta
\tag{7.4}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = \left(\frac{2 E_d \beta F^2 n}{e_{wh}}\right)^{1/3} = O\left(n^{1/3}\right) = O\left(W^{1/6}\right)
\]
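The growth rate of this optimum can be read off numerically: multiplying n by 8 should double $p_{opt}$. The sketch below evaluates Equation 7.4 and the closed form; the parameter values used in the usage note are hypothetical.

```python
def energy_mv_rowwise(p, n, E_d, E_l, e_wh, beta, F):
    """Equation 7.4 evaluated at p cores."""
    return (E_d * n * n * beta * F * F / (p * p)
            + e_wh * n * (p - 1)
            + E_l * n * n * beta)

def p_opt_mv_rowwise(n, E_d, e_wh, beta, F):
    """Closed form (2 E_d beta F^2 n / e_wh)^(1/3)."""
    return (2 * E_d * beta * F * F * n / e_wh) ** (1 / 3)
```

With the illustrative values E_d = 1, e_wh = 0.01, β = 2 and F = 10, the energy at the closed-form optimum is lower than at half or twice that core count, and the ratio of optima for 8n and n is 2, consistent with $O(n^{1/3}) = O(W^{1/6})$.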
Thus, the asymptotic energy scalability under iso-performance of the parallel matrix-vector multiplication algorithm
on a 2-dimensional mesh interconnect with striped partitioning is $O(W^{1/6})$. Note that n should be greater
than p for this asymptotic result to apply.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel matrix-vector multiplication algorithm on a 2-dimensional
mesh interconnect with rowwise partitioning. The total active time ($T_{active}$) at all the cores as a function of the
frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\pi_{ncomp}\frac{1}{X} + \pi_{tcomm}\right) \\
           &= p\left(\frac{n^2}{p}\beta\frac{1}{X} + 2t_s\left(\sqrt{p} - 1\right) + t_w n\right)
\end{aligned}
\]
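The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d n^2 \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^2}{p}\beta + \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right)X\right)
\end{aligned}
\]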
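Here, we fix the energy budget E to be the energy required for the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l\, p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp}\left(E - E_{comm} - E_l\, p\, \pi_{ncomp}\right)}}{2 E_d S_{comp}} \\
  &= \frac{-E_l\, p\left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right) + \sqrt{E_l^2 p^2\left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right)^2 + 4 E_d n^2 \beta\left(E_d n^2 \beta F^2 - e_{wh}\, n\,(p - 1) - E_l n^2 \beta\right)}}{2 E_d n^2 \beta}
\end{aligned}
\]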
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right) + \frac{n^2}{p}\beta\frac{1}{X}
\tag{7.5}
\]
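Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F$. Thus, the time taken by the parallel matrix-vector
multiplication algorithm running on p cores is given by:
\[
\begin{aligned}
\text{Time Taken} &= \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right) + \frac{n^2}{p}\beta\frac{1}{F} \\
                  &= O\left(\sqrt{p} + n + \frac{n^2}{p}\right)
\end{aligned}
\]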
The optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = O\left(n^{4/3}\right) = O\left(W^{2/3}\right)
\]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O(W^{2/3})$. Note that n should be greater
than p for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel matrix-vector multiplication algorithm on a 2-dimensional
mesh interconnect with rowwise partitioning. The expression for the energy consumed by the parallel algorithm as a
function of the frequency of the cores, using the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d n^2 \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^2}{p}\beta + \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right)X\right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p, X) = \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right) + \frac{n^2}{p}\beta\frac{1}{X}
\tag{7.6}
\]
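We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p, X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X) \\
        &= \alpha\left(E_d n^2 \beta X^2 + e_{wh}\, n\,(p - 1)\right) \\
        &\quad + \alpha\, E_l\, p\left(\frac{n^2}{p}\beta + \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right)X\right) \\
        &\quad + \left(2t_s\left(\sqrt{p} - 1\right) + t_w n\right) + \frac{n^2}{p}\beta\frac{1}{X}
\end{aligned}
\]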
Asymptotic Analysis: Note that if $n \gg p$, the cost expression of the parallel matrix-vector multiplication algorithm on a
2-dimensional mesh interconnect with rowwise partitioning running on p cores at frequency X can be approximated
as
\[
C(p, X) \approx O\left(n^2 X^2 + np + n + \frac{n^2}{pX}\right)
\]
The optimal number of cores and frequency required for minimum cost vary with problem size as follows:
\[
\begin{aligned}
p_{opt} &= O\left(n^{3/5}\right) = O\left(W^{3/10}\right) \\
F_{opt} &= O\left(n^{-1/5}\right) = O\left(W^{-1/10}\right)
\end{aligned}
\]
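These exponents can be recovered numerically from the approximated cost $n^2 X^2 + np + n^2/(pX)$ (the p-independent term n is dropped). The sketch below sets all hidden constants to 1, which is our assumption; for fixed p it uses the analytic minimizer $X = (2p)^{-1/3}$ and then performs a ternary search over p, since the resulting cost is convex in p.

```python
from math import log

def best_frequency(p):
    """Minimizer over X of X^2 + 1/(p X), the X-dependent part of the cost."""
    return (2 * p) ** (-1 / 3)

def cost(p, n):
    """n^2 X^2 + n p + n^2/(p X) at the best X for this p (constants set to 1)."""
    X = best_frequency(p)
    return n * n * X * X + n * p + n * n / (p * X)

def best_cores(n, lo=1.0, hi=None):
    """Ternary search for the minimizing p; the cost is convex in p."""
    hi = float(n * n) if hi is None else hi
    for _ in range(200):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if cost(a, n) < cost(b, n):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2

def fitted_exponent(n1, n2):
    """Slope of log p_opt against log n between two problem sizes."""
    return log(best_cores(n2) / best_cores(n1)) / log(n2 / n1)
```

The fitted slope between n = 10^3 and n = 10^6 is close to 3/5, and the corresponding slope for `best_frequency(best_cores(n))` is close to −1/5.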
7.2.2 Checkerboard Partitioning
In this section, we analyze parallel matrix-vector multiplication for the case in which the matrix is distributed among
the processors by using a block-checkerboard partitioning. Consider a 2-dimensional mesh of p processors in which
each processor stores an $(n/\sqrt{p}) \times (n/\sqrt{p})$ block of the matrix. The vector is distributed in portions of $n/\sqrt{p}$
elements in the last column of processors only. The entire vector must be distributed on each row of processors before
the multiplication can be performed. First the vector is aligned along the main diagonal. For this, each processor in the
rightmost column sends its $n/\sqrt{p}$ vector elements to the diagonal processor in its row. Then a columnwise one-to-all
broadcast of these $n/\sqrt{p}$ elements takes place. Each processor then performs $n^2/p$ multiplications and locally adds
the $n/\sqrt{p}$ sets of products. At the end of this step, each processor has $n/\sqrt{p}$ partial sums that must be accumulated
along each row to obtain the result vector. Hence, the last step of the algorithm is a single-node accumulation of the
$n/\sqrt{p}$ values in each row, with the rightmost processor of the row as the destination.
The first step of sending a message of size $n/\sqrt{p}$ from the rightmost processor of a row to the diagonal processor
takes the maximum time of $t_s + t_w n/\sqrt{p} + t_h\sqrt{p}$ for the first row of processors in the mesh. Furthermore, the energy
spent during this step is $(1/2)\, e_{wh}\, n(\sqrt{p} + 1)$. We can perform the columnwise one-to-all broadcast in approximately
$(t_s + t_w n/\sqrt{p})\log(\sqrt{p}) + t_h\sqrt{p}$ time and $(1/4)\, e_{wh}\, n\sqrt{p}\log p$ energy by using the procedure described in section ref-patterns
for a ring with cut-through routing. Ignoring the time to perform additions, the final rowwise single-node
accumulation takes approximately $(t_s + t_w n/\sqrt{p})\log(\sqrt{p}) + t_h\sqrt{p}$ time and $(1/4)\, e_{wh}\, n\sqrt{p}\log p$ energy for
communication. Moreover, each processor performs approximately $n^2/p$ multiplications and additions in the computation
phase. In summary,
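\[
\begin{aligned}
\pi_{ncomp} &= \frac{n^2}{p}\,\beta \\
\pi_{tcomm} &= t_s + t_w\frac{n}{\sqrt{p}} + t_h\sqrt{p} + 2\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log(\sqrt{p}) + t_h\sqrt{p}\right) \\
            &\approx t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p} \\
S_{comp} &= n^2 \beta \\
E_{comm} &= \frac{1}{2}\, e_{wh}\, n\left(\sqrt{p} + 1\right) + \frac{1}{2}\, e_{wh}\, n\sqrt{p}\log p \\
         &\approx \frac{1}{2}\, e_{wh}\, n\sqrt{p}\log p
\end{aligned}
\]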
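The steps of the checkerboard formulation can be sketched as a sequential simulation. The code below assumes that p is a perfect square whose root divides n; the alignment and columnwise broadcast are modelled simply by giving every processor in mesh column j access to segment j of x.

```python
from math import isqrt

def checkerboard_matvec(A, x, p):
    """Multiply A (n x n) by x on a simulated sqrt(p) x sqrt(p) mesh."""
    n = len(A)
    q = isqrt(p)   # mesh side; p is assumed to be a perfect square
    b = n // q     # block side; q is assumed to divide n
    # Processor (i, j) owns block rows i*b..(i+1)*b-1, columns j*b..(j+1)*b-1.
    # After alignment on the diagonal and the columnwise broadcast, every
    # processor in column j holds segment j of x.
    segment = [x[j * b:(j + 1) * b] for j in range(q)]
    y = []
    for i in range(q):
        # Partial sums computed locally by each processor in mesh row i.
        partial = []
        for j in range(q):
            rows = [A[r][j * b:(j + 1) * b] for r in range(i * b, (i + 1) * b)]
            partial.append([sum(a * v for a, v in zip(row, segment[j]))
                            for row in rows])
        # Single-node accumulation along the row: add the q partial vectors.
        y.extend(sum(col) for col in zip(*partial))
    return y
```

Every simulated processor multiplies an $(n/\sqrt{p}) \times (n/\sqrt{p})$ block by a segment of length $n/\sqrt{p}$, giving the $n^2/p$ multiply-add pairs counted in $\pi_{ncomp}$.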
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel matrix-vector multiplication algorithm on a
2-dimensional mesh interconnect with checkerboard partitioning. We first scale the computation steps of the critical
path so that the performance of the parallel algorithm matches the specified performance requirement. Assuming the
performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at
maximum frequency F, the new reduced frequency at which all p cores should run is given by:
\[
\begin{aligned}
X &= \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}} \\
  &= \frac{\frac{n^2}{p}\beta}{n^2\beta\frac{1}{F} - \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right)}
\end{aligned}
\]
The total active time at all the cores at the new frequency is given by $p n^2 \beta/F$. Therefore, the expression for the energy
consumption of the parallel matrix-vector multiplication algorithm with checkerboard partitioning as per equation 5.1
is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d n^2 \beta X^2 + \frac{1}{2}\, e_{wh}\, n\sqrt{p}\log p + E_l\frac{p n^2 \beta}{F} X
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F/p$. Thus, the energy consumed by the parallel matrix-vector
multiplication algorithm on a 2-dimensional mesh interconnect (with checkerboard partitioning) with p cores
running at frequency X is given by:
\[
E = E_d n^2 \beta\frac{F^2}{p^2} + \frac{1}{2}\, e_{wh}\, n\sqrt{p}\log p + E_l n^2 \beta
\tag{7.7}
\]
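The optimal number of cores required for minimum energy consumption is given by
\[
\begin{aligned}
p_{opt} &= \left(\frac{8 E_d \beta F^2 n / e_{wh}}{\frac{2}{5}\log\left(\frac{8 E_d \beta F^2 n}{e_{wh}}\right)}\right)^{2/5} \\
        &= O\left(\left(\frac{n}{\log n}\right)^{2/5}\right) \\
        &= O\left(\frac{W^{1/5}}{(\log W)^{2/5}}\right)
\end{aligned}
\]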
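The expression for $p_{opt}$ is an approximate solution of $p^{5/2}\log p = A$ with $A = 8E_d\beta F^2 n/e_{wh}$, which is what the leading-order stationarity condition of Equation 7.7 reduces to. The sketch below compares the approximation with a bisection root; natural logarithms are assumed here, which only affects constant factors.

```python
from math import log

def p_opt_exact(A, lo=1.0, hi=1e12):
    """Bisection root of p^(5/2) * ln(p) = A for A > 0."""
    f = lambda p: p ** 2.5 * log(p) - A
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def p_opt_approx(A):
    """Approximation (A / ((2/5) ln A))^(2/5) used in the text."""
    return (A / (0.4 * log(A))) ** 0.4
```

For a sample value such as A = 10^10 the two results agree to within a few percent, and the gap narrows slowly as A grows.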
Note that, n should be greater than p for this asymptotic result to apply.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel matrix-vector multiplication algorithm on a 2-dimensional
mesh interconnect with checkerboard partitioning. The total active time ($T_{active}$) at all the cores as a function of the
frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\pi_{ncomp}\frac{1}{X} + \pi_{tcomm}\right) \\
           &= p\left(\frac{n^2}{p}\beta\frac{1}{X} + \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right)\right)
\end{aligned}
\]
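The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d n^2 \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^2}{p}\beta + \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right)X\right)
\end{aligned}
\]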
Here, we fix the energy budget E to be the energy required for the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l\, p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp}\left(E - E_{comm} - E_l\, p\, \pi_{ncomp}\right)}}{2 E_d S_{comp}} \\
  &= \frac{-E_l\, p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d n^2 \beta\left(E_d n^2 \beta F^2 - \frac{1}{2}\, e_{wh}\, n\sqrt{p}\log p - E_l n^2 \beta\right)}}{2 E_d n^2 \beta}
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right) + \frac{n^2}{p}\beta\frac{1}{X}
\tag{7.8}
\]
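Asymptotic Analysis: Note that if $n \gg p\log p$, then $X \approx F$. Thus, the time taken by the parallel matrix-vector
multiplication algorithm running on p cores is given by:
\[
\begin{aligned}
\text{Time Taken} &= \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right) + \frac{n^2}{p}\beta\frac{1}{F} \\
                  &= O\left(\frac{n\log p}{\sqrt{p}} + \sqrt{p} + \frac{n^2}{p}\right) \\
                  &\approx O\left(\sqrt{p} + \frac{n^2}{p}\right)
\end{aligned}
\]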
The optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = O\left(n^{4/3}\right) = O\left(W^{2/3}\right)
\]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O(W^{2/3})$. Note that n should be greater
than $p\log p$ for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel matrix-vector multiplication algorithm on a 2-dimensional
mesh interconnect with checkerboard partitioning. The expression for the energy consumed by the parallel algorithm as a
function of the frequency of the cores, using the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d n^2 \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^2}{p}\beta + \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right)X\right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p, X) = \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right) + \frac{n^2}{p}\beta\frac{1}{X}
\tag{7.9}
\]
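We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p, X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X) \\
        &= \alpha\left(E_d n^2 \beta X^2 + \frac{1}{2}\, e_{wh}\, n\sqrt{p}\log p\right) \\
        &\quad + \alpha\, E_l\, p\left(\frac{n^2}{p}\beta + \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right)X\right) \\
        &\quad + \left(t_s\log p + t_w\frac{n}{\sqrt{p}}\log p + 3t_h\sqrt{p}\right) + \frac{n^2}{p}\beta\frac{1}{X}
\end{aligned}
\]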
Asymptotic Analysis: Note that if $n \gg p\log p$, the cost expression of the parallel matrix-vector multiplication
algorithm on a 2-dimensional mesh interconnect with checkerboard partitioning running on p cores at frequency X can
be approximated as
\[
C(p, X) \approx O\left(n^2 X^2 + n\sqrt{p}\log p + \frac{n^2}{pX}\right)
\]
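The optimal number of cores and frequency required for minimum cost vary with problem size as follows:
\[
\begin{aligned}
p_{opt} &= O\left(\left(\frac{n}{\log n}\right)^{6/7}\right) = O\left(\frac{W^{3/7}}{(\log W)^{6/7}}\right) \\
F_{opt} &= O\left(\left(\frac{n}{\log n}\right)^{-2/7}\right) = O\left(\frac{W^{-1/7}}{(\log W)^{-2/7}}\right)
\end{aligned}
\]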
7.3 Matrix Multiplication
In this section we analyze parallel algorithms for multiplying two n × n dense, square matrices A and B to yield the
product matrix C = A × B. The sequential algorithm for matrix multiplication requires $n^3$ multiplication and addition
pairs. Thus, the problem size $W = O(n^3)$.
Consider two n × n matrices A and B partitioned into p blocks $A_{i,j}$ and $B_{i,j}$ ($0 \le i, j < \sqrt{p}$) of size $(n/\sqrt{p}) \times (n/\sqrt{p})$
each. These blocks are mapped onto a $\sqrt{p} \times \sqrt{p}$ mesh of processors. The processors are labeled from $P_{0,0}$ to
$P_{\sqrt{p}-1,\sqrt{p}-1}$. Processor $P_{i,j}$ initially stores $A_{i,j}$ and $B_{i,j}$ and computes block $C_{i,j}$ of the result matrix. Computing
the submatrix $C_{i,j}$ requires all submatrices $A_{i,k}$ and $B_{k,j}$ for $0 \le k < \sqrt{p}$. To acquire all the required blocks, an
all-to-all broadcast of matrix A's blocks is performed in each row of processors, and an all-to-all broadcast of matrix
B's blocks is performed in each column. After $P_{i,j}$ acquires $A_{i,0}, A_{i,1}, \ldots, A_{i,\sqrt{p}-1}$ and $B_{0,j}, B_{1,j}, \ldots, B_{\sqrt{p}-1,j}$, it
performs the submatrix multiplication and addition step. Time and energy spent during the communication phase are
$2(t_s + t_w n^2/p)\sqrt{p}$ and $2e_{wh} n^2(\sqrt{p} - 1)$ respectively. In summary,
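\[
\begin{aligned}
\pi_{ncomp} &= \frac{n^3}{p}\,\beta \\
\pi_{tcomm} &= 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} \\
S_{comp} &= n^3 \beta \\
E_{comm} &= 2 e_{wh} n^2\left(\sqrt{p} - 1\right)
\end{aligned}
\]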
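The block formulation can be sketched as a sequential simulation in which each mesh position (i, j) gathers a row of A blocks and a column of B blocks and accumulates their products. The code assumes that p is a perfect square whose root divides n; the helper names are ours.

```python
from math import isqrt

def block(M, i, j, b):
    """Return the b x b block (i, j) of matrix M."""
    return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

def block_multiply_add(C, A, B):
    """C += A * B for square blocks stored as lists of lists."""
    b = len(A)
    for r in range(b):
        for c in range(b):
            C[r][c] += sum(A[r][k] * B[k][c] for k in range(b))

def mesh_matmul(A, B, p):
    """Multiply n x n matrices on a simulated sqrt(p) x sqrt(p) mesh."""
    n = len(A)
    q = isqrt(p)   # mesh side; p is assumed to be a perfect square
    b = n // q     # block side; q is assumed to divide n
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        # All-to-all broadcast in row i: every P(i, j) obtains A(i, 0..q-1).
        a_row = [block(A, i, k, b) for k in range(q)]
        for j in range(q):
            # All-to-all broadcast in column j: P(i, j) obtains B(0..q-1, j).
            b_col = [block(B, k, j, b) for k in range(q)]
            c_ij = [[0] * b for _ in range(b)]
            for k in range(q):
                block_multiply_add(c_ij, a_row[k], b_col[k])
            for r in range(b):
                C[i * b + r][j * b:(j + 1) * b] = c_ij[r]
    return C
```

Each simulated processor performs $\sqrt{p}$ block products of $(n/\sqrt{p})^3$ multiply-add pairs, that is $n^3/p$ in total, matching $\pi_{ncomp}$.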
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel matrix multiplication algorithm on a
2-dimensional mesh interconnect with checkerboard partitioning. We first scale the computation steps of the critical path so
that the performance of the parallel algorithm matches the specified performance requirement. Assuming the
performance target to be the time taken by the best sequential algorithm on a single core (processor) operating at maximum
frequency F, the new reduced frequency at which all p cores should run is given by:
\[
\begin{aligned}
X &= \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}} \\
  &= \frac{\frac{n^3}{p}\beta}{n^3\beta\frac{1}{F} - 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}}
\end{aligned}
\]
The total active time at all the cores at the new frequency is given by $p n^3 \beta/F$. Therefore, the expression for the energy
consumption of the parallel matrix multiplication algorithm with checkerboard partitioning as per equation 5.1 is
given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d n^3 \beta X^2 + 2 e_{wh} n^2\left(\sqrt{p} - 1\right) + E_l\frac{p n^3 \beta}{F} X
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F/p$. Thus, the energy consumed by the parallel matrix
multiplication algorithm on a 2-dimensional mesh interconnect (with checkerboard partitioning) with p cores
running at frequency X is given by:
\[
E = E_d n^3 \beta\frac{F^2}{p^2} + 2 e_{wh} n^2\left(\sqrt{p} - 1\right) + E_l n^3 \beta
\tag{7.10}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = \left(\frac{2 E_d \beta F^2 n}{e_{wh}}\right)^{2/5} = O\left(n^{2/5}\right) = O\left(W^{2/15}\right)
\]
Thus, the asymptotic energy scalability under iso-performance of the parallel matrix multiplication algorithm on a
2-dimensional mesh interconnect with checkerboard partitioning is $O(W^{2/15})$. Note that n should be greater
than p for this asymptotic result to apply.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel matrix multiplication algorithm on a 2-dimensional
mesh interconnect with checkerboard partitioning. The total active time ($T_{active}$) at all the cores as a function of the
frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\pi_{ncomp}\frac{1}{X} + \pi_{tcomm}\right) \\
           &= p\left(\frac{n^3}{p}\beta\frac{1}{X} + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\right)
\end{aligned}
\]
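The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d n^3 \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^3}{p}\beta + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\,X\right)
\end{aligned}
\]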
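Here, we fix the energy budget E to be the energy required for the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l\, p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp}\left(E - E_{comm} - E_l\, p\, \pi_{ncomp}\right)}}{2 E_d S_{comp}} \\
  &= \frac{-2 E_l\, p\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \sqrt{4 E_l^2 p^3\left(t_s + t_w\frac{n^2}{p}\right)^2 + 4 E_d n^3 \beta\left(E_d n^3 \beta F^2 - 2 e_{wh} n^2\left(\sqrt{p} - 1\right) - E_l n^3 \beta\right)}}{2 E_d n^3 \beta}
\end{aligned}
\]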
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \frac{n^3}{p}\beta\frac{1}{X}
\tag{7.11}
\]
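Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F$. Thus, the time taken by the parallel matrix multiplication
algorithm running on p cores is given by:
\[
\begin{aligned}
\text{Time Taken} &= 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \frac{n^3}{p}\beta\frac{1}{F} \\
                  &= O\left(\frac{n^2}{\sqrt{p}} + \sqrt{p} + \frac{n^3}{p}\right)
\end{aligned}
\]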
The optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = O\left(n^2\right) = O\left(W^{2/3}\right)
\]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O(W^{2/3})$. Note that n should be greater
than p for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel matrix multiplication algorithm on a 2-dimensional mesh
interconnect with checkerboard partitioning. The expression for the energy consumed by the parallel algorithm as a
function of the frequency of the cores, using the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d n^3 \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
         &= E_l\, p\left(\frac{n^3}{p}\beta + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\,X\right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p, X) = 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \frac{n^3}{p}\beta\frac{1}{X}
\tag{7.12}
\]
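We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p, X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X) \\
        &= \alpha\left(E_d n^3 \beta X^2 + 2 e_{wh} n^2\left(\sqrt{p} - 1\right)\right) \\
        &\quad + \alpha\, E_l\, p\left(\frac{n^3}{p}\beta + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p}\,X\right) \\
        &\quad + 2\left(t_s + t_w\frac{n^2}{p}\right)\sqrt{p} + \frac{n^3}{p}\beta\frac{1}{X}
\end{aligned}
\]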
Asymptotic Analysis: If $n \gg p$, the cost expression of the parallel matrix multiplication algorithm on a 2-dimensional
mesh interconnect with checkerboard partitioning running on p cores at frequency X can be approximated as
\[
C(p, X) \approx O\left(n^3 X^2 + n^2\sqrt{p} + \frac{n^3}{pX}\right)
\]
The optimal number of cores and frequency required for minimum cost vary with problem size as follows:
\[
\begin{aligned}
p_{opt} &= O\left(n^{6/7}\right) = O\left(W^{2/7}\right) \\
F_{opt} &= O\left(n^{-2/7}\right) = O\left(W^{-2/21}\right)
\end{aligned}
\]
7.4 Summary
The results of this chapter are summarized in Table 7.1.

| Parallel Algorithm | Energy scalability under iso-performance | Energy bounded scalability | Utility based scalability |
|---|---|---|---|
| Matrix Transpose (checkerboard) | $O(1)$ | $O(1)$ | $O(1)$ |
| Matrix Transpose (rowwise) | $O(1)$ | $O(1)$ | $O(1)$ |
| Matrix Vector Multiplication (checkerboard) | $O\left(W^{1/5}/(\log W)^{2/5}\right)$ | $O\left(W^{2/3}\right)$ | $O\left(W^{3/7}/(\log W)^{6/7}\right)$ |
| Matrix Vector Multiplication (rowwise) | $O\left(W^{1/6}\right)$ | $O\left(W^{2/3}\right)$ | $O\left(W^{3/10}\right)$ |
| Matrix Multiplication | $O\left(W^{2/15}\right)$ | $O\left(W^{2/3}\right)$ | $O\left(W^{2/7}\right)$ |

Table 7.1: Scalability metrics of dense matrix parallel algorithms on 2D mesh interconnect
Chapter 8
Sorting
Sorting is one of the most common operations performed by a computer. Because sorted data are easier to manipulate
than randomly-ordered data, many algorithms require sorted data. Sorting is of additional importance to parallel
computing because of its close relation to the task of routing data among processes, which is an essential part of many
parallel algorithms.
Sorting is defined as the task of arranging an unordered collection of elements into monotonically increasing
(or decreasing) order. Specifically, let $S = \langle a_1, a_2, \ldots, a_n \rangle$ be a sequence of n elements in arbitrary order;
sorting transforms S into a monotonically increasing sequence $S' = \langle a'_1, a'_2, \ldots, a'_n \rangle$ such that $a'_i \le a'_j$ for
$1 \le i \le j \le n$, and $S'$ is a permutation of S.
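The two conditions in this definition (monotone order and being a permutation of the input) translate directly into a small checker, which is useful for validating any of the sorting formulations discussed in this chapter. The function name is ours.

```python
from collections import Counter

def is_valid_sort(S, S_sorted):
    """True if S_sorted is a nondecreasing permutation of S."""
    ordered = all(a <= b for a, b in zip(S_sorted, S_sorted[1:]))
    return ordered and Counter(S) == Counter(S_sorted)
```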
Sorting algorithms can be categorized as comparison-based and noncomparison-based. A comparison-based al-
gorithm sorts an unordered sequence of elements by repeatedly comparing pairs of elements and, if they are out of
order, exchanging them. This fundamental operation of comparison-based sorting is called compare-exchange. The
lower bound on the sequential complexity of any comparison-based sorting algorithm is $\Theta(n \log n)$, where n is
the number of elements to be sorted. Noncomparison-based algorithms sort by using certain known properties of the
elements (such as their binary representation or their distribution). The lower bound complexity of these algorithms is
$\Theta(n)$. In this chapter, we analyze both types of sorting algorithms for our scalability metrics.
8.1 Sorting Networks: Bitonic Sort
Sorting networks are based on a comparison network model, in which many comparison operations are performed
simultaneously. The key component of these networks is a comparator. A comparator is a device with two inputs x
and y and two outputs x′ and y′. For an increasing comparator, x′ = min{x, y} and y′ = max{x, y}; for a decreasing
comparator x′ = max{x, y} and y′ = min{x, y}. A sorting network is usually made up of a series of columns,
and each column contains a number of comparators connected in parallel. Each column of comparators performs a
permutation, and the output obtained from the final column is sorted in increasing or decreasing order. The depth of
a network is the number of columns it contains. Since the speed of a comparator is a technology-dependent constant,
the speed of the network is proportional to its depth.
A bitonic sorting network sorts n elements in $\Theta(\log^2 n)$ time. The key operation of the bitonic sorting network
is the rearrangement of a bitonic sequence into a sorted sequence. A bitonic sequence is a sequence of elements
$\langle a_0, a_1, \ldots, a_{n-1} \rangle$ with the property that either (1) there exists an index i, $0 \le i \le n - 1$, such that $\langle a_0, \ldots, a_i \rangle$
is monotonically increasing and $\langle a_{i+1}, \ldots, a_{n-1} \rangle$ is monotonically decreasing, or (2) there exists a cyclic shift of
indices so that (1) is satisfied. A bitonic sequence is rearranged
into a sorted sequence using an operation called bitonic split. During the split, a bitonic sequence of size n is split
into two smaller bitonic sequences. We can recursively obtain shorter bitonic sequences using the bitonic split operation
on each of the bitonic subsequences until we obtain subsequences of size one. At that point, the output is sorted
in monotonically increasing order. Since the size of the problem is halved after each bitonic split operation, the
number of splits required to rearrange the bitonic sequence into a sorted sequence is log n. The procedure of sorting
a bitonic sequence using bitonic splits is called bitonic merge.
The task of sorting n unordered elements can be
accomplished by repeatedly merging bitonic sequences of increasing length, since any unsorted sequence of elements is
a concatenation of bitonic sequences of size two. Each stage of the bitonic sorting network merges adjacent bitonic
sequences in increasing and decreasing order. According to the definition of a bitonic sequence, the sequence obtained
by concatenating the increasing and decreasing sequences is bitonic. Hence, the output of each stage in the bitonic
sorting network is a concatenation of bitonic sequences that are twice as long as those at the input. By merging larger
and larger bitonic sequences, we eventually obtain a bitonic sequence of size n. Merging this sequence sorts the input.
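The split and merge operations described above can be sketched as a short sequential program. This is a minimal illustration, assuming the input length is a power of two; each list comprehension in `bitonic_merge` corresponds to one column of comparators in the network.

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    # Bitonic split: compare-exchange element i with element i + half.
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)

def bitonic_sort(seq, ascending=True):
    """Sort a sequence whose length is a power of two."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    # An increasing run followed by a decreasing run is bitonic.
    first = bitonic_sort(seq[:half], True)
    second = bitonic_sort(seq[half:], False)
    return bitonic_merge(first + second, ascending)
```

The recursion depth of `bitonic_merge` on n elements is log n, and `bitonic_sort` invokes merges of sizes 2, 4, ..., n, which gives the (1 + log n)(log n)/2 comparator columns of the network.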
The sorting network can be adapted and used as a sorting algorithm for parallel computers. We now describe how
this can be done on mesh-connected processors and analyze it for our scalability metrics. Here, we assume that the
input wires of the bitonic sorting network are mapped onto a p-processor mesh using row-major shuffled mapping [61].
The advantage of row-major shuffled mapping is that processes that perform compare-exchange operations reside on
square subsections of the mesh whose size is inversely related to the frequency of compare-exchanges. In general,
wires that differ in the ith least-significant bit are mapped onto mesh processes that are $2^{\lfloor (i-1)/2 \rfloor}$ communication links
away. Each process is assigned a block of n/p elements and cooperates with the other processes to sort them.
Initially, the n/p elements assigned to each processor are sorted locally using a fast sequential algorithm. Then, the
problem reduces to sorting the p blocks using the sorting network. Since the total number of blocks is p, the bitonic
sort algorithm has a total of (1 + log p)(log p)/2 steps. During these steps, processes that are a certain distance apart
compare-exchange their elements. The distance between processes determines the communication overhead of the
parallel formulation. The total amount of communication performed by each process is
\[
\sum_{i=1}^{\log p} \sum_{j=1}^{i} 2^{\lfloor (j-1)/2 \rfloor} \approx 7\sqrt{p}.
\]
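The $7\sqrt{p}$ approximation can be checked numerically. The sketch below evaluates the double sum directly; the helper names are illustrative, and the closed form in `closed_form` is a derivation made here for the case where p is a power of four, not a formula taken from the source.

```python
import math


def hops_per_process(p):
    """Double sum of 2**floor((j-1)/2) over 1 <= j <= i <= log2(p); p a power of two."""
    d = p.bit_length() - 1  # log2(p) for a power of two
    return sum(2 ** ((j - 1) // 2)
               for i in range(1, d + 1)
               for j in range(1, i + 1))


def closed_form(p):
    """Assumed closed form 7*sqrt(p) - 7 - 2*log2(p), valid when p is a power of four."""
    return 7 * math.isqrt(p) - 7 - 2 * (p.bit_length() - 1)
```

The lower-order terms explain why the sum stays slightly below $7\sqrt{p}$ for moderate p.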
During each step of the algorithm, each processor performs at most one comparison step (over n/p elements); thus, the total
computation performed by each processor is $(n/p)\log^2 p$. Since the computation and communication at the processors
are well balanced, the total number of comparisons is $k_s n \log(n/p) + n \log^2 p$, where $k_s$ is the sequential sort
constant, and the total energy spent on communication is $7 e_{wh} n \sqrt{p}$. In summary,
\[
\begin{aligned}
\pi_{ncomp} &= \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta \\
\pi_{tcomm} &= \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p} \\
S_{comp} &= \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta \\
E_{comm} &= 7 e_{wh} n \sqrt{p}
\end{aligned}
\]
where β is the number of cycles required for a single comparison.
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel bitonic sort algorithm on a 2-dimensional
mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm
matches the specified performance requirement. Assuming the performance target to be the time taken by the best
sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at
which all p cores should run is given by:
\[
X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}}
= \frac{\left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta}{n \log n \, \beta \frac{1}{F} - \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p}}
\]
The total active time at all the cores at the new frequency is given by $p\, n \log n \, \beta / F$. Therefore, the expression for the energy
consumption of the parallel bitonic sort algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
&= E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta X^2 + 7 e_{wh} n \sqrt{p} + E_l \beta\, p\, n \log n \, \frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p^{1+\log p}$, then $X \approx F/p$. Thus, the energy consumed by the parallel bitonic
sort algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X is given by:
\[
E = E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta \frac{F^2}{p^2} + 7 e_{wh} n \sqrt{p} + E_l n \log n \, \beta \tag{8.1}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = O\left( (\log n)^{2/5} \right) = O\left( (\log W)^{2/5} \right)
\]
Thus, the asymptotic energy scalability under iso-performance of the parallel bitonic sort algorithm on a 2-dimensional
mesh interconnect is $O((\log W)^{2/5})$. Note that n should be greater than $p^{1+\log p}$ for this asymptotic result to apply.
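A sketch of where the exponent 2/5 comes from, assuming n is large enough that the computation term is dominated by $k_s n \log n$ and treating $E_d$, $e_{wh}$, $k_s$, β and F as constants. The symbols A and B are shorthand introduced here, not notation from the energy model.

```latex
\[
\begin{aligned}
E(p) &\approx \frac{A}{p^{2}} + B\sqrt{p} + \text{const},
  \qquad A = E_d k_s \beta F^{2}\, n \log n, \quad B = 7 e_{wh}\, n \\
\frac{dE}{dp} &= -\frac{2A}{p^{3}} + \frac{B}{2\sqrt{p}} = 0
  \;\Longrightarrow\; p^{5/2} = \frac{4A}{B} \\
p_{opt} &= \left( \frac{4A}{B} \right)^{2/5} = O\left( (\log n)^{2/5} \right)
\end{aligned}
\]
```

The factor n cancels in A/B, which is why only $\log n$ survives in the result.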
Energy Bounded Scalability
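We now evaluate the energy bounded scalability of the parallel bitonic sort algorithm on a 2-dimensional mesh
interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p \left( \pi_{ncomp} \frac{1}{X} + \pi_{tcomm} \right) \\
&= p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta \frac{1}{X} + \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p} \right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta + \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p}\, X \right)
\end{aligned}
\]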
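Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp} \left( E - E_{comm} - E_l p\, \pi_{ncomp} \right)}}{2 E_d S_{comp}} \\
&\approx \frac{-E_l p \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p}}{2 E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta} \\
&\quad + \frac{\sqrt{49 E_l^2 p^3 \left( t_s + t_w \frac{n}{p} \right)^2 + 4 E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta \left( E_d n \log n \, \beta F^2 - 7 e_{wh} n \sqrt{p} \right)}}{2 E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta}
\end{aligned}
\]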
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta \frac{1}{X} \tag{8.2}
\]
Asymptotic Analysis: Note that if $n \gg 2^{\sqrt{p}}$, then $X \approx F$. Thus, the time taken by the parallel bitonic sort algorithm
running on p cores is given by:
\[
\text{Time Taken} = \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta \frac{1}{F}
\]
The optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = O\left( (\log n)^2 \right) = O\left( (\log W)^2 \right)
\]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O((\log W)^2)$. Note that n should be
exponentially greater than p for this asymptotic result to apply.
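The frequency under an energy budget is the positive root of a quadratic in X, obtained by setting $E = E_{comp} + E_{comm} + E_{leak}$ with $E_{comp} = E_d S_{comp} X^2$ and $E_{leak} = E_l\, p\,(\pi_{ncomp} + \pi_{tcomm} X)$. The sketch below solves it numerically; the function names and the parameter values used to exercise it are illustrative assumptions, not measured constants.

```python
import math


def frequency_under_budget(E, E_d, E_l, p, pi_ncomp, pi_tcomm, S_comp, E_comm):
    """Positive root of E_d*S_comp*X^2 + E_l*p*pi_tcomm*X + (E_l*p*pi_ncomp + E_comm - E) = 0."""
    a = E_d * S_comp
    b = E_l * p * pi_tcomm
    c = E_l * p * pi_ncomp + E_comm - E
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("energy budget is too small for any real frequency")
    return (-b + math.sqrt(disc)) / (2.0 * a)


def total_energy(X, E_d, E_l, p, pi_ncomp, pi_tcomm, S_comp, E_comm):
    """Energy model used in this section: computation + communication + leakage."""
    e_comp = E_d * S_comp * X ** 2
    e_leak = E_l * p * (pi_ncomp + pi_tcomm * X)
    return e_comp + E_comm + e_leak
```

Substituting the returned X back into `total_energy` should reproduce the budget E, which is a quick consistency check on the closed-form expression for X.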
Utility Based Scalability
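We now evaluate the utility based scalability of the parallel bitonic sort algorithm on a 2-dimensional mesh interconnect.
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the
energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta + \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p}\, X \right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p,X) = \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta \frac{1}{X} \tag{8.3}
\]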
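We now frame an expression for the cost function C(p,X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p,X) &= \alpha \left( E_{comp} + E_{comm} + E_{leak} \right) + T(p,X) \\
&= \alpha \left( E_d \left( k_s n \log\left(\frac{n}{p}\right) + n \log^2 p \right) \beta X^2 + 7 e_{wh} n \sqrt{p} \right) \\
&\quad + \alpha\, E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta + \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p}\, X \right) \\
&\quad + \left( t_s + t_w \frac{n}{p} \right) 7\sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log^2 p \right) \beta \frac{1}{X}
\end{aligned}
\]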
Asymptotic Analysis: If $n \gg p^{\log p}$, the cost expression of the parallel bitonic sort algorithm on a 2-dimensional mesh
interconnect with p cores at frequency X can be approximated as
\[
C(p,X) \approx O\left( n \log n \, X^2 + n\sqrt{p} + n \log n \, \frac{1}{pX} \right)
\]
The optimal number of cores and the optimal frequency required for minimum cost vary with problem size as follows:
\[
\begin{aligned}
p_{opt} &= O\left( (\log n)^{6/7} \right) = O\left( (\log W)^{6/7} \right) \\
F_{opt} &= O\left( (\log n)^{-2/7} \right) = O\left( (\log W)^{-2/7} \right)
\end{aligned}
\]
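A sketch of how the exponents 6/7 and −2/7 follow from the approximate cost, assuming all constant factors are dropped. The shorthand $a = \log n$ is introduced here for brevity.

```latex
\[
\begin{aligned}
\frac{C(p,X)}{n} &\approx a X^{2} + \sqrt{p} + \frac{a}{pX} \\
\frac{\partial C}{\partial X} = 0 &\;\Longrightarrow\; 2aX = \frac{a}{pX^{2}}
  \;\Longrightarrow\; X^{3} = \frac{1}{2p} \\
\frac{\partial C}{\partial p} = 0 &\;\Longrightarrow\; \frac{1}{2\sqrt{p}} = \frac{a}{p^{2}X}
  \;\Longrightarrow\; p^{3/2} X = 2a \\
X \propto p^{-1/3} &\;\Longrightarrow\; p^{7/6} \propto a
  \;\Longrightarrow\; p_{opt} = O\left( a^{6/7} \right), \qquad
  F_{opt} \propto p_{opt}^{-1/3} = O\left( a^{-2/7} \right)
\end{aligned}
\]
```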
8.2 Odd-Even Transposition Sort
The odd-even transposition algorithm sorts n elements in n phases (n is even), each of which requires n/2 compare-
exchange operations. This algorithm alternates between two phases, called the odd and even phases. Let < a1, a2,
..., an > be the sequence to be sorted. During the odd phase, elements with odd indices are compared with their right
neighbors, and if they are out of sequence they are exchanged; thus, the pairs (a1, a2), (a3, a4), ..., (an−1, an) are
compare-exchanged (assuming n is even). Similarly, during the even phase, elements with even indices are compared
with their right neighbors, and if they are out of sequence they are exchanged; thus, the pairs (a2, a3), (a4, a5), ...,
(an−2, an−1) are compare-exchanged. After n phases of odd-even exchanges, the sequence is sorted. Each phase of
the algorithm (either odd or even) requires θ(n) comparisons, and there are a total of n phases; thus, the sequential
complexity is $\theta(n^2)$.
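The alternating phases can be written out directly. The following is a minimal sequential sketch of the algorithm as described above (0-based indices, so the "odd" phase starts at index 0); the function name is illustrative.

```python
def odd_even_transposition_sort(values):
    """Sort by n alternating odd/even compare-exchange phases."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Odd phase (1-based pairs (a1,a2), (a3,a4), ...) starts at index 0;
        # even phase (pairs (a2,a3), (a4,a5), ...) starts at index 1.
        start = 0 if phase % 2 == 0 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

Each phase touches about n/2 disjoint pairs, which is what makes the phases straightforward to run in parallel.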
It is easy to parallelize odd-even transposition sort. During each phase of the algorithm, compare-exchange
operations on pairs of elements are performed simultaneously. Consider the one-element-per-process case. Let n be the
number of processes (also the number of elements to be sorted). Assume that the processes are arranged in a
one-dimensional array. Element $a_i$ initially resides on process $P_i$ for i = 1, 2, ..., n. During the odd phase, each process
that has an odd label compare-exchanges its element with the element residing on its right neighbor. Similarly, during
the even phase, each process with an even label compare-exchanges its element with the element of its right neighbor.
During each phase of the algorithm, the odd or even processes perform a compare-exchange step with their right
neighbors. A total of n such phases are performed; thus, the parallel run time of this formulation is θ(n).
In general, we use p processors such that p ≤ n. Initially, each process is assigned a block of n/p elements,
which it sorts internally in $k_s (n/p) \log(n/p)$ comparisons. After this, the processes execute p phases (p/2 odd and
p/2 even), performing compare-split operations. At the end of these phases, the list is sorted. During each phase,
n/p comparisons are performed to merge two blocks, and time $t_s + t_w (n/p)$ is spent communicating. In
summary,
\[
\begin{aligned}
\pi_{ncomp} &= \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta \\
\pi_{tcomm} &= \left( t_s + t_w \frac{n}{p} \right) p \\
S_{comp} &= \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta \\
E_{comm} &= e_{wh}\, n p
\end{aligned}
\]
where β is the number of cycles required for a single comparison.
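The block-based formulation with compare-split operations can be simulated sequentially as follows. This is a sketch under the assumption that p divides n; the names `compare_split` and `block_odd_even_sort` are illustrative, and each list in `blocks` stands in for the local memory of one process.

```python
import heapq


def compare_split(left, right):
    """Merge two sorted blocks; the left process keeps the smaller half."""
    merged = list(heapq.merge(left, right))
    return merged[:len(left)], merged[len(left):]


def block_odd_even_sort(values, p):
    """Simulate odd-even transposition sort with p processes and n/p elements each."""
    n = len(values)
    if p <= 0 or n % p != 0:
        raise ValueError("this sketch assumes p divides n")
    size = n // p
    # Local sort of each block (k_s (n/p) log(n/p) comparisons in the analysis).
    blocks = [sorted(values[i * size:(i + 1) * size]) for i in range(p)]
    for phase in range(p):
        start = phase % 2
        for i in range(start, p - 1, 2):
            blocks[i], blocks[i + 1] = compare_split(blocks[i], blocks[i + 1])
    return [x for block in blocks for x in block]
```

Each of the p phases merges two blocks of n/p elements per process pair, matching the n comparisons on the critical path in $\pi_{ncomp}$.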
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel odd-even transposition sort algorithm on a
ring interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm
matches the specified performance requirement. Assuming the performance target to be the time taken by the best
sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at
which all p cores should run is given by:
\[
X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}}
= \frac{\left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta}{n \log n \, \beta \frac{1}{F} - \left( t_s + t_w \frac{n}{p} \right) p}
\]
The total active time at all the cores at the new frequency is given by $p\, n \log n \, \beta / F$. Therefore, the expression for the energy
consumption of the parallel odd-even transposition sort algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
&= E_d \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta X^2 + e_{wh}\, n p + E_l\, p\, n \log n \, \beta \frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg 2^{p}$, then $X \approx F/p$. Thus, the energy consumed by the parallel odd-even
transposition sort algorithm on a ring interconnect with p cores running at frequency X is given by:
\[
E = E_d \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta \frac{F^2}{p^2} + e_{wh}\, n p + E_l n \log n \, \beta \tag{8.4}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = O\left( (\log W)^{1/3} \right)
\]
Thus, the asymptotic energy scalability under iso-performance of the parallel odd-even transposition sort algorithm
on a ring interconnect is $O((\log W)^{1/3})$. Note that n should be exponentially greater than p for this asymptotic result
to apply.
Energy Bounded Scalability
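We now evaluate the energy bounded scalability of the parallel odd-even transposition sort algorithm on a ring
interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p \left( \pi_{ncomp} \frac{1}{X} + \pi_{tcomm} \right) \\
&= p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta \frac{1}{X} + \left( t_s + t_w \frac{n}{p} \right) p \right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta + \left( t_s + t_w \frac{n}{p} \right) p X \right)
\end{aligned}
\]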
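Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp} \left( E - E_{comm} - E_l p\, \pi_{ncomp} \right)}}{2 E_d S_{comp}} \\
&\approx \frac{-E_l p \left( t_s + t_w \frac{n}{p} \right) p + \sqrt{E_l^2 p^4 \left( t_s + t_w \frac{n}{p} \right)^2 + 4 E_d \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta \left( E_d n \log n \, \beta F^2 - e_{wh}\, n p \right)}}{2 E_d \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta}
\end{aligned}
\]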
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = \left( t_s + t_w \frac{n}{p} \right) p + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta \frac{1}{X} \tag{8.5}
\]
Asymptotic Analysis: Note that if $n > 2^{p}$, then $X \approx F$. Thus, the time taken by the parallel odd-even transposition
sort algorithm running on p cores is given by:
\[
\text{Time Taken} = \left( t_s + t_w \frac{n}{p} \right) p + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta \frac{1}{F}
\]
The optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = O\left( \log W \right)
\]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O(\log W)$. Note that, again, n should be
exponentially greater than p for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel odd-even transposition sort algorithm on a ring interconnect.
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the
energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta + \left( t_s + t_w \frac{n}{p} \right) p X \right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p,X) = \left( t_s + t_w \frac{n}{p} \right) p + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta \frac{1}{X} \tag{8.6}
\]
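We now frame an expression for the cost function C(p,X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p,X) &= \alpha \left( E_{comp} + E_{comm} + E_{leak} \right) + T(p,X) \\
&= \alpha \left( E_d \left( k_s n \log\left(\frac{n}{p}\right) + np \right) \beta X^2 + e_{wh}\, n p \right) \\
&\quad + \alpha\, E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta + \left( t_s + t_w \frac{n}{p} \right) p X \right) \\
&\quad + \left( t_s + t_w \frac{n}{p} \right) p + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + n \right) \beta \frac{1}{X}
\end{aligned}
\]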
Asymptotic Analysis: If $n \gg 2^{p}$, the cost expression of the parallel odd-even transposition sort algorithm on a ring
interconnect with p cores at frequency X can be approximated as
\[
C(p,X) \approx O\left( n \log n \, X^2 + np + n \log n \, \frac{1}{pX} \right)
\]
The optimal number of cores and the optimal frequency required for minimum cost vary with input size as follows:
\[
\begin{aligned}
p_{opt} &= O\left( (\log n)^{3/5} \right) = O\left( (\log W)^{3/5} \right) \\
F_{opt} &= O\left( (\log n)^{-1/5} \right) = O\left( (\log W)^{-1/5} \right)
\end{aligned}
\]
8.3 Quick Sort
Recall that in the quicksort algorithm, an array is partitioned into two parts based on a pivot, and each part is solved
recursively. The sequential algorithm for this problem takes $k_s \cdot n \cdot \log(n)$ comparisons to sort n numbers, where $k_s \approx 1.4$
is the quicksort constant. The running time of the sequential algorithm is given by $T_{seq} = \beta \cdot k_s \cdot n \cdot \log(n) \cdot (1/F)$,
where β denotes the number of CPU cycles required for a single comparison.
8.3.1 Naïve Parallel Quicksort
In the naïve parallel version, an input array is partitioned into two parts by a single core (based on a pivot), and then
one of the sub-arrays is assigned to another core. Each of the cores then partitions the array it is working on using
the same approach, and assigns one of its sub-arrays to another core. This process continues until all the
available cores are used up. After the partitioning phase, in the average case, each core holds an approximately
equal share of the elements of the input array. Finally, all the cores sort their arrays in parallel, using the serial
quicksort algorithm on each core. The sorted input array can be recovered by traversing the cores. The naïve parallel
quicksort algorithm is very inefficient (in terms of performance), because the initial partitioning of the array into two sub-arrays is done
by a single core, which means that the execution time of the naïve parallel quicksort algorithm is bounded from below
by the length of the input array.
Assume that the input array has n elements and that the number of cores available for sorting is p. Without loss of
generality, we assume both n and p are powers of two, so that $n = 2^a$ for some a and $p = 2^b$ for some b. For
simplicity, we also assume that during the partitioning step, each core partitions the array into two equal sub-arrays
by choosing the appropriate pivot (the usual average case analysis). The critical path of this parallel algorithm is the
execution done by the core which initiates the partitioning of the input array. The total number of computation steps in
the critical path is $2n(1 - 1/p) + k_s (n/p) \log(n/p)$. The total number of communication steps and communication
messages (words) in the critical path are, respectively, $\log p$ and $n\sqrt{p}$. The total energy spent on message transfers for
this parallel algorithm running on p cores is $e_{wh} \sqrt{p} \log(p) \cdot (n/2)$. Moreover, the total number of computation steps
at all cores is approximately $k_s \cdot n \cdot \log n$. In summary,
\[
\begin{aligned}
\pi_{ncomp} &= \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta \\
\pi_{tcomm} &= \left( t_s \log p + t_w n \sqrt{p} \right) \\
S_{comp} &= k_s n \log n \, \beta \\
E_{comm} &= \frac{1}{2} e_{wh}\, n \sqrt{p} \log(p)
\end{aligned}
\]
where β is the number of cycles required for a single comparison.
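The $2n(1 - 1/p)$ term in $\pi_{ncomp}$ is the geometric sum of the partitioning work done by the initiating core, which partitions arrays of size n, n/2, n/4, ... until p cores are busy. A small sketch under the equal-split assumption stated above (the function name is illustrative):

```python
def partition_steps_on_critical_path(n, p):
    """Elements scanned by the initiating core while work is handed out to p cores.

    Assumes n and p are powers of two and every pivot splits its array evenly.
    """
    steps = 0
    size = n
    cores = 1
    while cores < p:
        steps += size   # partition the current sub-array
        size //= 2      # keep one half, hand the other half to a new core
        cores *= 2      # the number of busy cores doubles each round
    return steps
```

After log p rounds the initiating core is left with n/p elements, which it sorts sequentially; that accounts for the remaining $k_s (n/p) \log(n/p)$ term.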
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the naïve parallel quicksort algorithm on a 2-dimensional
mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm
matches the specified performance requirement. Assuming the performance target to be the time taken by the best
sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at
which all p cores should run is given by:
\[
X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}}
= \frac{\left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta}{k_s n \log n \, \beta \frac{1}{F} - \left( t_s \log p + t_w n \sqrt{p} \right)}
\]
The total active time at all the cores at the new frequency is given by $p\, n \log n \, \beta / F$. Therefore, the expression for the energy
consumption of the naïve parallel quicksort algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
&= E_d k_s n \log n \, \beta X^2 + \frac{1}{2} e_{wh}\, n \sqrt{p} \log(p) + E_l\, p\, n \log n \, \beta \frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg 2^{p}$, then $X \approx F/p$. Thus, the energy consumed by the naïve parallel
quicksort algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X is given by:
\[
E = E_d k_s n \log n \, \beta \frac{F^2}{p^2} + \frac{1}{2} e_{wh}\, n \sqrt{p} \log(p) + E_l n \log n \, \beta \tag{8.7}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = O\left( \left( \frac{\log n}{\log \log n} \right)^{2/5} \right) = O\left( \left( \frac{\log W}{\log \log W} \right)^{2/5} \right)
\]
Note that n should be exponentially greater than p for this asymptotic result to apply.
Energy Bounded Scalability
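We now evaluate the energy bounded scalability of the naïve parallel quicksort algorithm on a 2-dimensional mesh
interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p \left( \pi_{ncomp} \frac{1}{X} + \pi_{tcomm} \right) \\
&= p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta \frac{1}{X} + \left( t_s \log p + t_w n \sqrt{p} \right) \right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d k_s n \log n \, \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta + \left( t_s \log p + t_w n \sqrt{p} \right) X \right)
\end{aligned}
\]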
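Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp} \left( E - E_{comm} - E_l p\, \pi_{ncomp} \right)}}{2 E_d S_{comp}} \\
&\approx \frac{-E_l p \left( t_s \log p + t_w n \sqrt{p} \right)}{2 E_d k_s n \log n \, \beta} \\
&\quad + \frac{\sqrt{E_l^2 p^2 \left( t_s \log p + t_w n \sqrt{p} \right)^2 + 4 E_d k_s n \log n \, \beta \left( E_d n \log n \, \beta F^2 - \frac{1}{2} e_{wh}\, n \sqrt{p} \log(p) \right)}}{2 E_d k_s n \log n \, \beta}
\end{aligned}
\]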
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = \left( t_s \log p + t_w n \sqrt{p} \right) + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta \frac{1}{X} \tag{8.8}
\]
Asymptotic Analysis: Note that if $n \gg p^{2\sqrt{p}}$, then $X \approx F$. Thus, the time taken by the naïve parallel quicksort
algorithm running on p cores is given by:
\[
\text{Time Taken} = \left( t_s \log p + t_w n \sqrt{p} \right) + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta \frac{1}{F}
\]
The optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = O\left( (\log n)^{2/3} \right) = O\left( (\log W)^{2/3} \right)
\]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O((\log W)^{2/3})$. Note that n should be
greater than $p^{2\sqrt{p}}$ for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the naïve parallel quicksort algorithm on a 2-dimensional mesh
interconnect. The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores,
using the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d k_s n \log n \, \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta + \left( t_s \log p + t_w n \sqrt{p} \right) X \right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p,X) = \left( t_s \log p + t_w n \sqrt{p} \right) + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta \frac{1}{X} \tag{8.9}
\]
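We now frame an expression for the cost function C(p,X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p,X) &= \alpha \left( E_{comp} + E_{comm} + E_{leak} \right) + T(p,X) \\
&= \alpha \left( E_d k_s n \log n \, \beta X^2 + \frac{1}{2} e_{wh}\, n \sqrt{p} \log(p) \right) \\
&\quad + \alpha\, E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta + \left( t_s \log p + t_w n \sqrt{p} \right) X \right) \\
&\quad + \left( t_s \log p + t_w n \sqrt{p} \right) + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + 2n\left(1 - \frac{1}{p}\right) \right) \beta \frac{1}{X}
\end{aligned}
\]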
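Asymptotic Analysis: If $n \gg 2^{p}$, the cost expression of the naïve parallel quicksort algorithm on a 2-dimensional
mesh interconnect with p cores at frequency X can be approximated as
\[
C(p,X) \approx O\left( n \log n \, X^2 + n \sqrt{p} \log p + n \log n \, \frac{1}{pX} \right)
\]
The optimal number of cores and the optimal frequency required for minimum cost vary with problem size as follows:
\[
\begin{aligned}
p_{opt} &= O\left( \left( \frac{\log n}{\log \log n} \right)^{6/7} \right) = O\left( \left( \frac{\log W}{\log \log W} \right)^{6/7} \right) \\
F_{opt} &= O\left( \left( \frac{\log n}{\log \log n} \right)^{-2/7} \right) = O\left( \left( \frac{\log W}{\log \log W} \right)^{-2/7} \right)
\end{aligned}
\]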
8.3.2 Efficient Parallel Quicksort
The parallel quicksort formulation [87] works as follows. Let N be the number of elements to be sorted and $P = 2^b$ be
the number of cores available. Each core is assigned a block of N/P elements, and the labels of the cores {1, ..., P}
define the global order of the sorted sequence. For simplicity, we assume that the initial distribution of elements in
each core is uniform. The algorithm starts with all cores sorting their own set of elements (sequential quicksort). Then
Core 1 broadcasts the median of its elements to each of the other cores. This median acts as the pivot for partitioning
elements at all cores. Upon receiving the pivot, each core partitions its elements into elements smaller than the pivot
and elements larger than the pivot. Next, each Core i, where i ∈ {1, ..., P/2}, exchanges elements with Core i + P/2
such that Core i retains all the elements smaller than the pivot, and Core i + P/2 retains all elements larger than
the pivot. After this step, each Core i, i ∈ {1, ..., P/2}, stores elements smaller than the pivot, and the remaining cores
({P/2 + 1, ..., P}) store elements greater than the pivot. Upon receiving the elements, each core merges them with
its own set of elements so that all elements at the core remain sorted. The above procedure is performed recursively
for both sets of cores, splitting the elements further. After b recursions, all the elements are sorted with respect to the
global ordering imposed on the cores.
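The recursive pivot-exchange procedure can be simulated sequentially, with one Python list standing in for the memory of each core. This is a sketch rather than the formulation of [87]: the function names are illustrative, elements equal to the pivot are kept with the smaller side, and the pivot is taken from the first non-empty core of a group (an assumption added here so that the sketch also handles empty blocks).

```python
import heapq


def _sort_group(blocks, lo, hi):
    """Recursively order the cores blocks[lo:hi] around a broadcast pivot."""
    if hi - lo <= 1:
        return
    donor = next((b for b in blocks[lo:hi] if b), None)
    if donor is None:
        return  # nothing to sort in this group
    pivot = donor[len(donor) // 2]  # median of the donor's sorted block
    half = (hi - lo) // 2
    for i in range(lo, lo + half):
        j = i + half  # partner core in the upper half of the group
        small = heapq.merge([x for x in blocks[i] if x <= pivot],
                            [x for x in blocks[j] if x <= pivot])
        large = heapq.merge([x for x in blocks[i] if x > pivot],
                            [x for x in blocks[j] if x > pivot])
        # Merging keeps every block locally sorted after the exchange.
        blocks[i], blocks[j] = list(small), list(large)
    _sort_group(blocks, lo, lo + half)
    _sort_group(blocks, lo + half, hi)


def parallel_quicksort(blocks):
    """blocks: one list per core, with the number of cores a power of two."""
    blocks = [sorted(b) for b in blocks]  # initial local sort on every core
    _sort_group(blocks, 0, len(blocks))
    return blocks
```

Reading the returned blocks in core order yields the sorted sequence. Block sizes may become uneven when the pivots are poor, which the uniform-distribution assumption in the text rules out.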
Because all of the cores are busy all of the time, the critical path of this parallel algorithm is the execution path of
any one of the cores. The total number of computation steps in the critical path is $(n/p) \log p + k_s (n/p) \log(n/p)$,
where $k_s \approx 1.4$ is the quicksort constant. The total communication time in the critical path is $3(t_s + t_w (n/p))\sqrt{p}$.
The total energy spent on message transfers for this parallel algorithm running on p cores is approximately $e_{wh}\, n \sqrt{p}$.
Moreover, the total number of computation steps summed over all cores is approximately $k_s \cdot n \cdot \log(n)$. In summary,
\[
\begin{aligned}
\pi_{ncomp} &= \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta \\
\pi_{tcomm} &= 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p} \\
S_{comp} &= k_s n \log n \, \beta \\
E_{comm} &= e_{wh}\, n \sqrt{p}
\end{aligned}
\]
where β is the number of cycles required for a single comparison.
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the efficient parallel quicksort algorithm on a 2-dimensional
mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm
matches the specified performance requirement. Assuming the performance target to be the time taken by the best
sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at
which all p cores should run is given by:
\[
X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}}
= \frac{\left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta}{n \log n \, \beta \frac{1}{F} - 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p}}
\]
The total active time at all the cores at the new frequency is given by $p\, n \log n \, \beta / F$. Therefore, the expression for the energy
consumption of the efficient parallel quicksort algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
&= E_d k_s n \log n \, \beta X^2 + e_{wh}\, n \sqrt{p} + E_l\, p\, n \log n \, \beta \frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p$, then $X \approx F/p$. Thus, the energy consumed by the efficient parallel
quicksort algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X is given by:
\[
E = E_d k_s n \log n \, \beta \frac{F^2}{p^2} + e_{wh}\, n \sqrt{p} + E_l n \log n \, \beta \tag{8.10}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = O\left( (\log n)^{2/5} \right) = O\left( (\log W)^{2/5} \right)
\]
Thus, the asymptotic energy scalability under iso-performance of the efficient parallel quicksort algorithm on a
2-dimensional mesh interconnect is $O((\log W)^{2/5})$. Note that n should be greater than p for this asymptotic result
to apply.
Energy Bounded Scalability
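We now evaluate the energy bounded scalability of the efficient parallel quicksort algorithm on a 2-dimensional mesh
interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p \left( \pi_{ncomp} \frac{1}{X} + \pi_{tcomm} \right) \\
&= p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta \frac{1}{X} + 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p} \right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d k_s n \log n \, \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta + 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p}\, X \right)
\end{aligned}
\]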
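Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &= \frac{-E_l p\, \pi_{tcomm} + \sqrt{E_l^2 p^2 \pi_{tcomm}^2 + 4 E_d S_{comp} \left( E - E_{comm} - E_l p\, \pi_{ncomp} \right)}}{2 E_d S_{comp}} \\
&\approx \frac{-3 E_l p \left( t_s + t_w \frac{n}{p} \right) \sqrt{p} + \sqrt{9 E_l^2 p^3 \left( t_s + t_w \frac{n}{p} \right)^2 + 4 E_d k_s n \log n \, \beta \left( E_d n \log n \, \beta F^2 - e_{wh}\, n \sqrt{p} \right)}}{2 E_d k_s n \log n \, \beta}
\end{aligned}
\]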
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta \frac{1}{X} \tag{8.11}
\]
Asymptotic Analysis: Note that if $n \gg 2^{\sqrt{p}}$, then $X \approx F$. Thus, the time taken by the efficient parallel quicksort
algorithm running on p cores is given by:
\[
\text{Time Taken} = 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta \frac{1}{F}
\]
The optimal number of cores required for maximum performance under the energy budget is given by
\[
p_{opt} = O\left( (\log n)^2 \right) = O\left( (\log W)^2 \right)
\]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O((\log W)^2)$. Note that n should be
greater than $2^{\sqrt{p}}$ for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the efficient parallel quicksort algorithm on a 2-dimensional mesh
interconnect. The expression for the energy consumed by the parallel algorithm as a function of the frequency of the
cores, using the energy model, is given by
\[
\begin{aligned}
E_{comp} &= E_d k_s n \log n \, \beta X^2 \\
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta + 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p}\, X \right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p,X) = 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta \frac{1}{X} \tag{8.12}
\]
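We now frame an expression for the cost function C(p,X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p,X) &= \alpha \left( E_{comp} + E_{comm} + E_{leak} \right) + T(p,X) \\
&= \alpha \left( E_d k_s n \log n \, \beta X^2 + e_{wh}\, n \sqrt{p} \right) \\
&\quad + \alpha\, E_l\, p \left( \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta + 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p}\, X \right) \\
&\quad + 3 \left( t_s + t_w \frac{n}{p} \right) \sqrt{p} + \left( k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + \frac{n}{p} \log p \right) \beta \frac{1}{X}
\end{aligned}
\]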
Asymptotic Analysis: If $n \gg p$, the cost expression of the efficient parallel quicksort algorithm on a 2-dimensional
mesh interconnect with p cores at frequency X can be approximated as
\[
C(p,X) \approx O\left( n \log n \, X^2 + n \sqrt{p} + n \log n \, \frac{1}{pX} \right)
\]
The optimal number of cores and the optimal frequency required for minimum cost vary with problem size as follows:
\[
\begin{aligned}
p_{opt} &= O\left( (\log W)^{6/7} \right) \\
F_{opt} &= O\left( (\log W)^{-2/7} \right)
\end{aligned}
\]
8.4 Sample Sort
In sample sort, a sample of size s is selected from the n-element sequence, and the range of the buckets is determined
by sorting the sample and choosing m − 1 elements from the result. These elements (called splitters) divide the sample
into m equal-sized buckets. After defining the buckets, the algorithm sorts the elements in each bucket, yielding a
sorted sequence. The performance of sample sort depends on the sample size s and the way it is selected from the
n-element sequence. Consider a splitter selection scheme that guarantees that the number of elements ending up in each
bucket is roughly the same for all buckets. Let n be the number of elements to be sorted and m be the number of
buckets. The scheme works as follows. It divides the n elements into m blocks of size n/m each, and sorts each block
by using quicksort. From each sorted block it chooses m − 1 evenly spaced elements. The m(m − 1) elements selected
from all the blocks represent the sample used to determine the buckets. This scheme guarantees that the number of
elements ending up in each bucket is less than 2n/m.
We set m = p; thus, at the end of the algorithm, each process contains only the elements belonging to a single
bucket. Each process is assigned a block of n/p elements, which it sorts sequentially. It then chooses p − 1 evenly
spaced elements from the sorted block. Each process sends its p − 1 sample elements to one process, say $P_0$. Process
$P_0$ then sequentially sorts the p(p − 1) sample elements and selects the p − 1 splitters. Process $P_0$ then broadcasts the
p − 1 splitters to all the other processes. Next, each process, upon receiving the splitters, partitions its block of n/p
elements into p sub-blocks, one for each of the p buckets. Each process then sends its sub-blocks to the appropriate
processes (all-to-all personalized communication). After this step, each process has only the elements belonging
to the bucket assigned to it. Finally, each process sorts its bucket internally by using an optimal sequential sorting
algorithm.
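The steps above can be simulated sequentially, with one list per process. This is a sketch under stated assumptions: p divides n, the function name `sample_sort` is illustrative, and the exact indices used to pick evenly spaced samples and splitters are one reasonable choice rather than the scheme fixed by the source, so the 2n/m bucket-size bound is not checked here.

```python
import bisect


def sample_sort(values, p):
    """Return p sorted buckets whose concatenation is the sorted input."""
    n = len(values)
    if p <= 0 or n % p != 0 or n < p:
        raise ValueError("this sketch assumes p divides n and n >= p")
    size = n // p
    # Step 1: each process sorts its block of n/p elements.
    blocks = [sorted(values[i * size:(i + 1) * size]) for i in range(p)]
    # Step 2: each process contributes p - 1 evenly spaced samples.
    sample = sorted(b[(k * size) // p] for b in blocks for k in range(1, p))
    # Step 3: one process picks p - 1 splitters from the p(p - 1) samples.
    splitters = [sample[k * (p - 1)] for k in range(1, p)]
    # Step 4: all-to-all exchange, modeled as appending to the target bucket.
    buckets = [[] for _ in range(p)]
    for block in blocks:
        for x in block:
            buckets[bisect.bisect_left(splitters, x)].append(x)
    # Step 5: each process sorts its own bucket.
    return [sorted(bucket) for bucket in buckets]
```

Bucket k receives the elements x with splitters[k−1] < x ≤ splitters[k], so reading the buckets in process order gives the sorted sequence.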
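The total parallel communication time of the above procedure is given by
\[
\begin{aligned}
\pi_{tcomm} &= \left( 2 t_s (\sqrt{p} - 1) + t_w (p - 1)^2 \right) + \left( t_s + t_w (p - 1) \right) \log p + \left( 2 t_s + t_w \frac{n}{p} \right) (\sqrt{p} - 1) \\
&\approx 4 t_s \sqrt{p} + t_w \left( p^2 + \frac{n}{\sqrt{p}} \right)
\end{aligned}
\]
The number of computation steps in the critical path of the parallel algorithm is approximately
$2 k_s (n/p) \log(n/p) + k_s p^2 \log p^2 + p \log(n/p)$. Furthermore, the total number of computations at all cores is
$2 k_s n \log(n/p) + k_s p^2 \log p^2 + p^2 \log(n/p)$. The total energy spent on communication by the parallel algorithm is
approximately $e_{wh} (p^2 \sqrt{p} + n \sqrt{p})$. In summary,
\[
\begin{aligned}
\pi_{ncomp} &= \left( 2 k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + k_s p^2 \log p^2 + p \log\frac{n}{p} \right) \beta \\
\pi_{tcomm} &= \left( 4 t_s \sqrt{p} + t_w \left( p^2 + \frac{n}{\sqrt{p}} \right) \right) \\
S_{comp} &= \left( 2 k_s n \log\left(\frac{n}{p}\right) + k_s p^2 \log p^2 + p^2 \log\frac{n}{p} \right) \beta \\
E_{comm} &= e_{wh} \left( p^2 \sqrt{p} + n \sqrt{p} \right)
\end{aligned}
\]
where β is the number of cycles required for a single comparison.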
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel sample sort algorithm on a 2-dimensional
mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm
matches the specified performance requirement. Assuming the performance target to be the time taken by the best
sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at
which all p cores should run is given by:
\[
X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}}
= \frac{\left( 2 k_s \frac{n}{p} \log\left(\frac{n}{p}\right) + k_s p^2 \log p^2 + p \log\frac{n}{p} \right) \beta}{n \log n \, \beta \frac{1}{F} - \left( 4 t_s \sqrt{p} + t_w \left( p^2 + \frac{n}{\sqrt{p}} \right) \right)}
\]
The total active time at all the cores at the new frequency is given by $p\, n \log n \, \beta / F$. Therefore, the expression for the energy
consumption of the parallel sample sort algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
&= E_d \left( 2 k_s n \log\left(\frac{n}{p}\right) + k_s p^2 \log p^2 + p^2 \log\frac{n}{p} \right) \beta X^2 + e_{wh} \left( p^2 \sqrt{p} + n \sqrt{p} \right) + E_l\, p\, n \log n \, \beta \frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p^3$, then $X \approx F/p$. Thus, the energy consumed by the parallel sample sort
algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X is given by:
\[
E = E_d \left( 2 k_s n \log\left(\frac{n}{p}\right) + k_s p^2 \log p^2 + p^2 \log\frac{n}{p} \right) \beta \frac{F^2}{p^2} + e_{wh} \left( p^2 \sqrt{p} + n \sqrt{p} \right) + E_l n \log n \, \beta \tag{8.13}
\]
The optimal number of cores required for minimum energy consumption is given by
\[
p_{opt} = O\left( (\log n)^{2/5} \right) = O\left( (\log W)^{2/5} \right)
\]
Thus, the asymptotic energy scalability under iso-performance of the parallel sample sort algorithm on a 2-dimensional
mesh interconnect is $O((\log W)^{2/5})$. Note that n should be greater than $p^3$ for this asymptotic result to apply.
Energy Bounded Scalability
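We now evaluate the energy bounded scalability of the parallel sample sort algorithm on a 2-dimensional mesh
interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\mathit{pincomp}\,\frac{1}{X} + \mathit{pitcomm}\right) \\
&= p\left(\left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta\frac{1}{X} + \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right)\right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[ E_{comp} = E_d\left(2k_s n\log\frac{n}{p} + k_s p^2\log p^2 + p^2\log\frac{n}{p}\right)\beta X^2 \]
\[
\begin{aligned}
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\,p\left(\left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta + \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right)X\right)
\end{aligned}
\]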
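Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &\approx \frac{-E_l\,p\,\mathit{pitcomm} + \sqrt{E_l^2 p^2\,\mathit{pitcomm}^2 + 4E_d S_{comp}\left(E - E_{comm} - E_l\,p\,\mathit{pincomp}\right)}}{2E_d S_{comp}} \\
&= \frac{-E_l\,p\left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right)}{2E_d\left(2k_s n\log\frac{n}{p}\right)\beta} \\
&\quad + \frac{\sqrt{E_l^2 p^2\left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right)^2 + 4E_d\left(2k_s n\log\frac{n}{p}\right)\beta\left(E_d\,n\log n\,\beta F^2 - e_{wh}\left(p^2\sqrt{p} + n\sqrt{p}\right)\right)}}{2E_d\left(2k_s n\log\frac{n}{p}\right)\beta}
\end{aligned}
\]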
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[
\text{Time Taken} = \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right) + \left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta\frac{1}{X} \tag{8.14}
\]
Asymptotic Analysis: Note that if $n \gg 2^{\sqrt{p}}$, then $X \approx F$. Thus, the time taken by the parallel sample sort algorithm
running on p cores is given by:
\[
\text{Time Taken} = \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right) + \left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta\frac{1}{F}
\]
The optimal number of cores required for maximum performance under the energy budget is given by
\[ p_{opt} = O\left((\log n)^2\right) = O\left((\log W)^2\right) \]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O((\log W)^2)$. Note that n should be greater
than $2^{\sqrt{p}}$ for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel sample sort algorithm on a 2-dimensional mesh interconnect.
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using the
energy model, is given by
\[ E_{comp} = E_d\left(2k_s n\log\frac{n}{p} + k_s p^2\log p^2 + p^2\log\frac{n}{p}\right)\beta X^2 \]
\[
\begin{aligned}
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\,p\left(\left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta + \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right)X\right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[
T(p, X) = \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right) + \left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta\frac{1}{X} \tag{8.15}
\]
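We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p, X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X) \\
&= \alpha\left(E_d\left(2k_s n\log\frac{n}{p} + k_s p^2\log p^2 + p^2\log\frac{n}{p}\right)\beta X^2 + e_{wh}\left(p^2\sqrt{p} + n\sqrt{p}\right)\right) \\
&\quad + \alpha\,E_l\,p\left(\left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta + \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right)X\right) \\
&\quad + \left(4t_s\sqrt{p} + t_w\left(p^2 + \frac{n}{\sqrt{p}}\right)\right) + \left(2k_s\frac{n}{p}\log\frac{n}{p} + k_s p^2\log p^2 + p\log\frac{n}{p}\right)\beta\frac{1}{X}
\end{aligned}
\]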
Asymptotic Analysis: If $n \gg p^3$, the cost expression of the parallel sample sort algorithm on a 2-dimensional mesh
interconnect with p cores at frequency X can be approximated as
\[ C(p, X) \approx O\left(n\log n\,X^2 + n\sqrt{p} + n\log n\,\frac{1}{pX}\right) \]
where the middle term is the dominant part of $E_{comm} = e_{wh}(p^2\sqrt{p} + n\sqrt{p})$.
The optimal number of cores and the optimal frequency for minimum cost vary with problem size as follows:
\[ p_{opt} = O\left((\log W)^{6/7}\right) \]
\[ F_{opt} = O\left((\log W)^{-2/7}\right) \]
8.5 Summary
The results of this chapter are summarized in Table 8.1.

Parallel Algorithm            Energy scalability                          Energy bounded        Utility based
                              under iso-performance                       scalability           scalability
Bitonic Sort                  $O((\log W)^{2/5})$                         $O((\log W)^{2})$     $O((\log W)^{6/7})$
Odd-even Transposition Sort   $O((\log W)^{1/3})$                         $O(\log W)$           $O((\log W)^{3/5})$
Naive Quicksort               $O((\log W / \log\log W)^{2/5})$            $O((\log W)^{2/3})$   $O((\log W / \log\log W)^{6/7})$
Efficient Quicksort           $O((\log W)^{2/5})$                         $O((\log W)^{2})$     $O((\log W)^{6/7})$
Sample Sort                   $O((\log W)^{2/5})$                         $O((\log W)^{2})$     $O((\log W)^{6/7})$

Table 8.1: Scalability metrics of parallel sorting algorithms on 2D mesh interconnect
Chapter 9
Graph Algorithms
Graph theory plays an important role in computer science because it provides an easy and systematic way to model
many problems. Many problems can be expressed in terms of graphs and solved using standard graph
algorithms. This chapter analyzes parallel formulations of some important and fundamental graph algorithms with respect to our
scalability metrics. Here, we focus on algorithms for dense graphs and assume that dense graphs are represented by
an adjacency matrix.
9.1 Minimum Spanning Tree: Prim’s Algorithm
A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all vertices of G. In a weighted
graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph. A minimum spanning tree for
a weighted undirected graph is a spanning tree with minimum weight. Prim’s algorithm is a greedy method for finding
a minimum spanning tree (MST). The algorithm begins by selecting an arbitrary starting vertex. It then grows the minimum spanning tree by
choosing a new vertex and edge that are guaranteed to be in the minimum spanning tree. The algorithm continues
until all the vertices have been selected. We provide the code for the algorithm below.
PRIM MST(V, E, w, r)
1: VT = {r};
2: d[r] = 0;
3: for all v ∈ (V − VT ) do
4:   if edge(r, v) exists then
5:     set d[v] = w(r, v)
6:   else
7:     set d[v] = ∞
8:   end if
9: end for
10: while VT ≠ V do
11:   find a vertex u such that d[u] = min{d[v] | v ∈ (V − VT )};
12:   VT = VT ∪ {u}
13:   for all v ∈ (V − VT ) do
14:     d[v] = min{d[v], w(u, v)};
15:   end for
16: end while
In the above program, the body of the while loop (lines 11–15) is executed n − 1 times. Both the number of
comparisons performed for evaluating min{d[v] | v ∈ (V − VT )} (line 11) and the number of comparisons performed
in the for loop (lines 13 and 14) decrease by one in each iteration of the main loop. Thus, by simple arithmetic, the
overall number of comparisons done by the algorithm is around $n^2$ (ignoring lower order terms). Thus, the problem
size $W = O(n^2)$. The time taken by the algorithm on a single core running at the maximum frequency is given by
$T_{seq} = \beta \cdot n^2/F$, where β is the number of cycles required to compare the weights of two edges.
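As a minimal sketch of the sequential procedure described above (not the author's code), the following Python function runs Prim's algorithm on a dense graph stored as an adjacency matrix and counts the comparisons it performs. The names `prim_mst` and `INF`, and the convention that a missing edge is stored as `INF`, are assumptions made for this illustration.

```python
import math

INF = math.inf  # assumed marker for "no edge" in the adjacency matrix


def prim_mst(w, r=0):
    """Return (total MST weight, comparison count) for adjacency matrix w.

    Assumes an undirected, connected graph; w[u][v] is the edge weight or INF.
    """
    n = len(w)
    in_tree = [False] * n
    in_tree[r] = True
    # d[v] is the weight of the lightest edge joining v to the current tree.
    d = [w[r][v] for v in range(n)]
    d[r] = 0
    total, comparisons = 0, 0
    for _ in range(n - 1):
        # Select the vertex outside the tree that is closest to it.
        u, best = -1, INF
        for v in range(n):
            if not in_tree[v]:
                comparisons += 1
                if d[v] < best:
                    u, best = v, d[v]
        in_tree[u] = True
        total += best
        # Update the distances of the vertices that remain outside the tree.
        for v in range(n):
            if not in_tree[v]:
                comparisons += 1
                if w[u][v] < d[v]:
                    d[v] = w[u][v]
    return total, comparisons
```

With this counting convention the function performs (n − 1)^2 comparisons on a connected graph, which agrees with the estimate of about n^2 once lower order terms are ignored.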
We consider the parallel version of Prim’s algorithm in [61]. Let p be the number of cores, and let n be the number
of vertices in the graph. The set V is partitioned into p subsets such that each subset has n/p consecutive vertices.
The work associated with each subset is assigned to a different core. Let Vi be the subset of vertices assigned to core
Ci for i = 0, 1, · · · , p− 1. Each core Ci stores the part of the array d that corresponds to Vi. Each core Ci computes
di[u] = min{di[v]|v ∈ ((V \VT )∩Vi)} during each iteration of the while loop. The global minimum is then obtained
over all di[u] by sending the di[u] values from each core to core C0. Core C0 now holds the new vertex u, which will
be inserted into VT . Core C0 broadcasts u to all cores. The core Ci responsible for vertex u marks u as belonging to
set VT . Finally, each processor updates the values of d[v] for its local vertices. When a new vertex u is inserted into
VT , the values of d[v] for v ∈ (V \VT ) must be updated. The core responsible for v must know the weight of the edge
(u, v). Hence, each core Ci needs to store the columns of the weighted adjacency matrix corresponding to the set Vi
of vertices assigned to it.
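The parallel formulation can be illustrated with a sequential simulation. The sketch below is only an illustration under stated assumptions: the cores are modelled as blocks of consecutive vertex indices, p is assumed to divide n, and the reduction and broadcast are replaced by ordinary Python operations. The function name `parallel_prim_simulated` is hypothetical.

```python
import math

INF = math.inf  # assumed marker for "no edge"


def parallel_prim_simulated(w, p, r=0):
    """Simulate the block-partitioned formulation; return the MST weight.

    Assumes a connected, undirected graph and that p divides n.
    """
    n = len(w)
    block = n // p
    # Core i owns vertices [i*block, (i+1)*block) and the matching part of d.
    owned = [range(i * block, (i + 1) * block) for i in range(p)]
    in_tree = [False] * n
    in_tree[r] = True
    d = [w[r][v] for v in range(n)]
    d[r] = 0
    total = 0
    for _ in range(n - 1):
        # Step 1: every core finds its local minimum over unselected vertices.
        local_minima = []
        for vertices in owned:
            candidates = [(d[v], v) for v in vertices if not in_tree[v]]
            local_minima.append(min(candidates) if candidates else (INF, -1))
        # Step 2: the reduction at core 0 selects the global minimum.
        best, u = min(local_minima)
        # Step 3: core 0 broadcasts u, and its owner marks it as selected.
        in_tree[u] = True
        total += best
        # Step 4: every core updates d for its own vertices, using only the
        # columns of the weight matrix that belong to those vertices.
        for vertices in owned:
            for v in vertices:
                if not in_tree[v] and w[u][v] < d[v]:
                    d[v] = w[u][v]
    return total
```

Each simulated core touches only about n/p entries per iteration, which is where the n^2/p comparison count per core comes from.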
For the parallel algorithm, each core performs about $n^2/p$ comparisons. Each edge to add to the tree is selected
by each core finding the best local candidate, followed by a global reduction (single-node accumulation). Since finding
the global minimum and performing a one-to-all broadcast take $(t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)$ time on a p-processor
mesh, the communication time of the parallel algorithm is $n((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1))$. Furthermore, the energy spent on
communication for the whole algorithm is $(1/4)\,e_{wh}\,n\,p\log p$. The total number of computation steps (comparisons)
at all cores is, on average, $n^2$. In summary,
\[ \mathit{pincomp} = \frac{n^2}{p}\beta \]
\[ \mathit{pitcomm} = n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) \]
\[ S_{comp} = n^2\beta \]
\[ E_{comm} = \frac{1}{4}e_{wh}\,n\,p\log p \]
where β is the number of cycles required for a single comparison.
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel Prim’s minimum spanning tree algorithm
on a 2-dimensional mesh interconnect. We first scale the computation steps of the critical path so that the performance
of the parallel algorithm matches the specified performance requirement. Assuming the performance target to be the
time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F, the new
reduced frequency at which all p cores should run is given by:
\[
X = \frac{\mathit{pincomp}}{\text{Performance Target} - \mathit{pitcomm}}
  = \frac{\frac{n^2}{p}\beta}{n^2\beta\frac{1}{F} - n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)}
\]
The total active time at all the cores at the new frequency is given by $p\,n^2\beta(1/F)$. Therefore, the expression for the energy
consumption of the parallel Prim’s minimum spanning tree algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d\,n^2\beta X^2 + \frac{1}{4}e_{wh}\,n\,p\log p + E_l\,p\,n^2\beta\frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg \sqrt{p}$, then $X \approx F/p$. Thus, the energy consumed by the parallel Prim’s
minimum spanning tree algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X is given
by:
\[ E = E_d\,n^2\beta\frac{F^2}{p^2} + \frac{1}{4}e_{wh}\,n\,p\log p + E_l\,n^2\beta \tag{9.1} \]
The optimal number of cores required for minimum energy consumption is given by
\[ p_{opt} = O\left(\left(\frac{n}{\log n}\right)^{1/3}\right) = O\left(\frac{W^{1/6}}{(\log W)^{1/3}}\right) \]
Note that n should be greater than $\sqrt{p}$ for this asymptotic result to apply.
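The scaling of the optimal core count can be checked numerically. The following sketch evaluates Equation 9.1 with all model constants set to 1 and the logarithm taken to base 2 (both are assumptions made only for illustration), finds the minimizing p by brute force, and compares it with $(n/\log n)^{1/3}$. The function names and the search limit `p_max` are hypothetical.

```python
import math


def iso_performance_energy(n, p, E_d=1.0, E_l=1.0, e_wh=1.0, beta=1.0, F=1.0):
    """Energy of Equation 9.1 for n vertices on p cores (constants assumed)."""
    dynamic = E_d * n**2 * beta * F**2 / p**2
    communication = 0.25 * e_wh * n * p * math.log2(p)
    leakage = E_l * n**2 * beta
    return dynamic + communication + leakage


def optimal_cores(n, p_max=10_000):
    """Brute-force search for the core count that minimizes the energy."""
    return min(range(1, p_max + 1), key=lambda p: iso_performance_energy(n, p))


for n in (10**4, 10**6, 10**8):
    p_opt = optimal_cores(n)
    predicted = (n / math.log2(n)) ** (1 / 3)
    # The ratio should stay roughly constant if the asymptotic form holds.
    print(n, p_opt, round(p_opt / predicted, 2))
```

Under these assumptions the ratio between the brute-force optimum and the predicted growth rate stays within a narrow band as n grows by four orders of magnitude, which is what the O-bound suggests.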
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel Prim’s minimum spanning tree algorithm on a
2-dimensional mesh interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of
the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\mathit{pincomp}\,\frac{1}{X} + \mathit{pitcomm}\right) \\
&= p\left(\frac{n^2}{p}\beta\frac{1}{X} + n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)\right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[ E_{comp} = E_d\,n^2\beta X^2 \]
\[
\begin{aligned}
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\,p\left(\frac{n^2}{p}\beta + n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)X\right)
\end{aligned}
\]
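Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &\approx \frac{-E_l\,p\,\mathit{pitcomm} + \sqrt{E_l^2 p^2\,\mathit{pitcomm}^2 + 4E_d S_{comp}\left(E - E_{comm} - E_l\,p\,\mathit{pincomp}\right)}}{2E_d S_{comp}} \\
&= \frac{-E_l\,p\,n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)}{2E_d\,n^2\beta} \\
&\quad + \frac{\sqrt{E_l^2 p^2 n^2\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)^2 + 4E_d\,n^2\beta\left(E_d\,n^2\beta F^2 - \frac{1}{4}e_{wh}\,n\,p\log p - E_l\,n^2\beta\right)}}{2E_d\,n^2\beta}
\end{aligned}
\]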
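The frequency expression is the positive root of the quadratic obtained by setting $E_{comp} + E_{comm} + E_{leak}$ equal to the budget. A minimal Python sketch of that step follows; the function names and all numeric parameter values are assumptions chosen only to exercise the formula, and logarithms are taken to base 2.

```python
import math


def frequency_for_budget(E, E_comm, E_d, E_l, S_comp, p, pincomp, pitcomm):
    """Positive root X of E_d*S_comp*X^2 + E_l*p*pitcomm*X = E - E_comm - E_l*p*pincomp."""
    a = E_d * S_comp
    b = E_l * p * pitcomm
    c = E - E_comm - E_l * p * pincomp
    if c < 0:
        raise ValueError("budget too small for communication and leakage")
    return (-b + math.sqrt(b * b + 4 * a * c)) / (2 * a)


def total_energy(X, E_comm, E_d, E_l, S_comp, p, pincomp, pitcomm):
    """E_comp + E_comm + E_leak at frequency X, following the model in the text."""
    return E_d * S_comp * X**2 + E_comm + E_l * p * (pincomp + pitcomm * X)


# Illustrative (assumed) parameter values for the parallel Prim model.
n, p, beta, F = 1000, 16, 1.0, 1.0
E_d, E_l, e_wh = 1.0, 0.1, 1.0
t_s = t_w = t_h = 1e-3
model = dict(
    E_comm=0.25 * e_wh * n * p * math.log2(p),
    E_d=E_d, E_l=E_l, S_comp=n**2 * beta, p=p,
    pincomp=n**2 * beta / p,
    pitcomm=n * ((t_s + t_w) * math.log2(p) + 2 * t_h * (math.sqrt(p) - 1)),
)
E_budget = E_d * n**2 * beta * F**2  # budget used in the expanded formula above
X = frequency_for_budget(E_budget, **model)
```

Substituting the returned X back into `total_energy` reproduces the budget, which is a quick way to confirm that the root was taken with the correct sign.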
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[ \text{Time Taken} = n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^2}{p}\beta\frac{1}{X} \tag{9.2} \]
Asymptotic Analysis: Note that if $n \gg p\log p$, then $X \approx F$. Thus, the time taken by the parallel Prim’s minimum
spanning tree algorithm running on p cores is given by:
\[ \text{Time Taken} = n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^2}{p}\beta\frac{1}{F} \]
The optimal number of cores required for maximum performance under the energy budget is given by
\[ p_{opt} = O\left(n^{2/3}\right) = O\left(W^{1/3}\right) \]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O(W^{1/3})$. Note that n should be greater than
$p\log p$ for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel Prim’s minimum spanning tree algorithm on a 2-dimensional
mesh interconnect. The expression for the energy consumed by the parallel algorithm as a function of the frequency of the
cores, using the energy model, is given by
\[ E_{comp} = E_d\,n^2\beta X^2 \]
\[
\begin{aligned}
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\,p\left(\frac{n^2}{p}\beta + n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)X\right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[ T(p, X) = n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^2}{p}\beta\frac{1}{X} \tag{9.3} \]
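We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p, X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X) \\
&= \alpha\left(E_d\,n^2\beta X^2 + \frac{1}{4}e_{wh}\,n\,p\log p\right) \\
&\quad + \alpha\,E_l\,p\left(\frac{n^2}{p}\beta + n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right)X\right) \\
&\quad + n\left((t_s + t_w)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^2}{p}\beta\frac{1}{X}
\end{aligned}
\]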
Asymptotic Analysis: The cost expression of the parallel Prim’s minimum spanning tree algorithm on a 2-dimensional
mesh interconnect with p cores at frequency X can be approximated as
\[ C(p, X) \approx O\left(n^2X^2 + n\,p\log p + \frac{n^2}{pX}\right) \]
The optimal number of cores and the optimal frequency for minimum cost vary with problem size as follows:
\[ p_{opt} = O\left(\left(\frac{n}{\log n}\right)^{3/5}\right) = O\left(\frac{W^{3/10}}{(\log W)^{3/5}}\right) \]
\[ F_{opt} = O\left(\left(\frac{n}{\log n}\right)^{-1/5}\right) = O\left(\frac{W^{-1/10}}{(\log W)^{-1/5}}\right) \]
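The exponents above can be recovered with a short calculation. The following is a sketch only: it drops all constant factors, treats p as a continuous variable, and uses $\log p = \Theta(\log n)$ at the optimum.

```latex
% Step 1: minimize over X for fixed p.
\frac{\partial C}{\partial X} = 0
  \;\Rightarrow\; 2n^2 X = \frac{n^2}{pX^2}
  \;\Rightarrow\; X = \Theta\!\left(p^{-1/3}\right)

% Step 2: substitute X back into the cost.
C(p) \approx \Theta\!\left(n^2 p^{-2/3} + n\,p\log p\right)

% Step 3: balance the two terms' derivatives with respect to p.
n^2 p^{-5/3} \sim n\log p
  \;\Rightarrow\; p^{5/3} \sim \frac{n}{\log n}
  \;\Rightarrow\; p_{opt} = \Theta\!\left(\left(\frac{n}{\log n}\right)^{3/5}\right),
  \quad
  F_{opt} = \Theta\!\left(p_{opt}^{-1/3}\right)
          = \Theta\!\left(\left(\frac{n}{\log n}\right)^{-1/5}\right)
```

Substituting $n = W^{1/2}$ then gives the forms stated in terms of W.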
9.2 Single-Source Shortest Paths: Dijkstra’s Algorithm
For a weighted graph G = (V,E,w), the single-source shortest paths problem is to find the shortest paths from a
vertex v ∈ V to all other vertices in V . A shortest path from u to v is a minimum-weight path. In this section, we
consider Dijkstra’s algorithm, which solves the single source shortest-paths problem on both directed and undirected
graphs with non-negative weights. Dijkstra’s algorithm, which finds the shortest paths from a single vertex s, is similar
to Prim’s minimum spanning tree algorithm. Like Prim’s algorithm, it incrementally finds the shortest paths from s to
the other vertices of G. It is also greedy; that is, it always chooses an edge to a vertex that appears closest. Dijkstra’s
algorithm and Prim’s minimum spanning tree algorithm are almost identical. The main difference is that, for each
vertex u ∈ (V − VT ), Dijkstra’s algorithm stores l[u], the minimum cost to reach vertex u from vertex s by means of
vertices in VT ; Prim’s algorithm stores d[u], the cost of the minimum-cost edge connecting a vertex in VT to u. The
run time of Dijkstra’s algorithm is $\Theta(n^2)$.
The parallel formulation of Dijkstra’s single-source shortest path algorithm is very similar to the parallel for-
mulation of Prim’s algorithm for minimum spanning trees. The weighted adjacency matrix is partitioned using the
block mapping. Each of the p processes is assigned n/p consecutive columns of the weighted adjacency matrix, and
computes n/p values of the array l. During each iteration, all processes perform computation and communication
similar to that performed by the parallel formulation of Prim’s algorithm. Consequently, the parallel performance and
scalability of Dijkstra’s single-source shortest path algorithm are identical to those of Prim’s minimum spanning tree
algorithm.
9.3 All-Pairs Shortest Paths: Floyd’s Algorithm
Instead of finding the shortest paths from a single vertex v to every other vertex, we are sometimes interested in finding
the shortest paths between all pairs of vertices. Formally, given a weighted graph G(V, E, w), the all-pairs shortest
paths problem is to find the shortest paths between all pairs of vertices $v_i, v_j \in V$ such that $i \neq j$. For a graph with n
vertices, the output of an all-pairs shortest paths algorithm is an n × n matrix $D = (d_{i,j})$ such that $d_{i,j}$ is the cost of
the shortest path from vertex $v_i$ to vertex $v_j$.
Floyd’s algorithm for solving the all-pairs shortest paths problem is based on the following observation. Let
G = (V, E, w) be the weighted graph, and let $V = \{v_1, \ldots, v_n\}$ be the vertices of G. Consider a subset $\{v_1, \ldots, v_k\}$
of vertices for some k where k ≤ n. For any pair of vertices $v_i, v_j \in V$, consider all paths from $v_i$ to $v_j$ whose
intermediate vertices belong to the set $\{v_1, \ldots, v_k\}$. Let $P^{(k)}_{i,j}$ be the minimum-weight path among them, and let $d^{(k)}_{i,j}$
be the weight of $P^{(k)}_{i,j}$. If vertex $v_k$ is not in the shortest path from $v_i$ to $v_j$, then $P^{(k)}_{i,j}$ is the same as $P^{(k-1)}_{i,j}$. However,
if $v_k$ is in $P^{(k)}_{i,j}$, then we can break $P^{(k)}_{i,j}$ into two paths, one from $v_i$ to $v_k$ and one from $v_k$ to $v_j$. Each of these paths
uses vertices from $\{v_1, v_2, \ldots, v_{k-1}\}$. Thus, in this case, $d^{(k)}_{i,j} = d^{(k-1)}_{i,k} + d^{(k-1)}_{k,j}$. The length of the shortest path from $v_i$ to $v_j$ is
given by $d^{(n)}_{i,j}$. In general, the solution is a matrix $D^{(n)} = (d^{(n)}_{i,j})$. Floyd’s algorithm solves the recursion bottom up in
the order of increasing values of k. The run time complexity of the sequential algorithm is $\Theta(n^3)$. Thus, the problem size
$W = O(n^3)$. Note that only matrix $D^{(k-1)}$ is needed while computing matrix $D^{(k)}$.
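The recursion can be sketched in a few lines of Python. This is an illustrative implementation, not code from the dissertation: it combines the two cases of the observation by taking the smaller of the path that avoids $v_k$ and the path through $v_k$, and it assumes that a missing edge is stored as `INF` and that the graph has no negative-weight cycles. The names `floyd_all_pairs` and `INF` are hypothetical.

```python
import math

INF = math.inf  # assumed marker for "no edge"


def floyd_all_pairs(w):
    """Return the matrix of shortest-path costs for adjacency matrix w."""
    n = len(w)
    # D^(0): direct edges, with zero cost from every vertex to itself.
    d = [[0 if i == j else w[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        # D^(k) depends only on D^(k-1). Row k and column k do not change
        # during iteration k, so the update can safely be done in place.
        for i in range(n):
            for j in range(n):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d
```

The three nested loops over n values each make the $\Theta(n^3)$ running time directly visible.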
A generic parallel formulation of Floyd’s algorithm assigns the task of computing matrix $D^{(k)}$ for each value of k
to a set of processes. Let p be the number of processes available. Matrix $D^{(k)}$ is partitioned into p parts, and each part
is assigned to a process. Each process computes the $D^{(k)}$ values of its partition. To accomplish this, a process must
access the corresponding segments of the kth row and column of matrix $D^{(k-1)}$. One way to partition matrix $D^{(k)}$ is
to use the block checkerboard mapping. Specifically, matrix $D^{(k)}$ is divided into p squares of size $(n/\sqrt{p}) \times (n/\sqrt{p})$,
and each square is assigned to one of the p processors. These p processors are arranged on a 2-D mesh of size $\sqrt{p} \times \sqrt{p}$.
We refer to the processor on the ith row and the jth column as $P_{i,j}$. Each processor updates its part of the matrix
during each iteration. During the kth iteration of the algorithm, each processor $P_{i,j}$ needs certain segments of the
kth row and kth column of the $D^{(k-1)}$ matrix. Segments are transferred as follows. During the kth iteration of the
algorithm, each of the $\sqrt{p}$ processors containing part of the kth row sends it to the $\sqrt{p} - 1$ processors in the same
column. Similarly, each of the $\sqrt{p}$ processors containing part of the kth column sends it to the $\sqrt{p} - 1$ processors in the
same row.
During each iteration of the algorithm, the kth row and kth column of processors perform a one-to-all broadcast
along a row or a column of $\sqrt{p}$ processors. Each such processor has $n/\sqrt{p}$ elements of the kth row or column,
so it sends $n/\sqrt{p}$ elements. Thus, the total time taken and the energy spent during the communication phases are
$n\left(2\left(t_s + t_w\,n/\sqrt{p}\right)\log\sqrt{p} + 2t_h(\sqrt{p} - 1)\right)$ and $(1/2)\,e_{wh}\,n^2\sqrt{p}\log p$, respectively. Since each processor is assigned $n^2/p$
elements of the $D^{(k)}$ matrix, the number of computation steps required to compute the corresponding $D^{(k)}$ matrix is $n^2/p$.
In summary,
\[ \mathit{pincomp} = \frac{n^3}{p}\beta \]
\[ \mathit{pitcomm} = n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right) \]
\[ S_{comp} = n^3\beta \]
\[ E_{comm} = \frac{1}{2}e_{wh}\,n^2\sqrt{p}\log p \]
where β is the number of cycles required for a single addition.
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel Floyd’s all-pairs shortest paths algorithm
on a 2-dimensional mesh interconnect. We first scale the computation steps of the critical path so that the performance
of the parallel algorithm matches the specified performance requirement. Assuming the performance target to be the
time taken by the best sequential algorithm on a single core (processor) operating at maximum frequency F, the new
reduced frequency at which all p cores should run is given by:
\[
X = \frac{\mathit{pincomp}}{\text{Performance Target} - \mathit{pitcomm}}
  = \frac{\frac{n^3}{p}\beta}{n^3\beta\frac{1}{F} - n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right)}
\]
The total active time at all the cores at the new frequency is given by $p\,n^3\beta(1/F)$. Therefore, the expression for the energy
consumption of the parallel Floyd’s all-pairs shortest paths algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d\,n^3\beta X^2 + \frac{1}{2}e_{wh}\,n^2\sqrt{p}\log p + E_l\,p\,n^3\beta\frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that if $n \gg p^{1/4}$, then $X \approx F/p$. Thus, the energy consumed by the parallel Floyd’s
all-pairs shortest paths algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X is given
by:
\[ E = E_d\,n^3\beta\frac{F^2}{p^2} + \frac{1}{2}e_{wh}\,n^2\sqrt{p}\log p + E_l\,n^3\beta \tag{9.4} \]
The optimal number of cores required for minimum energy consumption is given by
\[ p_{opt} = O\left(\left(\frac{n}{\log n}\right)^{2/5}\right) = O\left(\frac{W^{2/15}}{(\log W)^{2/5}}\right) \]
Thus, the asymptotic energy scalability under iso-performance of the parallel Floyd’s all-pairs shortest paths algorithm
on a 2-dimensional mesh interconnect is $O\left(W^{2/15}/(\log W)^{2/5}\right)$. Note that n should be greater than $p^{1/4}$ for this asymptotic result to apply.
Energy Bounded Scalability
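We now evaluate the energy bounded scalability of the parallel Floyd’s all-pairs shortest paths algorithm on a
2-dimensional mesh interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of
the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\mathit{pincomp}\,\frac{1}{X} + \mathit{pitcomm}\right) \\
&= p\left(\frac{n^3}{p}\beta\frac{1}{X} + n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right)\right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[ E_{comp} = E_d\,n^3\beta X^2 \]
\[
\begin{aligned}
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\,p\left(\frac{n^3}{p}\beta + n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right)X\right)
\end{aligned}
\]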
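Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should
run is given by
\[
\begin{aligned}
X &\approx \frac{-E_l\,p\,\mathit{pitcomm} + \sqrt{E_l^2 p^2\,\mathit{pitcomm}^2 + 4E_d S_{comp}\left(E - E_{comm} - E_l\,p\,\mathit{pincomp}\right)}}{2E_d S_{comp}} \\
&= \frac{-E_l\,p\,n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right)}{2E_d\,n^3\beta} \\
&\quad + \frac{\sqrt{E_l^2 p^2 n^2\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right)^2 + 4E_d\,n^3\beta\left(E_d\,n^3\beta F^2 - \frac{1}{2}e_{wh}\,n^2\sqrt{p}\log p - E_l\,n^3\beta\right)}}{2E_d\,n^3\beta}
\end{aligned}
\]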
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[ \text{Time Taken} = n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^3}{p}\beta\frac{1}{X} \tag{9.5} \]
Asymptotic Analysis: Note that if $n \gg \sqrt{p}\log p$, then $X \approx F$. Thus, the time taken by the parallel Floyd’s all-pairs
shortest paths algorithm running on p cores is given by:
\[ \text{Time Taken} = n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^3}{p}\beta\frac{1}{F} \]
The optimal number of cores required for maximum performance under the energy budget is given by
\[ p_{opt} = O\left(n^{4/3}\right) = O\left(W^{4/9}\right) \]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O(n^{4/3})$, that is, $O(W^{4/9})$. Note that n should be greater than
$\sqrt{p}\log p$ for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional
mesh interconnect. The expression for the energy consumed by the parallel algorithm as a function of the frequency of the
cores, using the energy model, is given by
\[ E_{comp} = E_d\,n^3\beta X^2 \]
\[
\begin{aligned}
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\,p\left(\frac{n^3}{p}\beta + n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right)X\right)
\end{aligned}
\]
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
\[ T(p, X) = n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^3}{p}\beta\frac{1}{X} \tag{9.6} \]
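We now frame an expression for the cost function C(p, X) of the parallel algorithm using Eq. 5.5:
\[
\begin{aligned}
C(p, X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p, X) \\
&= \alpha\left(E_d\,n^3\beta X^2 + \frac{1}{2}e_{wh}\,n^2\sqrt{p}\log p\right) \\
&\quad + \alpha\,E_l\,p\left(\frac{n^3}{p}\beta + n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right)X\right) \\
&\quad + n\left(\left(t_s + t_w\frac{n}{\sqrt{p}}\right)\log p + 2t_h(\sqrt{p} - 1)\right) + \frac{n^3}{p}\beta\frac{1}{X}
\end{aligned}
\]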
Asymptotic Analysis: The cost expression of the parallel Floyd’s all-pairs shortest paths algorithm on a 2-dimensional
mesh interconnect with p cores at frequency X can be approximated as
\[ C(p, X) \approx O\left(n^3X^2 + n^2\sqrt{p}\log p + \frac{n^3}{pX}\right) \]
The optimal number of cores and the optimal frequency for minimum cost vary with problem size as follows:
\[ p_{opt} = O\left(\left(\frac{n}{\log n}\right)^{6/7}\right) = O\left(\frac{W^{2/7}}{(\log W)^{6/7}}\right) \]
\[ F_{opt} = O\left(\left(\frac{n}{\log n}\right)^{-2/7}\right) = O\left(\frac{W^{-2/21}}{(\log W)^{-2/7}}\right) \]
9.4 Summary
The results of this chapter are summarized in Table 9.1.

Parallel Algorithm                                   Energy scalability                  Energy bounded    Utility based
                                                     under iso-performance               scalability       scalability
Minimum Spanning Tree: Prim’s Algorithm              $O(W^{1/6}/(\log W)^{1/3})$         $O(W^{1/3})$      $O(W^{3/10}/(\log W)^{3/5})$
Single-Source Shortest Paths: Dijkstra’s Algorithm   $O(W^{1/6}/(\log W)^{1/3})$         $O(W^{1/3})$      $O(W^{3/10}/(\log W)^{3/5})$
All-Pairs Shortest Paths: Floyd’s Algorithm          $O(W^{2/15}/(\log W)^{2/5})$        $O(W^{4/9})$      $O(W^{2/7}/(\log W)^{6/7})$

Table 9.1: Scalability metrics of graph algorithms on 2D mesh interconnect
Chapter 10
Fast Fourier Transform
The discrete Fourier transform (DFT) plays an important role in many scientific and technical applications, including
time series and waveform analysis, solutions to linear partial differential equations, convolution, digital signal
processing, and image filtering. The DFT is a linear transformation that maps n regularly sampled points from a cycle of
a periodic signal, like a sine wave, onto an equal number of points representing the frequency spectrum of the signal.
Several different forms of the fast Fourier transform (FFT) algorithm exist. This chapter discusses its simplest form, the one-dimensional,
unordered, radix-2 FFT. In this chapter we discuss two parallel formulations of the basic algorithm: the binary-exchange
algorithm and the transpose algorithm.
Consider a sequence $X = \langle X[0], X[1], \ldots, X[n-1] \rangle$ of length n. The discrete Fourier transform of the sequence
X is the sequence $Y = \langle Y[0], Y[1], \ldots, Y[n-1] \rangle$, where
\[ Y[i] = \sum_{k=0}^{n-1} X[k]\,\omega^{ki}, \qquad 0 \le i < n \]
where ω is the primitive nth root of unity in the complex plane. More generally, the powers of ω in the equation
can be thought of as elements of the finite commutative ring of integers modulo n. The powers of ω used in an FFT
computation are also known as twiddle factors. The computation of each Y[i] according to the above equation requires
n complex multiplications. Therefore, the sequential complexity of computing the entire sequence Y of length n is
$\Theta(n^2)$. The fast Fourier transform algorithm described below reduces this complexity to $\Theta(n\log n)$. Assume that n is
a power of two. The FFT algorithm is based on the following step that permits an n-point DFT computation to be split
into two (n/2)-point DFT computations:
\[ Y[i] = \sum_{k=0}^{(n/2)-1} X[2k]\,\omega^{2ki} + \omega^{i}\sum_{k=0}^{(n/2)-1} X[2k+1]\,\omega^{2ki} \]
Here, each of the two summations on the right-hand side is an (n/2)-point DFT computation. If n is a power of two,
each of these DFT computations can be divided similarly into smaller computations in a recursive manner. This leads
to the recursive FFT algorithm. This FFT algorithm is called the radix-2 algorithm because at each level of recursion,
the input sequence is split into two equal halves. The size of the input sequence over which an FFT is computed
recursively decreases by a factor of two at each level of recursion. Hence, the maximum number of levels of recursion
is log n for an initial sequence of length n. At the mth level of recursion, $2^m$ FFTs of size $n/2^m$ each are computed.
Thus, the total number of arithmetic operations at each level is $\Theta(n)$ and the overall sequential complexity of this
algorithm is $\Theta(n\log n)$. Thus, the problem size $W = O(n\log n)$.
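The splitting step can be turned into a short recursive routine. The sketch below is illustrative rather than the author's code: it takes $\omega = e^{2\pi\sqrt{-1}/n}$ (the sign convention is an assumption, since the text does not fix it), requires the input length to be a power of two, and includes a direct evaluation of the defining sum for comparison. The names `fft_radix2` and `dft_direct` are hypothetical.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two.

    Uses omega = exp(2*pi*sqrt(-1)/n), an assumed sign convention.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # (n/2)-point DFT of X[0], X[2], ...
    odd = fft_radix2(x[1::2])   # (n/2)-point DFT of X[1], X[3], ...
    y = [0j] * n
    for i in range(n // 2):
        twiddle = cmath.exp(2j * cmath.pi * i / n)  # omega^i
        y[i] = even[i] + twiddle * odd[i]
        # omega^(i + n/2) = -omega^i, and both half-size DFTs have period n/2.
        y[i + n // 2] = even[i] - twiddle * odd[i]
    return y


def dft_direct(x):
    """Direct evaluation of the defining sum, for comparison."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(2j * cmath.pi * k * i / n) for k in range(n))
        for i in range(n)
    ]
```

Each level of the recursion does a linear amount of work in the combining loop, and there are log n levels, which matches the $\Theta(n\log n)$ bound.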
10.1 The Binary-Exchange Algorithm
This section discusses the binary-exchange algorithm for performing FFT on a parallel computer. First, a
decomposition is induced by partitioning the input or the output vector. Therefore, each task starts with one element of the input
vector and computes the corresponding element of the output. If each task is assigned the same label as the index of
its input or output element, then in each of the log n iterations of the algorithm, exchange of data takes place between
pairs of tasks with labels differing in one bit position.
Assume that an n-point FFT is computed on a p-processor mesh with $\sqrt{p}$ rows and $\sqrt{p}$ columns and that $\sqrt{p}$ is a
power of two. Let $n = 2^r$ and $p = 2^d$. Also assume that the processors are labeled in a row-major fashion and that the
data are distributed such that an element with index $(b_0 b_1 \ldots b_{r-1})$ is mapped onto the processor labeled $(b_0 b_1 \ldots b_{d-1})$.
Note that communication takes place during the first log p iterations between processors whose labels differ in one
bit. More precisely, in $\log\sqrt{p}$ of the log p steps that require communication, the communicating processors are in the
same row, and in the remaining $\log\sqrt{p}$ steps, they are in the same column. The distance between the communicating
processors in a row or a column grows from one link to $\sqrt{p}/2$ links, doubling in each of the $\log\sqrt{p}$ steps. The same
relationship holds for any processor in the mesh. Thus, the total time spent in performing row-wise communication
is $\sum_{m=0}^{d/2-1}\left(t_s + t_w(n/p)2^m\right)$ for a mesh. An equal amount of time is spent in column-wise communication. Since all
processors do the same amount of work, the total energy spent during the communication is $2p\sum_{m=0}^{d/2-1} e_{wh}(n/p)2^m$.
Note that each processor performs n/p computation steps (complex multiplication and addition) in each of the log n
iterations. In summary,
\[ \mathit{pincomp} = \frac{n}{p}\log n\,\beta \]
\[ \mathit{pitcomm} = t_s\log p + 2t_w\frac{n}{\sqrt{p}} \]
\[ S_{comp} = n\log n\,\beta \]
\[ E_{comm} = \frac{1}{4}e_{wh}\,n\sqrt{p} \]
where β is the number of cycles required for a complex multiplication and addition operation.
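The communication-time entry in the summary follows from the row-wise sum by a geometric series. The steps below are a sketch of that calculation, using $p = 2^d$ (so $2^{d/2} = \sqrt{p}$ and $d/2 = \log\sqrt{p}$) and approximating $\sqrt{p} - 1$ by $\sqrt{p}$ for large p.

```latex
% Row-wise communication time: geometric series in 2^m.
\sum_{m=0}^{d/2-1}\left(t_s + t_w\frac{n}{p}2^m\right)
  = \frac{d}{2}\,t_s + t_w\frac{n}{p}\left(2^{d/2} - 1\right)
  = t_s\log\sqrt{p} + t_w\frac{n}{p}\left(\sqrt{p} - 1\right)

% Row-wise plus column-wise communication: twice the sum above.
2\left(t_s\log\sqrt{p} + t_w\frac{n}{p}\left(\sqrt{p} - 1\right)\right)
  = t_s\log p + 2t_w\frac{n}{p}\left(\sqrt{p} - 1\right)
  \approx t_s\log p + 2t_w\frac{n}{\sqrt{p}}
```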
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel binary-exchange algorithm on a 2-dimensional
mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel algorithm
matches the specified performance requirement. Assuming the performance target to be the time taken by the best
sequential algorithm on a single core (processor) operating at maximum frequency F, the new reduced frequency at
which all p cores should run is given by:
\[
X = \frac{\mathit{pincomp}}{\text{Performance Target} - \mathit{pitcomm}}
  = \frac{\frac{n}{p}\log n\,\beta}{\frac{n\log n\,\beta}{F} - \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right)}
\]
The total active time at all the cores at the new frequency is given by $p\,n\log n\,\beta(1/F)$. Therefore, the expression for the
energy consumption of the parallel binary-exchange algorithm as per equation 5.1 is given by
\[
\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
  &= E_d\,n\log n\,\beta X^2 + \frac{1}{4}e_{wh}\,n\sqrt{p} + E_l\,p\,n\log n\,\beta\frac{X}{F}
\end{aligned}
\]
Asymptotic Analysis: Note that $X \approx F/p$. Thus, the energy consumed by the parallel binary-exchange algorithm
on a 2-dimensional mesh interconnect with p cores running at frequency X is given by:
\[ E = E_d\,n\log n\,\beta\frac{F^2}{p^2} + \frac{1}{4}e_{wh}\,n\sqrt{p} + E_l\,n\log n\,\beta \tag{10.1} \]
The optimal number of cores required for minimum energy consumption is given by
\[ p_{opt} = O\left((\log n)^{2/5}\right) = O\left((\log W)^{2/5}\right) \]
Thus, the asymptotic energy scalability under iso-performance of the parallel binary-exchange algorithm on a 2-dimensional
mesh interconnect is $O((\log n)^{2/5})$.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel binary-exchange algorithm on a 2-dimensional mesh
interconnect. The total active time ($T_{active}$) at all the cores as a function of the frequency of the cores is given by
\[
\begin{aligned}
T_{active} &= p\left(\mathit{pincomp}\,\frac{1}{X} + \mathit{pitcomm}\right) \\
&= p\left(\frac{n}{p}\log n\,\beta\frac{1}{X} + \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right)\right)
\end{aligned}
\]
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model, is given by
\[ E_{comp} = E_d\,n\log n\,\beta X^2 \]
\[
\begin{aligned}
E_{leak} &= E_l \cdot T_{active} \cdot X \\
&= E_l\,p\left(\frac{n}{p}\log n\,\beta + \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right)X\right)
\end{aligned}
\]
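Here, we fix the energy budget E to be the energy required by the sequential algorithm running on a single
core at maximum frequency F ($E_{seq}$). Given the energy budget $E_{seq}$, the frequency X at which the cores should run
is given by
\[
\begin{aligned}
X &\approx \frac{-E_l\,p\,\mathit{pitcomm} + \sqrt{E_l^2 p^2\,\mathit{pitcomm}^2 + 4E_d S_{comp}\left(E - E_{comm} - E_l\,p\,\mathit{pincomp}\right)}}{2E_d S_{comp}} \\
&= \frac{-E_l\,p\left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right)}{2E_d\,n\log n\,\beta} \\
&\quad + \frac{\sqrt{E_l^2 p^2\left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right)^2 + 4E_d\,n\log n\,\beta\left(E_d\,n\log n\,\beta F^2 - \frac{1}{4}e_{wh}\,n\sqrt{p} - E_l\,n\log n\,\beta\right)}}{2E_d\,n\log n\,\beta}
\end{aligned}
\]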
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
\[ \text{Time Taken} = \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right) + \frac{n}{p}\log n\,\beta\frac{1}{X} \tag{10.2} \]
Asymptotic Analysis: Note that if $n \gg 2^{\sqrt{p}}$, then $X \approx F$. Thus, the time taken by the parallel binary-exchange
algorithm running on p cores is given by:
\[ \text{Time Taken} = \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right) + \frac{n}{p}\log n\,\beta\frac{1}{F} \]
The optimal number of cores required for maximum performance under the energy budget is given by
\[ p_{opt} = O\left((\log n)^2\right) = O\left((\log W)^2\right) \]
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $O((\log W)^2)$. Note that n should be
exponentially greater than $\sqrt{p}$ for this asymptotic result to apply.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel binary-exchange algorithm on a 2-dimensional mesh
interconnect. The expression for the energy consumed by the parallel algorithm as a function of the frequency of the
cores, using the energy model is given by
$$E_{comp} = E_d\,n\log n\,\beta\,X^2$$
$$E_{leak} = E_l\cdot T_{active}\cdot X = E_l\,p\left(\frac{n}{p}\log n\,\beta + \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right)X\right)$$
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
$$T(p,X) = \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right) + \frac{n}{p}\log n\,\beta\,\frac{1}{X} \tag{10.3}$$
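We now frame an expression for the cost function $C(p,X)$ of the parallel algorithm using Eq. 5.5:
$$\begin{aligned}
C(p,X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p,X) \\
&= \alpha\left(E_d\,n\log n\,\beta\,X^2 + \tfrac{1}{4}e_{wh}\,n\sqrt{p}\right)
 + \alpha\left(E_l\,p\left(\frac{n}{p}\log n\,\beta + \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right)X\right)\right) \\
&\quad + \left(t_s\log p + 2t_w\frac{n}{\sqrt{p}}\right) + \frac{n}{p}\log n\,\beta\,\frac{1}{X}
\end{aligned}$$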
Asymptotic Analysis: The cost expression of the parallel binary-exchange algorithm on a 2-dimensional mesh
interconnect with p cores at frequency X can be approximated as
$$C(p,X) \approx O\left(n\log n\,X^2 + n\sqrt{p} + n\log n\,\frac{1}{pX}\right)$$
The optimal number of cores and the optimal frequency required for minimum cost vary with problem size as follows:
$$p_{opt} = O\left((\log n)^{6/7}\right) = O\left((\log W)^{6/7}\right)$$
$$F_{opt} = O\left((\log n)^{-2/7}\right) = O\left((\log W)^{-2/7}\right)$$
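The exponents above can be recovered with a short calculation on the approximated cost expression. The following is a sketch only: constant factors are dropped, p and X are treated as continuous variables, and $\Theta$ is used for the order of growth where the text writes $O$.

```latex
\begin{aligned}
\frac{\partial C}{\partial X} = 0
  &\;\Rightarrow\; 2\,n\log n\,X = \frac{n\log n}{p\,X^{2}}
  \;\Rightarrow\; X = \Theta\!\left(p^{-1/3}\right), \\
C\big|_{X = \Theta(p^{-1/3})}
  &= \Theta\!\left(n\log n\;p^{-2/3} + n\sqrt{p}\right), \\
\frac{\partial C}{\partial p} = 0
  &\;\Rightarrow\; \log n\;p^{-5/3} = \Theta\!\left(p^{-1/2}\right)
  \;\Rightarrow\; p_{opt} = \Theta\!\left((\log n)^{6/7}\right), \\
F_{opt} &= \Theta\!\left(p_{opt}^{-1/3}\right) = \Theta\!\left((\log n)^{-2/7}\right).
\end{aligned}
```

At $X = \Theta(p^{-1/3})$ the first and third cost terms have the same order, $n\log n\,p^{-2/3}$, which is why only two terms remain in the second line.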
10.2 Two-Dimensional Transpose Algorithm
The simplest transpose algorithm requires a single transpose operation over a two-dimensional array; hence, we call
this algorithm the two-dimensional transpose algorithm.
Assume that $\sqrt{n}$ is a power of two, and that the sequences of size n are arranged in a $\sqrt{n} \times \sqrt{n}$ two-dimensional
square array as described earlier. Recall that computing the FFT of a sequence of n points requires log n iterations.
The FFT computation in each column can proceed independently for $\log\sqrt{n}$ iterations without any column requiring
data from any other column. Similarly, in the remaining $\log\sqrt{n}$ iterations, computation proceeds independently in
each row without any row requiring data from any other row. If data of size n are arranged in a $\sqrt{n} \times \sqrt{n}$ array,
then an n-point FFT computation is equivalent to $\sqrt{n}$ independent $\sqrt{n}$-point FFT computations in the columns of the
array, followed by $\sqrt{n}$ independent $\sqrt{n}$-point FFT computations in the rows.
If the $\sqrt{n} \times \sqrt{n}$ array of data is transposed after computing the $\sqrt{n}$-point column FFTs, then the remaining part
of the problem is to compute the $\sqrt{n}$-point column-wise FFTs of the transposed matrix. The transpose algorithm uses
this property to compute the FFT in parallel by using a column-wise striped partitioning to distribute the $\sqrt{n} \times \sqrt{n}$
array of data among the p processes. The two-dimensional transpose algorithm works in three phases. In the first
phase, a $\sqrt{n}$-point FFT is computed for each column. In the second phase, the array of data is transposed. The third
and final phase is identical to the first phase, and involves the computation of $\sqrt{n}$-point FFTs for each column of the
transposed array. Note that the first and third phases of the algorithm do not require any interprocess communication.
In both these phases, all $\sqrt{n}$ points for each column-wise FFT computation are available on the same process. Only
the second phase requires communication for transposing the $\sqrt{n} \times \sqrt{n}$ matrix.
We now consider the more general case in which p processes are used and $1 \le p \le \sqrt{n}$. The $\sqrt{n} \times \sqrt{n}$ array
of data is striped into blocks, and one block of $\sqrt{n}/p$ rows is assigned to each processor. In the first and third phases
of the algorithm, each process computes $\sqrt{n}/p$ FFTs of size $\sqrt{n}$ each. The second phase transposes the $\sqrt{n} \times \sqrt{n}$
matrix, which is distributed among p processes with a one-dimensional partitioning. Recall from Section 7.1 that such
a transpose requires an all-to-all personalized communication (Section 3.2.3). Thus, the time and energy spent during
the second (communication) phase are, respectively, $(2t_s + t_w(n/p))(\sqrt{p} - 1)$ and $e_{wh}\,n(\sqrt{p} - 1)$. Moreover, the first
and third phases each take $(\sqrt{n}/p)(\sqrt{n}\log\sqrt{n})$ computation steps. In summary,
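$$\pi_{ncomp} = \frac{n}{p}\log n\,\beta$$
$$\pi_{tcomm} = \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right)$$
$$S_{comp} = n\log n\,\beta$$
$$E_{comm} = e_{wh}\,n\left(\sqrt{p} - 1\right)$$
where β is the number of cycles required for a complex multiplication and addition operation.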
Energy Scalability under Iso-performance
We now compute the energy scalability under iso-performance of the parallel transpose algorithm on a 2-dimensional
mesh interconnect. We first scale the computation steps of the critical path so that the performance of the parallel
algorithm matches the specified performance requirement. Assuming the performance target to be the time taken
by the best sequential algorithm on a single core (processor) operating at maximum frequency F , the new reduced
frequency at which all p cores should run is given by:
$$X = \frac{\pi_{ncomp}}{\text{Performance Target} - \pi_{tcomm}} = \frac{\frac{n}{p}\log n\,\beta}{\frac{n\log n\,\beta}{F} - \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right)}$$
The total active time at all the cores at the new frequency is given by $p \cdot n\log n\,\beta \cdot \frac{1}{F}$. Therefore, the expression for the
energy consumption of the parallel transpose algorithm as per equation 5.1 is given by
$$\begin{aligned}
E &= E_{comp} + E_{comm} + E_{leak} \\
&= E_d\,n\log n\,\beta\,X^2 + e_{wh}\,n\left(\sqrt{p} - 1\right) + E_l\,p\,n\log n\,\beta\,\frac{X}{F}
\end{aligned}$$
Asymptotic Analysis: As in the binary-exchange algorithm, here too $X \approx F/p$. Thus, the energy consumed
by the parallel transpose algorithm on a 2-dimensional mesh interconnect with p cores running at frequency X
is given by:
$$E = E_d\,n\log n\,\beta\,\frac{F^2}{p^2} + e_{wh}\,n\left(\sqrt{p} - 1\right) + E_l\,n\log n\,\beta \tag{10.4}$$
Since this equation is very similar to the one obtained for the binary-exchange algorithm, the same asymptotic
result holds for the transpose algorithm.
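The iso-performance calculation for the transpose algorithm can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name, the use of base-2 logarithms, and every numeric value in the example call are hypothetical and not taken from the text.

```python
import math

def transpose_iso_performance(n, p, F, beta, t_s, t_w, e_wh, E_d, E_l):
    """Return (X, E): reduced frequency and energy for the transpose algorithm.

    The performance target is the sequential time n*log(n)*beta/F.
    """
    work = n * math.log2(n) * beta                            # S_comp, total computation cycles
    target = work / F                                         # performance target
    t_comm = (2 * t_s + t_w * n / p) * (math.sqrt(p) - 1)     # pi_tcomm
    slack = target - t_comm                                   # time left for computation
    if slack <= 0:
        raise ValueError("communication alone exceeds the performance target")
    X = (work / p) / slack                                    # pi_ncomp / (target - pi_tcomm)
    E = (E_d * work * X**2                                    # E_comp
         + e_wh * n * (math.sqrt(p) - 1)                      # E_comm
         + E_l * p * target * X)                              # E_leak = E_l * T_active * X
    return X, E

# Hypothetical example values.
X, E = transpose_iso_performance(n=2**16, p=16, F=2.0, beta=2,
                                 t_s=10.0, t_w=0.1, e_wh=0.5, E_d=10.0, E_l=1.0)
```

With the communication parameters set to zero the function returns exactly X = F/p and the energy of Eq. 10.4, which matches the asymptotic approximation used in the text.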
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel transpose algorithm on 2-dimensional mesh intercon-
nect. The total active time (Tactive) at all the cores as a function of the frequency of the cores is given by
$$T_{active} = p\left(\pi_{ncomp}\,\frac{1}{X} + \pi_{tcomm}\right) = p\left(\frac{n}{p}\log n\,\beta\,\frac{1}{X} + \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right)\right)$$
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model is given by
$$E_{comp} = E_d\,n\log n\,\beta\,X^2$$
$$E_{leak} = E_l\cdot T_{active}\cdot X = E_l\,p\left(\frac{n}{p}\log n\,\beta + \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right)X\right)$$
Here, we fix the energy budget $E$ to be $E_{seq}$, the energy required by the sequential algorithm running on a single
core at maximum frequency $F$. Given this energy budget, the frequency $X$ at which the cores should run
is given by
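$$\begin{aligned}
X &\approx \frac{-E_l\,p\,\pi_{tcomm} + \sqrt{E_l^2\,p^2\,\pi_{tcomm}^2 + 4E_d\,S_{comp}\left(E - E_{comm} - E_l\,p\,\pi_{ncomp}\right)}}{2E_d\,S_{comp}} \\
&= \frac{-E_l\,p\left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right)}{2E_d\,n\log n\,\beta} + \frac{\sqrt{E_l^2\,p^2\left(2t_s + t_w\frac{n}{p}\right)^2\left(\sqrt{p} - 1\right)^2 + 4E_d\,n\log n\,\beta\left(E - e_{wh}\,n\left(\sqrt{p} - 1\right) - E_l\,n\log n\,\beta\right)}}{2E_d\,n\log n\,\beta}
\end{aligned}$$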
The time taken (inverse of performance) by the parallel algorithm as a function of the new frequency X is
$$\text{Time Taken} = \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right) + \frac{n}{p}\log n\,\beta\,\frac{1}{X} \tag{10.5}$$
Asymptotic Analysis: Note that, if $n \gg 2^{\sqrt{p}}$, then $X \approx F$. Thus, the time taken by the parallel transpose algorithm
running on p cores is given by:
$$\text{Time Taken} = \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right) + \frac{n}{p}\log n\,\beta\,\frac{1}{F}$$
The optimal number of cores required for maximum performance under the energy budget is given by
$$p_{opt} = O\left((\log n)^2\right) = O\left((\log W)^2\right)$$
Thus, the asymptotic energy bounded scalability of the parallel algorithm is $(\log n)^2$. Although the expression for
time taken differs from that of the binary-exchange algorithm, the asymptotic result for both algorithms remains
the same under the assumption that n is exponentially greater than $\sqrt{p}$.
Utility Based Scalability
We now evaluate the utility based scalability of the parallel transpose algorithm on 2-dimensional mesh interconnect.
The expression for the energy consumed by the parallel algorithm as a function of the frequency of the cores, using
the energy model is given by
$$E_{comp} = E_d\,n\log n\,\beta\,X^2$$
$$E_{leak} = E_l\cdot T_{active}\cdot X = E_l\,p\left(\frac{n}{p}\log n\,\beta + \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right)X\right)$$
The time taken (inverse of performance) by the parallel algorithm as a function of frequency is
$$T(p,X) = \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right) + \frac{n}{p}\log n\,\beta\,\frac{1}{X} \tag{10.6}$$
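We now frame an expression for the cost function $C(p,X)$ of the parallel algorithm using Eq. 5.5:
$$\begin{aligned}
C(p,X) &= \alpha\left(E_{comp} + E_{comm} + E_{leak}\right) + T(p,X) \\
&= \alpha\left(E_d\,n\log n\,\beta\,X^2 + e_{wh}\,n\left(\sqrt{p} - 1\right)\right)
 + \alpha\left(E_l\,p\left(\frac{n}{p}\log n\,\beta + \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right)X\right)\right) \\
&\quad + \left(2t_s + t_w\frac{n}{p}\right)\left(\sqrt{p} - 1\right) + \frac{n}{p}\log n\,\beta\,\frac{1}{X}
\end{aligned}$$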
Asymptotic Analysis: The cost expression of the parallel transpose algorithm on a 2-dimensional mesh interconnect
with p cores at frequency X can be approximated as
$$C(p,X) \approx O\left(n\log n\,X^2 + n\sqrt{p} + n\log n\,\frac{1}{pX}\right)$$
Since this equation is very similar to the one obtained for the binary-exchange algorithm, the same asymptotic
result holds for the transpose algorithm.
Chapter 11
Shared-Memory based Algorithms
The chapter focuses on the current generation of multicore architectures, specifically, multicore processors which use
a hierarchical shared memory. The Parallel Random Access Machine (PRAM) provides an abstract model of shared
memory architectures [53]. However, the PRAM model has no notion of a memory hierarchy; i.e., PRAM does
not model the differences in the access speeds between the private cache on a core and the shared memory that is
addressable by all cores. Thus, PRAMs cannot accurately model the actual execution time of algorithms on modern
multicore architectures. More recently, several models emphasizing memory hierarchies have been proposed [11, 9, 8].
In particular, the Parallel External Memory (PEM) model is an extension of the PRAM model which includes a single
level of memory hierarchy [8]. A more general model is the Multicore model [11] which models multiple levels of
the memory hierarchy. In our analysis, we choose to use the PEM model. Our choice is motivated by the fact that the
PEM model is simpler, and we believe it is sufficient to illustrate the trade-offs that we are interested in analyzing.
11.1 Parallel Addition Algorithm
We illustrate our methodology using a simple parallel addition algorithm. Initially, all n numbers are stored contigu-
ously in the main memory and the caches of all p cores are empty. Without loss of generality, we assume that the input
size n is a multiple of the number of cores p. In the first phase of the algorithm, each core transfers (n/p) numbers
from memory to their own caches and computes their sum. The transfer and summation of (n/p) numbers by each
core happens in a series of steps. In each step, a core transfers a block of numbers B from main memory to its cache
and computes the sum of B numbers and the result obtained in the previous step. At the end of the first phase, each
of p cores possesses a partial sum. With access to a distinct additional auxiliary block of main memory by each core,
in the CREW PEM model, the sum of p partial sums is efficiently computed in parallel in a tree fashion in log(p)
steps (for simplicity, we assume p to be a power of 2). In the first step, half of the cores transfer their partial sums
in parallel to their respective auxiliary blocks in main memory. The other half of the cores then read in parallel the
elements that were stored in the auxiliary blocks of the first half, and sum it with their local partial sum. The same
step is recursively performed until there is only one core left. At the end of the computation, one core will store the
sum of all n numbers.
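The two phases described above can be simulated sequentially to check the operation counts used in the later analysis. The sketch below is illustrative only: the function name, the pairing of sending and receiving cores in each tree step, and the example input are assumptions; a real PEM execution would run the per-core loops in parallel.

```python
def parallel_sum(values, p, B):
    """Simulate the two-phase parallel addition; return (total, memory accesses, additions)."""
    n = len(values)
    assert n > 0 and n % p == 0 and p & (p - 1) == 0, "n must be a multiple of p; p a power of 2"
    chunk = n // p
    mem_accesses = additions = 0
    partial = []
    # Phase 1: each core streams its n/p numbers in blocks of B and keeps a running sum.
    for core in range(p):
        local = values[core * chunk:(core + 1) * chunk]
        mem_accesses += -(-chunk // B)            # ceil(chunk / B) block reads
        total = local[0]
        for x in local[1:]:
            total += x
            additions += 1
        partial.append(total)
    # Phase 2: tree reduction in log2(p) steps through auxiliary blocks.
    active = list(range(p))
    while len(active) > 1:
        half = len(active) // 2
        senders, receivers = active[:half], active[half:]   # assumed pairing
        for s, r in zip(senders, receivers):
            mem_accesses += 2                     # one write by the sender, one read by the receiver
            partial[r] += partial[s]
            additions += 1
        active = receivers
    return partial[active[0]], mem_accesses, additions

total, mem, adds = parallel_sum(list(range(1, 65)), p=4, B=8)
```

With this pairing the last core ends up holding the total, and when B divides n/p the counts come out as (n/B) + 2(p − 1) memory accesses and n − 1 additions.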
Figure 11.1 depicts the execution of the parallel addition algorithm for the case p = 4.
Figure 11.1: Example scenario: Adding n numbers using 4 cores; execution of 4th core represents the critical path
The sequential algorithm for this problem is trivial: it takes (n/B) memory accesses and n − 1 additions to
compute the sum of n numbers. The running time and energy of the sequential algorithm are given by
$T_{seq} = \beta \cdot (n - 1) \cdot (1/F) + (n/B)(M_c/F)$ and $E_{seq} = E_d\,\beta\,(n - 1)F^2 + E_m(n/B)$, where $M_c$ is the number of cycles,
at maximum frequency, taken by a single shared memory access and β represents the number of cycles required
per addition.
Energy Scalability under Iso-Performance
Now we describe the steps needed to evaluate the energy scalability under iso-performance. In the above algorithm,
the critical path is easy to find: it is the execution path of the core that has the sum of all numbers at the end (Step 1).
We can see that there are n/(B · p) + log(p) memory reads, log(p) synchronization breaks and ((n/p)− 1 + log(p))
computation steps (Step 2). The total number of computational cycles of the parallel algorithm evaluates to ((n/p −
1) · p + (p − 1)) · β i.e., (n − 1) · β (Step 3). We next evaluate the total number of memory accesses by the parallel
algorithm (Step 4). It is trivial to see that the number of memory accesses for this parallel algorithm when running on
p cores is (n/B) + 2 · (p− 1). Note that in this algorithm, the memory complexity is both dependent on p and on the
input size n.
Now, we obtain a reduced frequency at which all p cores should run to complete in time T (Step 5):
$$X = F\cdot\frac{\left(\frac{n}{p} - 1 + \log(p)\right)\cdot\beta}{T\cdot F - \left(\frac{n}{B\cdot p} + 2\cdot\log(p)\right)\cdot M_c} \tag{11.1}$$
In order to achieve energy savings, we require 0 < X < F . Note that this restriction provides a lower bound on the
input size as a function of p and Mc.
We now evaluate the total active time at all the cores, running at new frequency X (Step 6). The total active time
at all the cores at new frequency is given by p((n− 1)(β/F ) + (n/B)(Mc/F )).
We derive the equation for energy consumption (Step 7). The energy consumed for computation, memory accesses
and leakage while the algorithm is running on p cores at reduced frequency X is:
$$E_{comp} = E_d\cdot(n - 1)\cdot\beta\cdot X^2 \tag{11.2}$$
$$E_{mem} = E_m\cdot\left((n/B) + 2\cdot(p - 1)\right) \tag{11.3}$$
$$E_{leak} = E_l\cdot T_{active}\cdot X \tag{11.4}$$
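Steps 5 to 7 can be evaluated directly for concrete parameter values. The following is a minimal sketch under stated assumptions: the function name and the use of base-2 logarithms are choices made here, the performance target T is taken to be the sequential running time, and the numbers in the example follow the normalization described in the text only loosely.

```python
import math

def addition_iso_performance(n, p, B, beta, M_c, F, E_d, E_m, E_l):
    """Return (X, E) for the parallel addition algorithm with T fixed to T_seq."""
    T = (beta * (n - 1) + (n / B) * M_c) / F               # T_seq, the performance target
    comp_steps = n / p - 1 + math.log2(p)                  # computation steps on the critical path
    mem_cycles = (n / (B * p) + 2 * math.log2(p)) * M_c    # memory cycles on the critical path
    denom = T * F - mem_cycles
    if denom <= 0:
        raise ValueError("memory accesses alone exceed the performance target")
    X = F * comp_steps * beta / denom                      # Eq. 11.1
    T_active = p * T                                       # total active time at all cores
    E_comp = E_d * (n - 1) * beta * X**2                   # Eq. 11.2
    E_mem = E_m * (n / B + 2 * (p - 1))                    # Eq. 11.3
    E_leak = E_l * T_active * X                            # Eq. 11.4
    return X, E_comp + E_mem + E_leak

# Hypothetical values: E_l = 1, E_d * F^2 = 10 * E_l, E_m = k * E_d * F^2 with k = 1000.
params = dict(n=10**6, B=1000, beta=2, M_c=1000, F=1.0, E_d=10.0, E_m=10000.0, E_l=1.0)
X1, E1 = addition_iso_performance(p=1, **params)
X16, E16 = addition_iso_performance(p=16, **params)
```

For p = 1 the formula returns X = F, as expected, and for these illustrative values running 16 cores at the reduced frequency uses less energy than one core at full frequency.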
Finally, Step 8 involves the analysis of the equation obtained. While we could differentiate the function with
respect to the number of cores to compute the minimum, this results in a rather cumbersome expression. Instead, we
analyze the graphs expressing energy scalability under iso-performance. We later provide an asymptotic analysis for
the same.
Note that the energy expression depends on many variables, such as n (input size), p (number of cores), β
(number of cycles per addition), Mc (number of cycles, at maximum frequency, taken by a single memory access),
Em (energy consumed by a single memory access), and F (maximum frequency of a core). We can simplify
these parameters without loss of generality. In most architectures, the number of cycles involved per addition is just
two (one cache transfer and one addition operation), so we assume β = 2. We also set the leakage energy constant to
El = 1. We express all energy values with respect to this normalized energy value.
In order to graph the required differential, we must make some assumptions about the other parameters. While
these assumptions compromise generality, we discuss the sensitivity of the analysis to a range of values for these
parameters. One such parameter is the energy consumed for a single cycle at maximum frequency as a multiple
of the leakage energy constant. We assume this ratio to be 10, i.e., that $E_d \cdot F^2 = 10 \cdot E_l$. It turns out that this parameter
is not very significant for our analysis; in fact, large variations in the parameter do not affect the shapes of the graphs
significantly. Another parameter, k, represents the ratio of the energy consumed for a single memory access, Em, to
the energy consumed for executing a single instruction at the maximum frequency. Thus, $E_m = k \cdot E_d \cdot F^2$. We fix
the required performance T to be the running time of the sequential algorithm at maximum frequency F and
analyze the sensitivity of our results to a range of values of both k and Mc.
Fig. 11.2 plots the energy E as a function of the input size and number of cores. We can see that for any input
size n, energy initially decreases with increasing p and later increases with increasing p. As with the message
passing algorithms in previous chapters, this behavior can be understood by the fact that the energy for computation
decreases with an increase in the number of cores running at reduced frequencies, and the energy for memory accesses
increases with an increase in the cores. Furthermore, we can see that increasing the input size leads to an increase in
the optimal number of cores required for minimum energy consumption.
Figure 11.2: Addition: Energy curve with energy on the Z axis, number of cores on the X axis and input size on the Y
axis with k = 1000, β = 2, Mc = 1000. The black curve on the XY plane is the plot of the optimal number of cores
required for minimum energy consumption versus input size ($10^7$ to $10^9$).
Figure 11.3: Sensitivity analysis: optimal number of cores on the Y axis, and k (ratio of the energy consumed for a
single memory access to the energy consumed for executing a single instruction at the maximum frequency) on
the X axis with input size $n = 10^8$ and Mc = 500.
We now consider the sensitivity of this analysis with respect to the ratio k. Fig. 11.3 plots the optimal number of
cores required for minimum energy consumption by varying k for an input size of $10^8$. The results show that for a given
input size in the range considered, the optimal number of cores required for minimum energy consumption decreases
with increasing k. (The curve approximates c/k, where c is some constant.) We observe that this trend remains the
same for the entire input range ($10^7$ to $10^9$).
The above graph analysis depicts the exact behavior of the optimal number of cores as function of problem size
for the given input range and appears to generalize to larger problem sizes. We now provide an analytic expression
for the asymptotic behavior of the optimal number of cores as a function of problem size. Note that, for the addition
algorithm, problem size (W) is same as the input size.
Asymptotic Analysis: Note that, if $n/p \gg \log p$, i.e., $n \gg p \cdot \log p$, then $X \approx F/p$. Thus, the energy consumed by
the parallel addition algorithm running on p cores at frequency X is given by:
$$E = E_d\,\beta\,(n - 1)\frac{F^2}{p^2} + E_m\cdot\left((n/B) + 2\cdot(p - 1)\right) + E_l\left((n - 1)\beta + \frac{nM_c}{B}\right) \tag{11.5}$$
The optimal number of cores required for minimum energy consumption is given by
$$p_{opt} = O\left(n^{1/3}\right) = O\left(W^{1/3}\right)$$
Thus, the asymptotic energy scalability under iso-performance of the parallel addition algorithm is $O(W^{1/3})$. Note
that n should be greater than p log p for this asymptotic result to apply.
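The cube-root growth follows from differentiating Eq. 11.5 with respect to p. As a sketch, treating p as a continuous variable and noting that the leakage term of Eq. 11.5 does not depend on p:

```latex
\begin{aligned}
\frac{\partial E}{\partial p} &= -\frac{2\,E_d\,\beta\,(n-1)\,F^{2}}{p^{3}} + 2\,E_m = 0 \\
\Rightarrow\quad p_{opt} &= \left(\frac{E_d\,\beta\,(n-1)\,F^{2}}{E_m}\right)^{1/3} = O\!\left(n^{1/3}\right)
\end{aligned}
```

The second derivative, $6E_d\,\beta\,(n-1)F^2/p^4$, is positive, so this stationary point is a minimum of the approximated energy.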
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel addition algorithm. The total active time (Tactive)
at all the cores as a function of the frequency of the cores is given by
$$T_{active} = p\left(\left(\frac{n}{Bp} + 2\log(p)\right)\frac{M_c}{F} + \left(\frac{n}{p} - 1 + \log(p)\right)\beta\,\frac{1}{X}\right)$$
Now, we frame an expression for energy consumption as a function of the frequency of the cores, using the energy
model (Step 6). The energy consumed for computation, memory accesses and leakage while the algorithm is running
on p cores at frequency X are given by the following equations:
$$E_{comp} = E_d\cdot(n - 1)\cdot\beta\cdot X^2$$
$$E_{mem} = E_m\cdot\left((n/B) + 2\cdot(p - 1)\right)$$
$$E_{leak} = E_l\cdot T_{active}\cdot X$$
where β is the number of cycles required per addition.
Given an energy budget E, the frequency X at which the cores should run (Step 7) is obtained by solving the
resulting quadratic equation $E = E_{comp} + E_{mem} + E_{leak}$. The solution (frequency) is as follows:
$$X = \frac{-b + \sqrt{b^2 + 4\cdot a\cdot c}}{2\cdot a}$$
where
$$a = E_d\cdot(n - 1)\cdot\beta$$
$$b = E_l\cdot\frac{M_c}{F}\left(\frac{n}{B} + 2p\log(p)\right)$$
$$c = E - E_m\cdot\left((n/B) + 2\cdot(p - 1)\right) - E_l\,\beta\left(n - p + p\log(p)\right)$$
The restriction that X > 0 provides an upper bound on the number of cores, as a function of n,E, F and Mc.
The time taken (inverse of performance) by the addition algorithm as a function of frequency of the cores X
(Step 8) is as follows:
$$\text{Time Taken} = \left(\frac{n}{B\cdot p} + 2\log(p)\right)\cdot\frac{M_c}{F} + \left(\frac{n}{p} - 1 + \log(p)\right)\cdot\beta\cdot\frac{1}{X}$$
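Steps 7 and 8 amount to solving one quadratic and evaluating one expression, which the sketch below does for the parallel addition algorithm. The function names, base-2 logarithms, and example values are assumptions for illustration; the sketch also does not cap X at the maximum frequency F, which a real system would require.

```python
import math

def addition_energy(X, n, p, B, beta, M_c, F, E_d, E_m, E_l):
    """E_comp + E_mem + E_leak for the parallel addition algorithm at frequency X."""
    lg = math.log2(p)
    t_active = p * ((n / (B * p) + 2 * lg) * M_c / F + (n / p - 1 + lg) * beta / X)
    return E_d * (n - 1) * beta * X**2 + E_m * (n / B + 2 * (p - 1)) + E_l * t_active * X

def addition_energy_bounded(E, n, p, B, beta, M_c, F, E_d, E_m, E_l):
    """Return (X, time taken) for energy budget E, using the coefficients a, b, c above."""
    lg = math.log2(p)
    a = E_d * (n - 1) * beta
    b = E_l * (M_c / F) * (n / B + 2 * p * lg)
    c = E - E_m * (n / B + 2 * (p - 1)) - E_l * beta * (n - p + p * lg)
    if c <= 0:
        raise ValueError("energy budget too small for this many cores")
    X = (-b + math.sqrt(b * b + 4 * a * c)) / (2 * a)
    time_taken = (n / (B * p) + 2 * lg) * M_c / F + (n / p - 1 + lg) * beta / X
    return X, time_taken

# Hypothetical values; the budget is the sequential energy E_seq at frequency F.
cfg = dict(n=10**6, B=1000, beta=2, M_c=1000, F=1.0, E_d=10.0, E_m=10000.0, E_l=1.0)
E_seq = cfg["E_d"] * cfg["beta"] * (cfg["n"] - 1) * cfg["F"] ** 2 + cfg["E_m"] * cfg["n"] / cfg["B"]
X, t16 = addition_energy_bounded(E_seq, p=16, **cfg)
```

Feeding the returned X back into `addition_energy` should reproduce the budget; the `ValueError` branch corresponds to the upper bound on the number of cores implied by the restriction X > 0.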
We now analyze the time taken expression obtained above for the addition algorithm to evaluate the energy bound.
First, we analyze the graphs expressing energy bounded scalability. Later, we provide an asymptotic analysis for the
same. We use the same assumptions that were used earlier for analyzing the energy scalability under iso-performance
of the parallel addition algorithm. We assume the energy budget to be the energy consumed by the sequential algorithm
(Eseq) at maximum frequency F . Substituting Eseq for E and considering the restriction on X yields an upper bound
on the number of cores of O(n/k).
Fig. 11.4 plots the time taken (the inverse of performance) as a function of input size and number of
cores. We can see that for any input size n, the time taken by the algorithm initially decreases with increasing p and
later increases with increasing p. As explained earlier, this behavior can be understood by the fact that the time
taken decreases with an increase in the number of cores, while the energy left for computation (the difference between the
energy budget and the energy used for memory accesses) decreases with increasing cores (making cores run slower).
The behavior shows that the optimal number of cores required for maximum performance is of the order of the input
size. Furthermore, we see that increasing the input size leads to an increase in the optimal number of cores.
We now consider the sensitivity of this analysis with respect to the ratio k. Fig. 11.5 plots the optimal number of
cores required for maximum performance by fixing the input size and varying k. The plot shows that for a fixed input
size, the optimal number of cores required for maximum performance decreases with increasing k, approximating a
c/k curve where c is some constant. We observe that this trend remains the same for the whole of the input range
($10^8$ to $10^9$).
Figure 11.4: Addition: performance curve with time taken on the Z axis, number of cores on the X axis and input size
on the Y axis with k = 100, β = 1, Mc = 500 and B = 1000. Time taken is plotted in units of 1/F where F is the
maximum frequency. Number of cores is plotted in units 9104. Input size is plotted from $3 \times 10^8$ to $10^9$ in units of
$10^8$. The black curve on the XY plane is the plot of optimal number of cores required for maximum performance with
varying input size.
The above graph analysis depicts the exact behavior of the optimal number of cores as function of problem size
for the given input range. We now provide an analytic expression for the asymptotic behavior of the optimal number
of cores as a function of problem size.
Asymptotic Analysis: Note that, if $n \gg p \cdot \log p$, then $X \approx F$. Thus, the time taken by the parallel addition algorithm
running on p cores at frequency F is given by:
$$\text{Time Taken} = \left(\frac{n}{B\cdot p} + 2\log(p)\right)\cdot\frac{M_c}{F} + \left(\frac{n}{p} - 1 + \log p\right)\beta\,\frac{1}{F}$$
The optimal number of cores required for maximum performance under the energy budget is given by
$$p_{opt} = O(n) = O(W)$$
Thus, the asymptotic energy bounded scalability of the parallel addition algorithm is O(W). Note that n should be
greater than p log p for this asymptotic result to apply.
Figure 11.5: Sensitivity analysis: optimal number of cores on the Y axis, and k (ratio of the energy consumed for a
single memory access to the energy consumed for executing a single instruction at the maximum frequency) on the
X axis with input size $n = 10^8$.
11.2 Parallel Prefix Sums
We now analyze a parallel prefix sums algorithm that has an optimal I/O complexity under the PEM model [8]. The
prefix sums algorithm we consider is cache oblivious.
The sequential algorithm is very similar to that of the addition algorithm, except that we output all intermediate
sums (prefix sums) along the execution. Thus, it takes 2(n/B) memory accesses and n − 1 additions to
compute the all-prefix-sums of n numbers. The running time and energy of the sequential algorithm are given by
$T_{seq} = \beta \cdot (n - 1) \cdot (1/F) + 2(n/B) \cdot (M_c/F)$ and $E_{seq} = E_d\,\beta\,(n - 1)F^2 + E_m(n/B)$, respectively.
Given an ordered set A of n elements, the all-prefix-sums operation returns an ordered set B of n elements, such
that ith element in B is the sum of all elements of A whose index is less than or equal to i. For this specific problem,
a PRAM based algorithm is also an efficient PEM algorithm as it is, without any modifications [8]. Without loss
of generality, we assume n to be multiple of p and p to be some power of 2. The PRAM algorithm by Ladner and
Fischer [62] is as follows: First, every core sums n/p elements serially. At this stage, every core contains a partial sum
(there are p partial sums). Next, every odd numbered core sends its partial sum to its adjacent (right) even numbered
core using auxiliary blocks (as in the parallel addition algorithm). The even numbered cores then sum their partial
sum with the obtained partial sum (from odd numbered cores) yielding p/2 partial sums at p/2 cores. The algorithm
then recursively evaluates the prefix sums of the p/2 partial sums. Next, with the exception of the pth core, every even
numbered core sends its computed prefix sum to its adjacent (right) odd numbered core. Odd numbered cores (except
for first) then sum their partial sums with the obtained prefix sums to obtain their prefix-sums. Thus all-prefix-sums
of the initial p partial sums of p cores are computed. Finally, every core distributes its prefix sum across n/p elements
serially to obtain the all-prefix-sums of the ordered set A.
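The recursive scheme above can be written as a sequential simulation to make the data flow concrete. This sketch is illustrative: the function names are hypothetical, cores are numbered 1 to p with list index i standing for core i + 1, and the final distribution step recovers each core's left offset by subtracting its own block sum from its inclusive prefix, which is one possible reading of the last step.

```python
from itertools import accumulate

def prefix_of_partials(s):
    """Inclusive prefix sums of the per-core partial sums s (length a power of 2)."""
    p = len(s)
    if p == 1:
        return s[:]
    # Each odd core (index 2j) sends its sum to the even core on its right (index 2j+1).
    paired = [s[2 * j] + s[2 * j + 1] for j in range(p // 2)]
    even_prefix = prefix_of_partials(paired)      # recursive call on the p/2 even cores
    out = [0] * p
    for j in range(p // 2):
        out[2 * j + 1] = even_prefix[j]           # even core already holds its prefix sum
        # Odd cores (except the first) add the prefix received from the even core on their left.
        out[2 * j] = s[2 * j] if j == 0 else even_prefix[j - 1] + s[2 * j]
    return out

def parallel_prefix_sums(values, p):
    """All-prefix-sums of values using p simulated cores."""
    n = len(values)
    assert n > 0 and n % p == 0 and p & (p - 1) == 0, "n must be a multiple of p; p a power of 2"
    chunk = n // p
    blocks = [values[i * chunk:(i + 1) * chunk] for i in range(p)]
    partial = [sum(b) for b in blocks]            # first step: serial sum of n/p elements per core
    core_prefix = prefix_of_partials(partial)
    result = []
    for block, own, inclusive in zip(blocks, partial, core_prefix):
        offset = inclusive - own                  # sum of all elements to the left of this block
        result.extend(offset + x for x in accumulate(block))
    return result

data = list(range(1, 33))
out = parallel_prefix_sums(data, p=4)
```

The result can be compared against `itertools.accumulate` applied to the whole input, which is the sequential all-prefix-sums.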
Energy Scalability under Iso-performance
Now we describe the steps needed to evaluate the energy scalability under iso-performance. Since p is some power
of 2, the critical path of the algorithm is the execution path of the pth core (Step 1). We see that there are
2 · n/(B · p) + 2 · log(p) − 1 memory reads, 2 · log(p) − 1 synchronization breaks and (2 · (n/p − 1) + 2 · log(p) − 1) computation steps
(Step 2). The total number of computational cycles of the parallel algorithm is (2 · (n/p − 1) · p + (2p − log(p) − 2)) · β,
which is (2n − log(p) − 2) · β (Step 3). We next observe that the total number of memory accesses required by the
parallel algorithm when running on p cores is 2 · (n/B) + 2 · (2p − log(p) − 2) (Step 4).
Now, we obtain a reduced frequency at which all p cores should run to complete in time T (Step 5):
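$$X = F\cdot\frac{\left(2\cdot\left(\frac{n}{p} - 1\right) + 2\cdot\log(p) - 1\right)\cdot\beta}{T\cdot F - \left(2\cdot\frac{n}{B\cdot p} + 4\cdot\log(p) - 2\right)\cdot M_c} \tag{11.6}$$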
where β represents number of cycles required per addition or subtraction. In order to achieve energy savings, we
require 0 < X < F . Note that this restriction provides a lower bound on the input size as a function of p and Mc.
We now evaluate the total active time at all the cores, running at the new frequency X (Step 6). Since all the cores are
active all along the critical path, the total active time is:
$$T_{active} = \left(2\cdot\frac{n}{B\cdot p} + 4\cdot\log(p) - 2\right)\cdot\frac{M_c}{F}\cdot p + \left(2\cdot\left(\frac{n}{p} - 1\right) + 2\cdot\log(p) - 1\right)\cdot\frac{\beta}{X}\cdot p$$
We frame an equation for energy consumption (Step 7). The energy consumed for computation, memory accesses
and leakage while the algorithm is running on p cores at reduced frequency X is:
$$E_{comp} = E_d\cdot(2n - \log(p) - 2)\cdot\beta\cdot X^2 \tag{11.7}$$
$$E_{mem} = E_m\cdot\left(2\cdot\frac{n}{B} + 2\cdot(2p - \log(p) - 2)\right) \tag{11.8}$$
$$E_{leak} = E_l\cdot T_{active}\cdot X \tag{11.9}$$
We use the same assumptions that were used earlier for analyzing the energy scalability of the parallel addition
algorithm. In particular, we assume the required performance to be the running time of the sequential algorithm at
maximum frequency F .
Fig. 11.6 shows that for any input size n, energy initially decreases with increasing p and later increases
with increasing p. As explained earlier, this behavior can be understood by the fact that the energy for computation
decreases with an increase in the number of cores running at reduced frequencies, and the energy for memory accesses
increases with increasing cores. However, for the same valuations of the constants and the input range, the optimal number
of cores for minimal energy consumption for the parallel addition algorithm is far smaller than that for the parallel
prefix sums algorithm. Furthermore, we can see that increasing the input size leads to an increase in the optimal number
of cores required for minimum energy consumption.
Figure 11.6: Prefix-Sums: Energy curve with energy on the Z axis, number of cores on the X axis and input size on
the Y axis with k = 1000, β = 2, Mc = 1000. The black curve on the XY plane is the plot of the optimal number of
cores required for minimum energy consumption with varying input size ($10^7$ to $10^9$).
Figure 11.7: Sensitivity analysis: optimal number of cores on the Y axis, and k (ratio of the energy consumed for a single
memory access to the energy consumed for executing a single instruction at the maximum frequency) on the X axis
with input size $n = 10^8$ and Mc = 500.
We now consider the sensitivity of this analysis with respect to the ratio k. Fig. 11.7 plots the optimal number of
cores required for minimum energy consumption by fixing the input size ($10^8$) and varying k. The plot shows that for
a fixed input size, the optimal number of cores required for minimum energy consumption decreases with increasing
k. Furthermore, we observe that this trend remains the same for the whole of the input range ($10^7$ to $10^9$). Note that this
graph differs from the one obtained for the parallel addition algorithm.
The above graph analysis depicts the exact behavior of the optimal number of cores as function of problem size
for the given input range and appears to generalize to larger problem sizes. We now provide an analytic expression for
the asymptotic behavior of the optimal number of cores as a function of problem size. Note that, for the prefix sums
algorithm, problem size (W) is same as the input size.
Asymptotic Analysis: Note that, if $n/p \gg \log p$, i.e., $n \gg p \cdot \log p$, then $X \approx F/p$. Thus, the energy consumed by
the parallel prefix sums algorithm running on p cores at frequency X is given by:
$$E = E_d\,\beta\,(2n - \log(p) - 2)\frac{F^2}{p^2} + E_m\left(2\frac{n}{B} + 2(2p - \log(p) - 2)\right) + E_l\left(\beta(n - 1) + \frac{2nM_c}{B}\right) \tag{11.10}$$
The optimal number of cores required for minimum energy consumption is given by
$$p_{opt} = O\left(n^{1/3}\right) = O\left(W^{1/3}\right)$$
Thus, the asymptotic energy scalability under iso-performance of the parallel prefix sums algorithm is $O(W^{1/3})$. Note
that n should be greater than p log p for this asymptotic result to apply.
Energy Bounded Scalability
We now evaluate the energy bounded scalability of the parallel prefix sums algorithm. The total active time (Tactive)
at all the cores as a function of the frequency of the cores is given by
$$T_{active} = \left(2\cdot\frac{n}{B\cdot p} + 4\cdot\log(p) - 2\right)\cdot p\cdot\frac{M_c}{F} + \left(2\cdot\left(\frac{n}{p} - 1\right) + 2\cdot\log(p) - 1\right)\cdot p\cdot\frac{\beta}{X}$$
where the first term represents the total active time spent by all the cores during memory transfers and the second term
represents the total active time spent by all cores performing computations.
Now, we frame an expression for energy consumption as a function of the frequency of the cores, using the energy
model (Step 6). The energy consumed for computation, memory accesses and leakage while the algorithm is running
on p cores at frequency X are given by the following equations:
$$E_{comp} = E_d\cdot(2n - \log(p) - 2)\cdot\beta\cdot X^2$$
$$E_{mem} = E_m\cdot\left(2\cdot(n/B) + 2\cdot(2p - \log(p) - 2)\right)$$
$$E_{leak} = E_l\cdot T_{active}\cdot X$$
where β is the number of cycles required per addition.
Given an energy budget E, the frequency X at which the cores should run (Step 7) is obtained by solving the
resulting quadratic equation $E = E_{comp} + E_{mem} + E_{leak}$. The solution (frequency) is as follows:
$$X = \frac{-b + \sqrt{b^2 + 4\cdot a\cdot c}}{2\cdot a}$$
where
$$a = E_d\cdot(2n - \log(p) - 2)\cdot\beta$$
$$b = E_l\cdot\left(2\cdot\frac{n}{B\cdot p} + 4\cdot\log(p) - 2\right)\cdot p\cdot\frac{M_c}{F}$$
$$c = E - E_m\cdot\left(2\cdot(n/B) + 2\cdot(2p - \log(p) - 2)\right) - E_l\cdot\beta\cdot\left(2\cdot\left(\frac{n}{p} - 1\right) + 2\cdot\log(p) - 1\right)\cdot p$$
The restriction that X > 0 provides an upper bound on the number of cores that can be used to increase the perfor-
mance, as a function of n,E, F and Mc.
The time taken (inverse of performance) by the parallel prefix sum algorithm as a function of frequency of the
cores X (Step 8) is as follows:
$$\text{Time Taken} = \left(2\cdot\frac{n}{B\cdot p} + 4\cdot\log(p) - 2\right)\cdot\frac{M_c}{F} + \left(2\cdot\left(\frac{n}{p} - 1\right) + 2\cdot\log(p) - 1\right)\cdot\frac{\beta}{X}$$
We use the same assumptions that were used earlier for analyzing the energy-bounded scalability of the parallel
addition algorithm. In particular, we assume the energy budget E to be the energy consumed by the sequential
algorithm at maximum frequency F. The energy required by the sequential algorithm is given by:
$$E_{seq} = E_d\cdot\beta\cdot(n - 1)\cdot F^2 + E_m\cdot\frac{n}{B}$$
Figure 11.8: Prefix-Sums: performance curve with time taken on the Z axis, number of cores on the X axis and input
size on the Y axis with k = 100, β = 1, Mc = 500 and B = 1000. Time taken is plotted in units of 1/F where F is the
maximum frequency. Number of cores is plotted in units of 9104. Input size is plotted from $3 \times 10^8$ to $10^9$ in units of
$10^8$. The black curve on the XY plane is the plot of optimal number of cores required for maximum performance with
varying input size.
Fig. 11.8 shows that for any input size n, the time taken is a V-shaped function of increasing p: initially the time
taken by the algorithm decreases with increasing p, but as p keeps increasing, the time taken starts to
increase with p. This behavior is the same as that of the parallel addition algorithm. However, for the same valuations of the
constants and the same input range, the optimal number of cores for maximum performance for the parallel addition algorithm is
far smaller than that for the parallel prefix sum algorithm. Furthermore, we can see that increasing the input size
leads to an increase in the optimal number of cores required for maximum performance.
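The V-shaped behavior can be reproduced numerically from the equations of Steps 6 to 8. The following is a minimal sketch, not the procedure used to produce Fig. 11.8: the function names are hypothetical, logarithms are assumed to be base 2, the energies are normalized so that Ed = 1 and F = 1, Em is set to k · Ed · F², and the leakage constant El = 0.1 is an assumed value chosen only for illustration.

```python
import math

def prefix_sum_frequency(n, p, E, Ed, Em, El, beta, Mc, B, F):
    """Positive root X of a*X^2 + b*X - c = 0, or None if the budget E is too small."""
    lg = math.log2(p)                              # assumption: log is base 2
    mem_steps = 2 * n / (B * p) + 4 * lg - 2       # memory term of the time equation
    comp_steps = 2 * (n / p - 1) + 2 * lg - 1      # computation term of the time equation
    a = Ed * (2 * n - lg - 2) * beta
    b = El * mem_steps * p * Mc / F
    c = E - Em * (2 * (n / B) + 2 * (2 * p - lg - 2)) - El * beta * comp_steps * p
    if c <= 0:                                     # X > 0 requires c > 0
        return None
    return (-b + math.sqrt(b * b + 4 * a * c)) / (2 * a)

def prefix_sum_time(n, p, X, beta, Mc, B, F):
    """Time taken on p cores running at frequency X (Step 8)."""
    lg = math.log2(p)
    mem_steps = 2 * n / (B * p) + 4 * lg - 2
    comp_steps = 2 * (n / p - 1) + 2 * lg - 1
    return mem_steps * Mc / F + comp_steps * beta / X

def sweep_cores(n, core_counts, k=100.0, El=0.1, beta=1.0, Mc=500.0, B=1000.0,
                F=1.0, Ed=1.0):
    """Map each feasible core count to its time under the sequential energy budget."""
    Em = k * Ed * F ** 2                           # k = Em / (Ed * F^2)
    E = Ed * beta * (n - 1) * F ** 2 + Em * n / B  # E_seq used as the budget
    times = {}
    for p in core_counts:
        X = prefix_sum_frequency(n, p, E, Ed, Em, El, beta, Mc, B, F)
        if X is not None:                          # skip core counts the budget cannot afford
            times[p] = prefix_sum_time(n, p, X, beta, Mc, B, F)
    return times
```

Calling `sweep_cores(10**8, [2**i for i in range(1, 21)])` and taking the core count with the smallest time gives an interior minimum: the time falls as p grows, then rises again as the memory energy term consumes the budget and forces the frequency down.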
We now consider the sensitivity of this analysis with respect to the ratio k. Fig. 11.9 plots the optimal number of
cores required for maximum performance, fixing the input size ($10^8$) and varying k. The plot shows that
for a fixed input size, the optimal number of cores required for maximum performance decreases with increasing k.
Furthermore, we observe that this trend remains the same for the whole of the input range ($10^7$ to $10^9$). Note that the
structure of the sensitivity curve is very similar to that of the parallel addition algorithm. Looking at the energy and
the sensitivity curves, we conjecture that the asymptotic nature of the required metric is the same for both algorithms.
The above graph analysis depicts the exact behavior of the optimal number of cores as a function of problem size
for the given input range. We now provide an analytic expression for the asymptotic behavior of the optimal number
of cores as a function of problem size.
Figure 11.9: Sensitivity analysis: optimal number of cores on the Y axis, and k (ratio of the energy consumed for a single
memory access to the energy consumed for executing a single instruction at the maximum frequency) on the X axis,
with input size $n = 10^8$.
Asymptotic Analysis: Note that if $n \gg p \cdot \log p$, then X ≈ F. Thus, the time taken by the parallel prefix sums
algorithm running on p cores at frequency F is given by:
\[ \text{Time Taken} = \left( \frac{2 \cdot n}{B \cdot p} + 4 \cdot \log(p) - 2 \right) \cdot \frac{M_c}{F} + \left( 2 \cdot \left( \frac{n}{p} - 1 \right) + 2 \cdot \log(p) - 1 \right) \cdot \frac{\beta}{F} \]
The optimal number of cores required for maximum performance under the energy budget is given by
\[ p_{opt} = O(n) = O(W) \]
Thus, the asymptotic energy-bounded scalability of the parallel prefix sums algorithm is O(W). Note that n should
be greater than p log p for this asymptotic result to apply.
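One way to see where the linear bound comes from is to minimize the time expression at X ≈ F directly. The following is a sketch under the assumption that the logarithms are base 2 and that p is treated as a continuous variable; the symbols T(p) and p* are introduced here for illustration.

```latex
\begin{align*}
T(p) &= \left(\frac{2n}{Bp} + 4\log_2 p - 2\right)\frac{M_c}{F}
      + \left(2\left(\frac{n}{p} - 1\right) + 2\log_2 p - 1\right)\frac{\beta}{F} \\
\frac{dT}{dp} &= \frac{1}{F}\left[-\frac{2n}{p^2}\left(\frac{M_c}{B} + \beta\right)
      + \frac{4M_c + 2\beta}{p \ln 2}\right] = 0 \\
p^{*} &= n \cdot \ln 2 \cdot \frac{M_c/B + \beta}{2M_c + \beta} = \Theta(n)
\end{align*}
```

With Mc = 500, B = 1000 and β = 1, this gives p* ≈ 0.001 · n. Because the approximation X ≈ F is only valid while n is much greater than p log p, this stationary point is best read as an upper estimate on the optimal number of cores, which is consistent with the O(n) bound stated above.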
11.3 Parallel Merge Sort
We consider a pipelined d-way mergesort algorithm developed by Lars Arge et al. for the PEM model [8]. It is similar
to the sorting algorithm of Goodrich [36].
A d-way mergesort partitions the input into d subsets, sorts each subset recursively and then merges them. To
achieve optimal parallel speedup, the sorted subsets are sampled and these sample sets are merged first. Each level of
the recursion is performed in multiple rounds with each round producing progressively finer samples until eventually a
list of samples is the whole sorted subset of the corresponding level of recursion. The samples retain information about
the relative order of the other elements of the set through rankings. These rankings allow for a quick merge of future
finer samples at higher levels of recursion. Each round is pipelined up the recursion tree to maximize parallelism (see
[8] for details).
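The basic partition, recurse and merge structure of a d-way mergesort can be sketched as follows. This is a sequential illustration only: it does not implement the sampling, ranking or pipelining of the PEM algorithm of [8], and the function name and the use of the standard-library `heapq.merge` for the d-way merge are choices made here for the example.

```python
import heapq

def d_way_mergesort(items, d=4):
    """Sequential d-way mergesort: split into up to d runs, sort each, merge them."""
    n = len(items)
    if n <= 1:
        return list(items)
    fanout = max(2, min(d, n))        # never use more runs than elements
    size = -(-n // fanout)            # ceiling division: length of each run
    runs = [d_way_mergesort(items[i:i + size], d) for i in range(0, n, size)]
    return list(heapq.merge(*runs))   # d-way merge of the sorted runs
```

For example, `d_way_mergesort([5, 3, 9, 1], d=3)` splits the input into runs of two elements, sorts each run recursively and merges them into `[1, 3, 5, 9]`.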
Theorem 11.3.1 (Lars Arge et al. [8]) Given a set S of n items stored contiguously in memory, one can sort S in the
CREW PEM model using $p \le n/B^2$ processors, each having a private cache of size $M = B^{O(1)}$, in
$O\left(\frac{n}{p \cdot B} \log_{M/B} \frac{n}{B}\right)$ parallel memory accesses, $O\left(\frac{n}{p} \log n\right)$ internal computational complexity per
processor and O(n) total memory.
Energy Scalability under Iso-Performance
Recall that the best known sequential mergesort algorithm in the external memory (EM) model
takes $O\left(\frac{n}{B} \log_{M/B} \frac{n}{B}\right)$ memory accesses and O(n log n) computational steps [3]. Under the assumptions of
Theorem 11.3.1, we now evaluate the energy scalability under iso-performance of the parallel mergesort algorithm, assuming
the required performance to be that of the sequential algorithm. Since all cores are active until the end of the
computation, the critical path of the algorithm is the execution sequence on any one of the cores. By Theorem 11.3.1,
the critical path comprises $O\left(\frac{n}{p \cdot B} \log_{M/B} \frac{n}{B}\right)$ memory accesses and $O\left(\frac{n}{p} \log n\right)$ computation steps (Step 2). Note that
the reduced frequency at which all p cores should run to complete in time Tseq decreases with p.
Since each core performs an equal amount of computation, the total number of computational cycles of the parallel
algorithm is O(n log(n)) (not dependent on p). Thus, by Equation 3.3, the energy consumed for computation Ecomp by
the algorithm decreases with p. However, since the total number of memory accesses at all p cores is $O\left(\frac{n}{B} \log_{M/B} \frac{n}{B}\right)$
(also not dependent on p), the energy consumed for memory accesses Emem does not vary with increasing cores.
Further, by Equation 3.4, the leakage energy Eleak dissipated at the cores running at the reduced frequency also decreases
with p. Using the above three observations, we see that the energy consumed by the parallel algorithm to maintain
the same performance as the sequential algorithm decreases with increasing cores under the restriction $p \le n/B^2$.
One could easily generalize the above observation to a general class of parallel algorithms that possess optimal
computational cost and optimal I/O complexity.
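The three observations can be checked numerically. The sketch below is illustrative only: it takes the O(·) bounds with unit constants, assumes base-2 logarithms for the computation steps, and borrows the energy forms from the prefix sums equations earlier in this chapter (Ecomp = Ed · cycles · β · X², Eleak = El · Tactive · X), since Equations 3.3 and 3.4 are not reproduced here. The function names and parameter values are hypothetical.

```python
import math

def mergesort_costs(n, M, B):
    """Leading-order totals with unit constants (an assumption): accesses, steps."""
    accesses = (n / B) * math.log(n / B, M / B)   # (n/B) * log_{M/B}(n/B)
    steps = n * math.log2(n)                      # n * log n
    return accesses, steps

def iso_performance_energy(n, p, Ed, Em, El, beta, Mc, M, B, F):
    """Frequency and total energy on p cores when matching the sequential time."""
    A, C = mergesort_costs(n, M, B)
    t_seq = A * Mc / F + C * beta / F             # sequential time at maximum frequency
    # Per-core critical path: (A/p)*Mc/F + (C/p)*beta/X must equal t_seq.
    X = (C * beta / p) / (t_seq - (A / p) * Mc / F)
    e_comp = Ed * C * beta * X ** 2               # total cycles do not depend on p
    e_mem = Em * A                                # total accesses do not depend on p
    t_active = A * Mc / F + C * beta / X          # active time summed over all p cores
    e_leak = El * t_active * X
    return X, e_comp + e_mem + e_leak
```

With, for instance, n = 2^24, B = 64 and M = 2^12 (so that p ≤ n/B² = 4096), evaluating the function for p = 1, 2, 4, ... shows the frequency and the total energy both decreasing as p grows, while the memory term stays fixed.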
Energy Bounded Scalability
Under the assumptions of Theorem 11.3.1, we now evaluate the energy-bounded scalability of the parallel
mergesort algorithm, assuming the energy budget to be that of the sequential algorithm. Since all cores are
active until the end of computation, the critical path of the algorithm is the execution sequence on any one of the cores.
By Theorem 11.3.1, the critical path comprises $O\left(\frac{n}{p \cdot B} \log_{M/B} \frac{n}{B}\right)$ memory accesses and $O\left(\frac{n}{p} \log n\right)$ computation steps
(Step 2). We see that the total number of memory accesses at all p cores, which is $O\left(\frac{n}{B} \log_{M/B} \frac{n}{B}\right)$, is not dependent on p.
Furthermore, since each core performs an equal amount of computation, the total number of computational cycles and the
total active time of the parallel algorithm, which evaluate to O(n log(n)) and $O\left(\frac{n}{B} \log_{M/B} \frac{n}{B}\right) \cdot \frac{M_c}{F} + O(n \log(n)) \cdot \frac{1}{X}$
respectively, are also not dependent on p. Thus, given an energy budget E, the frequency X with which the cores
should run, obtained by solving the quadratic equation $E = E_{comp} + E_{mem} + E_{leak}$, is not dependent on the
number of cores (p). Using the above observations and Equation 5.6, we see that the time taken by the parallel
algorithm, under a fixed energy budget, decreases with increasing p under the restriction $p \le n/B^2$. One could easily
generalize the above observation to a general class of parallel algorithms that possess optimal computational cost
and optimal I/O complexity.
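The argument can be illustrated with a small numerical sketch. It assumes unit constants for the O(·) bounds, base-2 logarithms for the computation steps, the same energy forms as in the prefix sums analysis of this chapter, and a default budget equal to a sequential energy of the form Ed · β · C · F² + Em · A (mirroring the form of Eseq used for prefix sums). The function names and parameter values are hypothetical.

```python
import math

def mergesort_costs(n, M, B):
    """Leading-order totals with unit constants (an assumption): accesses, steps."""
    accesses = (n / B) * math.log(n / B, M / B)   # (n/B) * log_{M/B}(n/B)
    steps = n * math.log2(n)                      # n * log n
    return accesses, steps

def energy_bounded_time(n, p, Ed, Em, El, beta, Mc, M, B, F, E=None):
    """Frequency allowed by budget E and the resulting time on p cores."""
    A, C = mergesort_costs(n, M, B)
    if E is None:
        E = Ed * beta * C * F ** 2 + Em * A       # assumed sequential energy as budget
    # E = Ed*C*beta*X^2 + Em*A + El*(A*Mc/F + C*beta/X)*X  gives  a*X^2 + b*X - c = 0
    a = Ed * C * beta
    b = El * A * Mc / F
    c = E - Em * A - El * beta * C
    if c <= 0:
        raise ValueError("energy budget too small for any positive frequency")
    X = (-b + math.sqrt(b * b + 4 * a * c)) / (2 * a)   # no p appears in a, b or c
    time = (A * Mc / F + C * beta / X) / p              # work is split evenly over p cores
    return X, time
```

Because none of a, b and c contains p, the frequency is the same for every core count, and under these assumptions doubling p halves the time taken, within the restriction p ≤ n/B².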
11.4 Summary
We observe that the performance and energy behavior of algorithms in the shared memory model that we use is
significantly different from the behavior of comparable algorithms in message-passing architectures (see our earlier
analysis for message-passing architectures). This is because in this work, we model communication as well as local
accesses through a hierarchical shared memory. In fact, the results in this paper would be similar if we were to replace
communication through shared memory with message passing, while modeling the local memory hierarchy at each
core.
Because of memory bottlenecks, as we observed earlier, shared memory architectures cannot scale arbitrarily [77].
Thus an asymptotic analysis of algorithms on a shared memory model would not necessarily be very insightful.
Chapter 12
Conclusion
The central thesis of this dissertation is that examining the relation between the performance of parallel applications
and their energy requirements on parallel processors can be facilitated by analyzing a set of metrics, each for a different
purpose. Towards this goal, we introduced four energy-performance trade-off metrics, namely: (a) energy consumption
under fixed performance for energy conservation in time-constrained applications, (b) energy-bounded performance
for improving performance in energy-constrained applications, (c) energy efficiency for energy-efficient computing,
and (d) a cost metric for reducing the monetary cost associated with running the application.
We have considered the optimization problems (corresponding to the four metrics) for problem instances, repre-
sented as DAGs, of parallel applications. Using sophisticated constraint solvers like SMT solvers, we have been able
to find optimal schedules (task placement and frequency selection) for real world parallel application DAG structures.
Here, we also evaluated the sensitivity of these optimal schedules to the architectural parameters such as the ratio of
energy consumed for single message communication to the energy consumed for single computation. Furthermore,
we have studied the optimal schedules of the metrics for their sensitivity towards various DAG parameters such as
density, width and regularity.
We have also extended the traditional notion of scalability to our energy-performance trade-off metrics and
proposed corresponding scalability metrics. We proposed methodologies to evaluate the scalability metrics, and illustrated
them by analyzing different genres of algorithms, such as the parallel addition algorithm, dense matrix algorithms, sorting
algorithms, graph-based algorithms, and fast Fourier transform algorithms for the message passing model; and
addition, prefix sums and Cole's mergesort for the shared memory model. Because of memory bottlenecks, shared memory
architectures cannot scale arbitrarily [77]. Hence, an asymptotic scalability analysis of algorithms on a shared memory
model would not necessarily be very insightful.
The scalability results of parallel algorithms on the message passing model are summarized in Table 12.1. We can
observe that both the matrix transpose algorithms are not scalable, i.e., the optimal number of cores for metric optimization
does not change with the problem size. The other interesting observation is that all the sorting algorithms considered
are less scalable compared to other genres of parallel algorithms. Moreover, among the sorting algorithms, three
algorithms, namely bitonic sort, efficient quicksort and sample sort, have the same scalability characteristics, which are
better than those of the other sorting algorithms.

Parallel Algorithm                           Energy scalability            Energy bounded    Utility based
                                             under iso-performance         scalability       scalability
Parallel Addition                            O((W / log W)^{1/3})          O(W^{2/3})        O(W^{4/5} / (log W)^{3/5})
Matrix Transpose (checkerboard)              O(1)                          O(1)              O(1)
Matrix Transpose (rowwise)                   O(1)                          O(1)              O(1)
Matrix Vector Multiplication (checkerboard)  O(W^{1/5} / (log W)^{2/5})    O(W^{2/3})        O(W^{3/7} / (log W)^{6/7})
Matrix Vector Multiplication (rowwise)       O(W^{1/6})                    O(W^{2/3})        O(W^{3/10})
Matrix Multiplication                        O(W^{2/15})                   O(W^{2/3})        O(W^{2/7})
Bitonic Sort                                 O((log W)^{2/5})              O((log W)^2)      O((log W)^{6/7})
Odd-even Transposition Sort                  O((log W)^{1/3})              O(log W)          O((log W)^{3/5})
Naive Quicksort                              O((log W / log log W)^{2/5})  O((log W)^{2/3})  O((log W / log log W)^{6/7})
Efficient Quicksort                          O((log W)^{2/5})              O((log W)^2)      O((log W)^{6/7})
Sample Sort                                  O((log W)^{2/5})              O((log W)^2)      O((log W)^{6/7})
Minimum Spanning Tree                        O(W^{1/6} / (log W)^{1/3})    O(W^{1/3})        O(W^{3/10} / (log W)^{3/5})
Single-Source Shortest Paths                 O(W^{1/6} / (log W)^{1/3})    O(W^{1/3})        O(W^{3/10} / (log W)^{3/5})
All-Pairs Shortest Paths                     O(W^{2/15} / (log W)^{2/5})   O(W^{4/9})        O(W^{2/7} / (log W)^{6/7})
Fast Fourier Transform                       O((log W)^{2/5})              O((log W)^2)      O((log W)^{6/7})

Table 12.1: Scalability metrics of parallel algorithms on 2D mesh interconnect
The work presented in this thesis could be extended in many ways. Currently, the technique used for computing
optimal schedules for metric optimization does not scale to DAG applications with more than 30-40 tasks. The
scalability can be improved in two ways. First, SMT solvers specifically designed for scheduling problems can be
used. In particular, SMT solvers can be tuned to improve performance on scheduling problem instances. SMT
solvers targeting specific kinds of applications have been constructed [30]. Constructing SMT solvers with a focus on
scheduling problems is an open problem. Second, an efficient encoding of the problem along with abstraction can be used to
scale the process of computing the optimal schedules. There have been recent developments in using abstraction in SAT
solving [26] and bit-vector arithmetic solving [12]. It would be interesting to see how abstraction techniques can be
used to scale the process of solving scheduling problem instances.
In this thesis, we made a few architectural assumptions to simplify the energy-performance trade-off
analysis of parallel algorithms. We assumed that the cores are homogeneous, but given the limitations imposed
on parallel performance by Amdahl's law, cores are likely to be heterogeneous. Currently, it is not clear how to
analyze algorithms for energy-performance trade-offs under a heterogeneous model of the parallel machine. Another
assumption we make is that the communication time cannot be scaled. It would be interesting to consider more complex
communication models where both bandwidth and latency can be scaled by the application, thereby providing an
energy-performance trade-off for the message transfers.
Currently, in our scalability analysis of parallel algorithms, specifically in the case of energy scalability under
iso-performance and energy bounded scalability, we fix the energy and performance budget to be that of the best
sequential algorithm. It would be interesting to consider the generic case where both energy and performance budgets
are some arbitrary functions of the input size. This analysis might provide insights on how our energy scalability
metrics depend on the corresponding budgets.
References
[1] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. On communication latency in pram computations. In SPAA,
pages 11–21, 1989.
[2] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of prams. Theor. Comput. Sci.,
71(1), 1990.
[3] Alok Aggarwal and Jeffrey S. Vitter. The input/output complexity of sorting and related problems. Commun.
ACM, 31(9):1116–1127, 1988.
[4] G. Agha and W. Kim. Parallel programming and complexity analysis using actors. Massively Parallel Program-
ming Models, 0:68, 1997.
[5] Susanne Albers and Hiroshi Fujiwara. Energy-efficient algorithms for flow time minimization. ACM Transac-
tions on Algorithms, 3(4), 2007.
[6] George Alfs and Nick Knupffer. Intel corporation’s multicore architecture briefing. Intel Press Room, 2008.
[7] Murali Annavaram, Ed Grochowski, and John Paul Shen. Mitigating amdahl’s law through epi throttling. In
ISCA, pages 298–309, 2005.
[8] Lars Arge, Michael T. Goodrich, Michael J. Nelson, and Nodari Sitchinava. Fundamental parallel algorithms for
private-cache chip multiprocessors. In SPAA, pages 197–206, 2008.
[9] Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Bradley C. Kuszmaul. Concurrent cache-oblivious
b-trees. In SPAA, pages 228–237, 2005.
[10] Brad D. Bingham and Mark R. Greenstreet. Computation with energy-time trade-offs: Models, algorithms and
lower-bounds. In ISPA, pages 143–152, 2008.
[11] Guy E. Blelloch, Rezaul Alam Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran, Shimin Chen, and
Michael Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In SODA,
pages 501–510, 2008.
[12] Randal E. Bryant, Daniel Kroening, Joël Ouaknine, Sanjit A. Seshia, Ofer Strichman, and Bryan A. Brady. An
abstraction-based decision procedure for bit-vector arithmetic. STTT, 11(2):95–104, 2009.
[13] Kirk W. Cameron, Kirk Pruhs, Sandy Irani, Partha Ranganathan, and David Brooks. Report of the science of
power management workshop. Workshop on the Science of Power Management, 2009.
[14] Haijun Cao, Hai Jin, Xiaoxin Wu, Song Wu, and Xuanhua Shi. Dagmap: Efficient scheduling for dag grid
workflow job. In GRID, pages 17–24, 2008.
[15] Ho-Leung Chan, Jeff Edmonds, and Kirk Pruhs. Speed scaling of processes with arbitrary speedup curves on a
multiprocessor. In SPAA, pages 1–10, 2009.
[16] AP Chandrakasan, S. Sheng, and RW Brodersen. Low-Power CMOS Digital Design. IEEE Journal of Solid-State
Circuits, 27(4):473–484, 1992.
[17] S. Cho and R. Melhem. Corollaries to Amdahl’s Law for Energy. Computer Architecture Letters, 7(1):25–28,
2008.
[18] Wu chun Feng, Avery Ching, and Chung-Hsing Hsu. Green supercomputing in a desktop box. In IPDPS, pages
1–8, 2007.
[19] Wu chun Feng and Thomas Scogland. The green500 list: Year one. In IPDPS, pages 1–7, 2009.
[20] Richard Cole and Ofer Zajicek. The apram: Incorporating asynchrony into the pram model. In SPAA, pages
169–178, 1989.
[21] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. Von Eicken. LogP:
Towards a Realistic Model of Parallel Computation. ACM SIGPLAN Notices, 28(7):1–12, 1993.
[22] M. Curtis-Maury, A. Shah, F. Blagojevic, D.S. Nikolopoulos, B.R. de Supinski, and M. Schulz. Prediction
Models for Multi-dimensional Power-Performance Optimization on Many Cores. In PACT, pages 250–259,
2008.
[23] Sekhar Darbha and Dharma P. Agrawal. Optimal scheduling algorithm for distributed-memory machines. IEEE
Trans. Parallel Distrib. Syst., 9(1):87–96, 1998.
[24] Leonardo Mendonça de Moura and Nikolaj Bjørner. Z3: An efficient smt solver. In TACAS, pages 337–340,
2008.
[25] Derek L. Eager, Edward D. Lazowska, and John Zahorjan. Adaptive load sharing in homogeneous distributed
systems. IEEE Trans. Softw. Eng., 12:662–675, May 1986.
[26] Niklas Eén, Alan Mishchenko, and Nina Amla. A single-instance incremental sat formulation of proof- and
counterexample-based abstraction. In FMCAD, pages 181–188, 2010.
[27] E. N. Elnozahy, Michael Kistler, and Ramakrishnan Rajamony. Energy conservation policies for web servers. In
USENIX Symposium on Internet Technologies and Systems, 2003.
[28] E. N. Elnozahy, Michael Kistler, and Ramakrishnan Rajamony. Energy-efficient server clusters. In Proceedings
of the 2nd international conference on Power-aware computer systems, PACS’02, pages 179–197, 2003.
[29] Steven Fortune and James Wyllie. Parallelism in random access machines. In STOC, pages 114–118, 1978.
[30] Vijay Ganesh and David L. Dill. A decision procedure for bit-vectors and arrays. In CAV, pages 519–531, 2007.
[31] R. Ge, X. Feng, and K.W. Cameron. Performance-Constrained Distributed DVS Scheduling for Scientific Ap-
plications on Power-Aware Clusters. In SC. IEEE Computer Society Washington, DC, USA, 2005.
[32] Soraya Ghiasi. Aide de camp: asymmetric multi-core design for dynamic thermal management. PhD thesis,
University of Colorado at Boulder, Boulder, CO, USA, 2004. AAI3136618.
[33] Phillip B. Gibbons. A more practical pram model. In SPAA, pages 158–168, 1989.
[34] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The qrqw pram: accounting for contention in
parallel algorithms. In Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, pages
638–648, 1994.
[35] Ricardo Gonzalez and Mark A. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal
of Solid-State Circuits, 31:1277–1284, 1995.
[36] Michael T. Goodrich. Communication-efficient parallel sorting. SIAM J. Comput., 29(2):416–432, 1999.
[37] Ian Gorton, Paul Greenfield, Alexander S. Szalay, and Roy Williams. Data-intensive computing in the 21st
century. IEEE Computer, 41(4):30–32, 2008.
[38] Ed Grochowski, Ronny Ronen, John Paul Shen, and Hong Wang. Best of both latency and throughput. In
ICCD, pages 236–243, 2004.
[39] Seongmoo Heo and Krste Asanovic. Power-optimal pipelining in deep submicron technology. In ISLPED, pages
218–223, 2004.
[40] Chung-Hsing Hsu and Wu chun Feng. A feasibility analysis of power awareness in commodity-based high-
performance clusters. In CLUSTER, pages 1–10, 2005.
[41] Chung-Hsing Hsu and Wu chun Feng. A power-aware run-time system for high-performance computing. In SC,
page 1, 2005.
[42] Sascha Hunold. http://www.loria.fr/ suter/dags.html.
[43] Greenpeace International. http://www.greenpeace.org/raw/content/international/press/reports/make-it-green-
cloud-computing.pdf, 2010.
[44] Sandy Irani, Sandeep K. Shukla, and Rajesh K. Gupta. Online strategies for dynamic power management in
systems with multiple power-saving states. ACM Trans. Embedded Comput. Syst., 2(3):325–346, 2003.
[45] C. Isci, A. Buyuktosunoglu, C.Y. Cher, P. Bose, and M. Martonosi. An Analysis of Efficient Multi-Core Global
Power Management Policies: Maximizing Performance for a Given Power Budget. In MICRO, volume 9, pages
347–358, 2006.
[46] Prasad Jogalekar and C. Murray Woodside. Evaluating the scalability of distributed systems. IEEE Trans.
Parallel Distrib. Syst., 11(6):589–603, 2000.
[47] Philo Juang, Qiang Wu, Li-Shiuan Peh, Margaret Martonosi, and Douglas W. Clark. Coordinated, distributed,
formal energy management of chip multiprocessors. In ISLPED, pages 127–130, 2005.
[48] Ismail Kadayif, Mahmut T. Kandemir, and Ibrahim Kolcu. Exploiting processor workload heterogeneity for
reducing energy consumption in chip multiprocessors. In DATE, pages 1158–1163, 2004.
[49] Ismail Kadayif, Mahmut T. Kandemir, and Ugur Sezer. An integer linear programming based approach for
parallelizing applications in on-chip multiprocessors. In DAC, pages 703–708, 2002.
[50] Jaeyeon Kang and Sanjay Ranka. Dvs based energy minimization algorithm for parallel machines. In IPDPS,
pages 1–12, 2008.
[51] Jaeyeon Kang and Sanjay Ranka. Dynamic slack allocation algorithms for energy minimization on parallel
machines. J. Parallel Distrib. Comput., 70(5):417–430, 2010.
[52] Krishna Kant. Toward a science of power management. Computer, 42(9):99 –101, Sept. 2009.
[53] M. Karp and V. Ramachandran. Parallel algorithms for shared-memory machines. Handbook of Theoretical
Computer Science, pages 869–941, 1990.
[54] David King, Ishfaq Ahmad, and Hafiz Fahad Sheikh. Stretch and compress based re-scheduling techniques
for minimizing the execution times of dags on multi-core processors under energy constraints. International
Conference on Green Computing, 0:49–60, 2010.
[55] V.A. Korthikanti and G. Agha. Avoiding energy wastage in parallel applications. In Green Computing Confer-
ence, 2010 International, pages 149 –163, 2010.
[56] Vijay Anand Korthikanti and Gul Agha. Analysis of Parallel Algorithms for Energy Conservation in Scalable
Multicore Architectures. In International Conference on Parallel Processing (ICPP), 2009.
[57] Vijay Anand Korthikanti and Gul Agha. Energy bounded scalability analysis of parallel algorithms. Technical
Report, Department of Computer Science, University of Illinois at Urbana Champaign, 2009.
[58] Vijay Anand Korthikanti and Gul Agha. Towards optimizing energy costs of algorithms for shared memory
architectures. In SPAA, 2010.
[59] Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, and Dean M. Tullsen. Single-isa
heterogeneous multi-core architectures: The potential for processor power reduction. In MICRO, pages 81–92,
2003.
[60] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. Single-isa
heterogeneous multi-core architectures for multithreaded workload performance. In ISCA, pages 64–75, 2004.
[61] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing: Design and Analysis of
Algorithms. Benjamin-Cummings, 1994.
[62] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. J. ACM, 27(4):831–838, 1980.
[63] Megan Langer. Demonstrations of the 48-core research prototype. Intel Press Room, 2009.
[64] Young Choon Lee and Albert Y. Zomaya. Minimizing energy consumption for precedence-constrained applica-
tions using dynamic voltage scaling. In CCGRID, pages 92–99, 2009.
[65] J. Li and J.F. Martinez. Power-Performance Considerations of Parallel Computing on Chip Multiprocessors.
ACM Transactions on Architecture and Code Optimization, 2(4):1–25, 2005.
[66] J. Li and J.F. Martinez. Dynamic Power-Performance Adaptation of Parallel Computation on Chip Multiproces-
sors. In HPCA, pages 77–87, 2006.
[67] Jian Li and José F. Martínez. Dynamic power-performance adaptation of parallel computation on chip
multiprocessors. In HPCA, pages 77–87, 2006.
[68] Jian Li, José F. Martínez, and Michael C. Huang. The thrifty barrier: Energy-aware synchronization in
shared-memory multiprocessors. In HPCA, pages 14–23, 2004.
[69] Yingmin Li, Benjamin C. Lee, David Brooks, Zhigang Hu, and Kevin Skadron. Cmp design space exploration
subject to physical constraints. In HPCA, pages 17–28, 2006.
[70] Min Yeol Lim and Vincent W. Freeh. Determining the minimum energy consumption using dynamic voltage
and frequency scaling. In IPDPS, pages 1–8, 2007.
[71] Yishay Mansour, Noam Nisan, and Uzi Vishkin. Trade-offs between communication throughput and parallel
time. J. Complexity, 15(1):148–166, 1999.
[72] Alain J. Martin. Towards an energy complexity of computation. Information Processing Letters, 39:181–187,
2001.
[73] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of prams by parallel machines with
restricted granularity of parallel memories. Acta Inf., 21:339–374, 1984.
[74] M. P. Mills. The internet begins with coal. Green Earth Society, USA, 1999.
[75] Andreas Moshovos, Gokhan Memik, Babak Falsafi, and Alok N. Choudhary. Jetty: Filtering snoops for reduced
energy consumption in smp servers. In HPCA, pages 85–96, 2001.
[76] Abdellatif Mtibaa, Bouraoui Ouni, and Mohamed Abid. An efficient list scheduling algorithm for time placement
problem. Computers & Electrical Engineering, 33(4):285–298, 2007.
[77] R. Murphy. On the effects of memory latency and bandwidth on supercomputer application performance. IISWC,
pages 35–43, Sept. 2007.
[78] Daniel Nussbaum and Anant Agarwal. Scalability of parallel machines. Commun. ACM, 34(3):57–61, 1991.
[79] John Oliver, Ravishankar Rao, Paul Sultana, Jedidiah R. Crandall, Erik Czernikowski, Leslie W. Jones IV, Diana
Franklin, Venkatesh Akella, and Frederic T. Chong. Synchroscalar: A multiple clock domain, power-aware,
tile-based embedded processor. In ISCA, pages 150–161, 2004.
[80] S. Park, W. Jiang, Y. Zhou, and S. Adve. Managing Energy-Performance Tradeoffs for Multithreaded Applica-
tions on Multiprocessor Architectures. In SIGMETRICS, pages 169–180. ACM New York, NY, USA, 2007.
[81] Karthick Rajamani and Charles Lefurgy. On evaluating request-distribution schemes for saving energy in server
clusters. In ISPASS, pages 111–122, 2003.
[82] Barry Rountree, David K. Lowenthal, Bronis R. de Supinski, Martin Schulz, Vincent W. Freeh, and Tyler K.
Bletsch. Adagio: making dvs practical for complex hpc applications. In ICS, pages 460–469, 2009.
[83] Barry Rountree, David K. Lowenthal, Bronis R. de Supinski, Martin Schulz, Vincent W. Freeh, and Tyler K.
Bletsch. Adagio: making dvs practical for complex hpc applications. In ICS, pages 460–469, 2009.
[84] Barry Rountree, David K. Lowenthal, Shelby Funk, Vincent W. Freeh, Bronis R. de Supinski, and Martin Schulz.
Bounding energy consumption in large-scale mpi programs. In SC, page 49, 2007.
[85] Barry Rountree, David K. Lowenthal, Shelby Funk, Vincent W. Freeh, Bronis R. de Supinski, and Martin Schulz.
Bounding energy consumption in large-scale mpi programs. In SC, page 49, 2007.
[86] Vivek Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA,
USA, 1989.
[87] V. Singh, V. Kumar, G. Agha, and C. Tomlinson. Scalability of Parallel Sorting on Mesh Multicomputer. In
Parallel Processing: 5th International Symposium: Papers., volume 51, page 92. IEEE, 1991.
[88] Xian-He Sun and Lionel M. Ni. Scalable problems and memory-bounded speedup. J. Parallel Distrib. Comput.,
19(1):27–37, 1993.
[89] Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling
for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst., 13(3):260–274, 2002.
[90] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990.
[91] Leslie G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science, Volume
A: Algorithms and Complexity (A), pages 943–972, 1990.
[92] Lizhe Wang, Gregor von Laszewski, Jai Dayal, and Fugang Wang. Towards energy aware scheduling for prece-
dence constrained parallel tasks in a cluster with dvfs. In CCGRID, pages 368–377, 2010.
[93] X. Wang and S.G. Ziavras. Performance-Energy Tradeoffs for Matrix Multiplication on FPGA-Based Mixed-
Mode Chip Multiprocessors. In ISQED, pages 386–391, 2007.
[94] Adam Wierman, Lachlan L. H. Andrew, and Ao Tang. Power-aware speed scaling in processor sharing systems.
In INFOCOM, pages 2007–2015, 2009.
[95] Albert Y. Zomaya, Chris Ward, and Benjamin S. Macey. Genetic scheduling for parallel processor systems:
Comparative studies and performance issues. IEEE Trans. Parallel Distrib. Syst., 10(8):795–812, 1999.
