A figure of merit for describing the performance of scaling of
  parallelization by Végh, János et al.
arXiv:1606.02686v3 [cs.PF] 22 Jul 2016
A figure of merit for describing
the performance of scaling of parallelization
János Végh (a), Péter Molnár (b), József Vásárhelyi (a)
(a) University of Miskolc, Hungary
(b) PhD School of Informatics, University of Debrecen, Hungary
Abstract
With the spread of multi- and many-core processors, it is an increasingly
typical task to re-implement source code originally written for a single
processor so that it runs on more than one core. Since this is a serious
investment, it is important to decide how much effort pays off, and whether
the resulting implementation performs as well as it could. Amdahl’s law
provides a theoretical upper limit for the performance gain reachable
through parallelizing the code, but it requires detailed knowledge of the
architecture of the program code, does not consider the housekeeping
activity needed for parallelization, and cannot tell how well the actual
stage of the parallelized implementation performs. The present paper
suggests a quantitative measure for that goal. This figure of merit is
derived experimentally, from measured running times and the number of
threads/cores. It can be used to quantify the parallelization technology
used, the connection between the computing units, the acceleration
technology under the given conditions, or the performance of the software
team/compiler.
Keywords: multi-core, parallelization, performance, scaling, figure of merit
1. Introduction
Computer manufacturing technology is no longer able to produce faster
processors, see Agarwal et al. (2000). The crisis of computing experienced
since about the year 2000, see Fuller and Millett (2011), has increased the
demand for parallel computing.

Email addresses: J.Vegh@uni-miskolc.hu (János Végh), pmolnar@lib.unideb.hu
(Péter Molnár), vajo@mazsola.iit.uni-miskolc.hu (József Vásárhelyi)
Preprint submitted to JPDC July 7, 2018

On the hardware side: “Processor and network architectures are making rapid
progress with more and more cores being integrated into single processors
and more and more machines getting connected with increasing bandwidth.
Processors become heterogeneous and reconfigurable”, see S(o)OS project
(2010). On the software side: “parallel programs ... are notoriously
difficult to write, test, analyze, debug, and verify, much more so than the
sequential versions”, see Yang et al. (2014). In addition, typical
real-life programs show complex parallelization behavior, see Yavits et al.
(2014), and even apparently massively parallel algorithms can behave
extremely inefficiently, see Pingali et al. (2011). Different merits have
been derived for characterizing parallel systems, from simple speedup to
price per performance, see Karp and Flatt (1990).
Because of all these difficulties, demands and possibilities, countless
developments are running or planned that re-implement existing code,
written with a single processor in mind, for the already ubiquitous
multi-core processors. To find out whether it is worth investing in such
parallelization efforts, as well as to decide where to stop the
development, one needs some quantitative measure of the parallelism. Even
today, in the “multicore era”, see Hill and Marty (2008), the performance
is described by (a modified version of) Amdahl’s law, see Amdahl, G. M.
(1967). Unfortunately, Amdahl’s law provides no support for the goals
mentioned: it needs information on the code architecture, and it makes
assumptions which are no longer valid for modern accelerated processors and
for applications that heavily use operating system services. However, by
scrutinizing the conditions and inverting Amdahl’s formula, it is possible
to derive such a figure of merit.
2. Limitations of parallelization
2.1. Amdahl’s law
Amdahl’s considerations focus on the fact that some parts (Pi) of a code
can be parallelized, some (Si) must remain sequential. He also mentioned
that data housekeeping causes some overhead, which in his paper was esti-
mated to be in the range 20% to 40%, and that the nature of this overhead
appears to be sequential.
[Figure 1: Illustrating Amdahl’s law for idealistic and realistic cases.
Each panel plots time (arbitrary units, 0 to 10) against processor (P0 to
P4) for serial and for parallel execution of the segments S1, P1, P2, P3,
S2.
A) Classic Amdahl case: $T_{total} = \sum_i S_i + \sum_i P_i$;
$\alpha = \frac{\sum_i P_i}{\sum_i S_i + \sum_i P_i} = 0.75$;
$S^{-1} = (1 - \alpha) + \alpha/k = 0.5$.
B) Realistic Amdahl case, with additional segments labelled C1, C2 and W:
$T_{total} = \sum_i S_i + \sum_i C_i + P_{max}$;
$S^{-1} = 0.7 \Rightarrow \alpha_{eff} = \frac{k}{k-1} \cdot \frac{S-1}{S}
= \frac{3}{2} \cdot \frac{10/7 - 1}{10/7} = 0.45$.]

Although Amdahl just wanted to draw the attention to the limitations of
the single-processor approach applied to large-scale computing, his
followers also provided a widely used formula, focussing on the
parallelizable part of the code. As the left side of Fig 1 demonstrates,
the usual interpretation implies three essential restrictions:
• the parallelized parts are of equal length in terms of execution time
• the housekeeping (controlling the parallelization, passing parameters,
exchanging messages, etc.) has no cost in terms of execution time
• the number of parallelizable chunks coincides with the number of
available computing resources
Essentially, this is why Amdahl’s law represents a theoretical upper limit
for the parallelization gain. The left side of Fig 1 shows the idealistic
case, where the original process in the single-processor system comprises
the sequential-only parts Si and the parallelizable parts Pi. One can also
see (right side) that the control components Ci are of the same nature as
Si, the non-parallelizable components. This also means that even in the
idealistic case when the Si are negligible, the Ci will represent a bound
for parallelization. From the figure, the meaning of α is the fraction of
time (in terms of the time needed for completely sequential processing) in
which the helper processors are utilized. When the task is scaled to
several processors, the goal is obviously to maximize the utilization of
the helper processors.
In the realistic case (shown on the right side of Fig 1), however, the
parallelized parts are not of equal length: even if they contain exactly
the same instructions, the hardware of modern processors may execute them
in considerably different times (for examples, see the operation of
hardware accelerators inside a core or the network operation between
processors). In addition, the time required to control the parallelization
is not negligible, and it varies. Here αeff provides a value for the
average utilization of the helper cores. Obviously, unused cores and an
unbalanced load of the cores degrade this average utilization. To
characterize effects like sharing the processing between different numbers
of helper cores (the performance of the scaling) or using different
hardware conditions, one needs a quantitative figure of merit.
The figure also calls attention to the fact that a static correspondence
between program chunks and processing units can be very inefficient: all
assigned processing units must wait for the delayed unit, and capacity is
also lost if the number of computing resources exceeds the number of
parallelized chunks.
2.2. Factors affecting parallelism
Usually, Amdahl’s law is expressed with the formula

$$ S^{-1} = (1 - \alpha) + \frac{\alpha}{k} \tag{1} $$

where k is the number of parallelized code fragments, α is the ratio of the
parallelizable part to the total sequential execution time, and S is the
measurable speedup. The assumption can be visualized as follows: (assuming
many processing units) in the α fraction of the running time the processors
are processing, and in the (1−α) fraction they are waiting. That is, α
describes how much, on average, the processors are utilized. Having those
data, the resulting speedup can be estimated.
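Estimating the speedup from Eq. (1) can be sketched in a few lines of Python. The function name `amdahl_speedup` and its argument checks are illustrative choices, not part of the original text; the example values are the ones of the classic case in Fig 1.

```python
def amdahl_speedup(alpha: float, k: int) -> float:
    """Speedup S from Eq. (1): 1/S = (1 - alpha) + alpha / k."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 / ((1.0 - alpha) + alpha / k)


if __name__ == "__main__":
    # Classic case of Fig 1: alpha = 0.75 shared among k = 3 units.
    print(amdahl_speedup(0.75, 3))
    # For very large k the speedup approaches the bound 1 / (1 - alpha).
    print(amdahl_speedup(0.75, 10**6))
```

The second call illustrates why the sequential fraction, not the number of units, bounds the reachable speedup.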
For a system under test, where α is not known a priori, one can derive
from the measurable speedup S an effective parallelization factor as

$$ \alpha_{eff} = \frac{k}{k-1} \cdot \frac{S-1}{S} \tag{2} $$

where S is now the measured speedup, and k is the number of available
cores. Obviously, for the classical case, α = αeff, which simply means that
in the idealistic case the actually measurable effective parallelization
reaches the theoretically possible one. In other words, α describes a
system whose architecture is completely known, while αeff describes a
system whose performance is known from experiments. Put yet another way, α
is the theoretical upper limit, which can hardly be reached, while αeff is
the actual experimental value, which reflects the complex architecture and
the actual conditions. It is interesting to note that αeff is an absolute
measure of utilizing the available processing capacity, see section 3.
Numerically, (1−αeff) equals the f value established theoretically by
Karp and Flatt (1990).
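For readers who want the intermediate algebra, the inversion of Eq. (1) that leads to Eq. (2) can be written out as follows. The last line rewrites (1−αeff) in the form usually quoted for the experimentally determined serial fraction of Karp and Flatt (1990); the step comments are added for illustration.

```latex
\begin{align*}
S^{-1} &= (1-\alpha) + \frac{\alpha}{k}
  && \text{Eq. (1)} \\
1 - S^{-1} &= \alpha\left(1 - \frac{1}{k}\right)
  && \text{collect the terms in } \alpha \\
\frac{S-1}{S} &= \alpha\,\frac{k-1}{k} \\
\alpha_{eff} &= \frac{k}{k-1}\cdot\frac{S-1}{S}
  && \text{Eq. (2), with } S \text{ measured} \\[4pt]
1 - \alpha_{eff} &= \frac{k-S}{(k-1)\,S}
  = \frac{1/S - 1/k}{1 - 1/k}
\end{align*}
```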
αeff can then be used to refer back to Amdahl’s classical assumption even
in the realistic case, when the parallelized chunks have different lengths
and the overhead of organizing the parallelization is not negligible. Note
that for real tasks a kind of Sequential/Parallel Execution Model, see
Yavits et al. (2014), has to be applied, which cannot use the simple
picture reflected by α. Nevertheless, αeff gives a good measure of the
degree of parallelization over the duration of the execution of the
process, and it can be compared to the results of the technology-dependent
parametrized formulas.
With our notations, in the classical Amdahl case on the left side in Fig. 1

$$ S = \frac{\sum_i S_i + \sum_i P_i}{\sum_i S_i + \max_i P_i} = 2 \tag{3} $$

and

$$ \alpha = \alpha_{eff} = \frac{\sum_i P_i}{\sum_i S_i + \sum_i P_i} = \frac{3}{4} \tag{4} $$

Now we can compare the effective parallelization in the two cases shown
in Fig. 1. In the realistic case S = 10/7, which results in

$$ \alpha_{eff} = \frac{3}{2} \cdot \frac{10/7 - 1}{10/7} = 0.45 \tag{5} $$

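The two numerical cases of Fig. 1 can be re-checked with a short Python sketch of Eq. (2). The function name and the three timing constants are illustrative; the times (10, 5 and 7 arbitrary units) are taken from the values S⁻¹ = 0.5 and S⁻¹ = 0.7 given in the figure.

```python
def effective_parallelization(speedup: float, k: int) -> float:
    """alpha_eff from Eq. (2): k/(k-1) * (S-1)/S."""
    if k < 2:
        raise ValueError("Eq. (2) is defined only for k >= 2")
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return k / (k - 1) * (speedup - 1.0) / speedup


K = 3                # number of parallelized chunks in Fig. 1
T_SEQUENTIAL = 10.0  # single-processor time, arbitrary units
T_CLASSIC = 5.0      # parallel time in the classic case (1/S = 0.5)
T_REALISTIC = 7.0    # parallel time in the realistic case (1/S = 0.7)

alpha_classic = effective_parallelization(T_SEQUENTIAL / T_CLASSIC, K)
alpha_realistic = effective_parallelization(T_SEQUENTIAL / T_REALISTIC, K)

if __name__ == "__main__":
    print(f"classic:   alpha_eff = {alpha_classic:.2f}")    # 0.75
    print(f"realistic: alpha_eff = {alpha_realistic:.2f}")  # 0.45
```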
As seen, the overhead and the different durations of the parallelized parts
reduce the effective parallelization drastically relative to the
theoretically reachable value. Fig 2 gives an impression of the effect of
the computer system behaviour on the effective parallelization. The middle
region (marked by balls) is the one mentioned by Amdahl as the typical
range of overhead. The asterisk in the figure shows the “working point”
corresponding to the values used in Fig 1.
[Figure 2: Behavior of the effective parallelization αeff (vertical axis)
as a function of the overhead ratio (relative to the parallelizable payload
execution length) and of the ratio of the sequential part (relative to the
total sequential execution time). Plotted for k = 3, max_i Pi = 0.25.]
One can see that the effective parallelization drops quickly with both
increasing overhead and increasing sequential part of the program. This
fact draws attention to the idea that by decreasing either the control time
or the sequential-only fraction of the code (or both), and by utilizing the
otherwise wasted processing capacity, a serious gain in the effective
parallelization can be reached. This was experienced in the “dynamic”
architecture by Hill and Marty (2008).
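The dependence on overhead and sequential ratio can be explored with a small Python sketch. The timing model below is an assumption made for illustration, not necessarily the one behind Fig 2: the single-processor time is normalized to 1, the parallelizable part is split into k equal chunks, and the overhead runs sequentially and lasts `overhead_ratio` times one chunk. With a sequential ratio of 0.25 and k = 3 this model gives 0.75 without overhead and 0.45 with an overhead ratio of 0.8, matching the two values in Fig 1.

```python
def alpha_eff_model(seq_ratio: float, overhead_ratio: float, k: int) -> float:
    """alpha_eff for an assumed timing model (single-processor time = 1).

    Assumptions of this sketch: the parallelizable part (1 - seq_ratio)
    is split into k equal chunks, and the overhead is executed
    sequentially and equals overhead_ratio times one chunk length.
    """
    chunk = (1.0 - seq_ratio) / k
    t_parallel = seq_ratio + (1.0 + overhead_ratio) * chunk
    speedup = 1.0 / t_parallel
    return k / (k - 1) * (speedup - 1.0) / speedup  # Eq. (2)


if __name__ == "__main__":
    # Rows: overhead ratio; columns: sequential ratio 0.0, 0.25, 0.5.
    for overhead in (0.0, 0.2, 0.4, 0.8):
        row = [alpha_eff_model(seq, overhead, 3) for seq in (0.0, 0.25, 0.5)]
        print(overhead, [round(a, 3) for a in row])
```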
2.3. Different implementations of parallelism
The timing analysis in Fig 1 can be applied to different kinds of
parallelization, from processor-level parallelization (instruction or data
level parallelization, in the nanoseconds range) to OS-level
parallelization (including thread-level parallelization using several
processors or cores, in the microseconds range), to network-level
parallelization (between networked computers, like grids, in the
milliseconds range). The principles are the same, see David et al. (2013),
independently of the kind of implementation. In agreement with Yavits et
al. (2014), housekeeping overhead is always present (and mainly depends on
the architectural solution), and it remains a key question, although the
main focus is always on reducing its effect.
The actual speedup (or the effective parallelization) depends strongly on
the ‘tricks’ used during implementation. Although HW and SW parallelism
are interpreted differently, they can even be combined, see Chandy and
Singaraju (1999), resulting in hybrid architectures. For those greatly
different architectural solutions it is even harder to interpret α, while
αeff allows one to compare different implementations (or the same
implementation under different conditions) in such cases, too.
Notice that in all kinds of parallelization the relative overhead fulfills
the observation made by Amdahl: for good performance, the overhead time
cannot exceed some dozens of percent of the execution time of the
parallelized chunk. Between networked computers the control times Ci are in
the millisecond range and a lot of data must be transmitted, so to make the
parallelization efficient, typically complete jobs (sometimes with
durations even in the minutes range) are sent to the other computers. When
using thread-level parallelism through OS services inside the computer, the
“expenses” of organizing threads are in the order of thousands of
instructions, and typically the length of the working threads is also at
least in that order or above. In hardware-level parallelization, when using
hyperthreading, the values of Ci are in the range of 1 clock cycle.
However, the program chunks Pi are also in the order of a few clock cycles,
i.e. in a comparable range. The same holds for speculative evaluation,
out-of-order evaluation, etc.
3. Practical applications
Which metric is appropriate for describing parallelism depends on many
factors, see Karp and Flatt (1990). The newly introduced metric αeff
describes how effectively the computing task is distributed between the
processing units. As outlined above, the control functionality as well as
inequalities in cutting the program into equally long pieces (including
data transfer between processing units) degrade αeff. Since 1−α gives the
sequential-only part of the program, (1 − αeff) is expected to describe the
ratio of the total (even unintended) sequential part, i.e. it is a
sensitive measure of disturbances of the parallelization. Since a larger
load imbalance results in a larger decrease in the value of αeff (as can be
concluded from Karp and Flatt (1990), αeff is a kind of derivative of the
relative speedup), problems can be identified better than from speedup
values. If the parallelization is well-organized (load balanced, small
overhead, right number of processors), αeff saturates at unity, so
tendencies can be better displayed by using (1 − αeff).

[Figure 3: Relative speedup (left side; speedup divided by the number of
processors versus number of processors) and (1 − αeff) (right side; versus
number of processors) values, measured running the audio and radar
processing on different numbers of cores. Series: Audio stream 1, Audio
stream 2, Radar initial, Radar improved. Data from Sheng et al. (2014).]
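Why (1 − αeff) is the more readable quantity near saturation can be illustrated with a short Python sketch. The speedups below are synthetic, generated from the ideal Amdahl formula only to provide example input; they are not measured data. The helper names are illustrative, and `karp_flatt` uses the commonly quoted form of the Karp and Flatt serial fraction.

```python
def one_minus_alpha_eff(speedup: float, k: int) -> float:
    """(1 - alpha_eff), with alpha_eff taken from Eq. (2)."""
    return 1.0 - k / (k - 1) * (speedup - 1.0) / speedup


def karp_flatt(speedup: float, k: int) -> float:
    """Experimentally determined serial fraction of Karp and Flatt."""
    return (1.0 / speedup - 1.0 / k) / (1.0 - 1.0 / k)


def synthetic_speedup(alpha: float, k: int) -> float:
    """Ideal Amdahl speedup, used here only to generate example input."""
    return 1.0 / ((1.0 - alpha) + alpha / k)


if __name__ == "__main__":
    for alpha in (0.99, 0.999):
        for k in (2, 4, 8):
            s = synthetic_speedup(alpha, k)
            rest = one_minus_alpha_eff(s, k)
            print(f"alpha={alpha} k={k} alpha_eff={1.0 - rest:.4f} "
                  f"1-alpha_eff={rest:.2e} Karp-Flatt={karp_flatt(s, k):.2e}")
```

On a linear scale the two αeff series would be nearly indistinguishable from 1, while their (1 − αeff) values differ by an order of magnitude.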
3.1. Characterizing parallelization efforts
Sheng et al. (2014) describe a compiler that effectively parallelizes
originally sequential code for different numbers of cores, and validate it
by running the executable code on platforms having the corresponding number
of cores. Let us apply Eq. (2) to their results, shown in Figs 8 and 10 in
their paper.

The left side of Fig. 3 displays the efficiency (efficiency = speedup
divided by the number of cores) as a function of the number of cores for
two different processings of audio streams, and for two processings of
radar signals. The data displayed in the figures were derived simply by
reading back diagram values from the mentioned figures in Sheng et al.
(2014), so they may not be accurate. However, they are accurate enough to
support our conclusions.

Based on their merit, the authors of Sheng et al. (2014) can only make a
qualitative statement: the ‘efficiency’ decreases less steeply (dots on
the figure) with the growing number of cores if they consider load
balancing (“The higher number of parallel processes in Audio-2 gives
better results”).
It can surely be stated that the improvement was successful: in both cases
the decrease with increasing number of cores is less steep. The diagrams
cannot tell, however, whether further improvements are possible or whether
the parallelization is uniform as a function of the number of cores. In
contrast, the (1−αeff) diagrams (right side) also show that in both cases
the improvement decreased the sequential part, i.e. improved the
parallelization. It can also be seen that in the case of the audio streams
both the parallelization and its uniformity improved. In the case of the
radar signals, without optimization the parallelization decreases as the
number of cores increases. With the load balancing option on, the
parallelization gets better at any number of cores. The compiler really
does a good job: αeff is practically constant, the compiler finds nearly
all possibilities.

[Figure 4: Relative speedup (left side; speedup divided by the number of
processors versus number of processors) and (1 − αeff) (right side; versus
number of processors) values, measured running Linpack on different
computers with different numbers of parallel processors. Series: Cray
Y-MP/8, IBM-3090, Alliant FX/80. Data from Karp and Flatt (1990).]

Note that the absolute values in the two cases must not be compared: they
represent the sequential-only part of the two programs, which may differ
from program to program. The uniformity of the values also makes it highly
probable that in the case of the audio streams further optimization can be
done, at least for the 2-core and 3-core systems, while the processing of
radar signals has reached its bounds. In addition, it can also be estimated
that the non-parallelizable part amounts to ≲ 10%.
[Figure 5: Relative speedup (left side; speedup divided by the number of
processors versus number of processors) and (1 − αeff) (right side; versus
number of processors) values, measured running different algorithms on the
same computer with different numbers of parallel processors. Series: Wave
Motion, Fluid dynamics, Beam stress. Data from Karp and Flatt (1990).]

Notice that using S and αeff are simply two different points of view of
the same thing. If we know how big the fraction α of the code is that can
be executed in parallel (an architectural point of view), we can estimate
the maximum speedup we can reach. Here we assume that all processors have
the same architecture. The experimentalist’s point of view is different:
if we can measure the speedup, and know how many processing units were
used, we can estimate how big a fraction αeff was running in parallel
(assuming the mentioned ideal conditions). Notice also that in deriving
αeff no assumption was made about the code architecture, the nature of the
computing units or the way they are linked, so the merit can be used to
characterize the effect of a change in any one of these components. That
is, one can use the merit to describe the effect of changing the code
architecture, the (behavior of the) interconnecting network, the (internal
architecture of the) hardware setup, etc.
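The experimentalist’s procedure described above, going from measured running times to αeff for each core count, might look like the following Python sketch. The function name `scaling_profile`, the dictionary layout, and the timings in the usage part are hypothetical and invented for illustration only.

```python
from typing import Dict


def scaling_profile(run_times: Dict[int, float]) -> Dict[int, float]:
    """Map each core count k >= 2 to alpha_eff computed from run times.

    run_times[1] must hold the single-core running time; every other
    key is a core count with the running time measured on that many cores.
    """
    t_single = run_times[1]
    profile = {}
    for k, t_k in sorted(run_times.items()):
        if k < 2:
            continue  # Eq. (2) is not defined for a single core
        speedup = t_single / t_k
        profile[k] = k / (k - 1) * (speedup - 1.0) / speedup
    return profile


if __name__ == "__main__":
    # Hypothetical timings in seconds, invented for illustration only.
    before = {1: 12.0, 2: 6.9, 4: 4.4}
    after = {1: 12.0, 2: 6.3, 4: 3.5}
    for label, times in (("before", before), ("after", after)):
        profile = scaling_profile(times)
        print(label, {k: round(a, 3) for k, a in profile.items()})
```

Comparing the two profiles at the same k shows the effect of a change in code, network or hardware; comparing values along k within one profile shows how uniformly the implementation scales.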
3.2. Characterizing HW architectures
In Table I of Karp and Flatt (1990), different architectures are compared
by running the same program (Linpack) on computers from different
manufacturers, having different numbers of processors. Because the subject
of that paper was deriving a metric from measured data, here the precision
of the values is much better. The high degree of parallelization results in
αeff values close to unity, so the value (1 − αeff) is used in Figure 4.

As in the previous case, the efficiency decreases with the increasing
number of cores. The effective parallelization is nearly constant, and the
difference in the absolute values can be attributed to implementation
details of the different computers.
[Figure 6: (1 − αeff) values versus number of processors, measured when
minimizing the Rosenbrock function (left side) and the Rastrigin function
(right side) on the same SoC, using different communication strategies
(series: Ring, Neighbourhood, Broadcast), as a function of the number of
processors used. Data from de Macedo Mourelle et al. (2016).]
3.3. Comparing communication strategies in Systems-on-Chip
In their work, de Macedo Mourelle et al. (2016) compare the different
communication strategies their PSO uses when minimizing the Rosenbrock
function and the Rastrigin function, respectively. As could be expected,
in the case of the ‘broadcast’ type communication the ‘sequential’
fraction increases with the number of processors, while in the other cases
it remains practically constant. The fluctuation shows the limitations of
the (otherwise excellent) measurement precision.

It is worth comparing Fig. 6 with the right side of Fig 5. All three
diagrams show the scaling behavior of some procedures as a function of the
number of processing units. It would be worthwhile to run the processes
shown in Fig 5 in the PSO, to find out the advantage of having the
processing units inside the chip.
3.4. Characterizing scaling of parallelization
In Table II of Karp and Flatt (1990), the execution times of different
programs are given as a function of the number of processors. The data are
shown in Fig. 5. As presented there, the efficiency drops in a catastrophic
way as the number of cores increases, while (1−αeff) changes only within
the limits of the measurement error. Notice that Figs. 3, 4 and 5 use the
same scale, and that a steeper decrease of efficiency means higher values
of (1 − αeff).

The behavior of the efficiency deserves some analysis. As detailed at the
beginning of section 2.1, the distinguished constituent in Amdahl’s classic
analysis is the parallelizable fraction α; all the rest (including wait
time, non-payload activity, etc.) goes into the “sequential-only” fraction.
When using several processors, one of them makes the sequential calculation
while the others are waiting (they use the same amount of time). So, when
calculating the speedup, one calculates

$$ S = \frac{(1-\alpha) + \alpha}{(1-\alpha) + \alpha/k} = \frac{k}{k(1-\alpha) + \alpha} \tag{6} $$

hence the efficiency

$$ \frac{S}{k} = \frac{1}{k(1-\alpha) + \alpha} \tag{7} $$

This explains the behavior of the S/k diagrams as a function of k in the
figures above: the more processors, the lower the efficiency. In the case
of Fig. 5, (1 − α) is in the order of $10^{-3}$, so the efficiency
decreases to 0.5 at $10^{3}$ processors, while in the case of Fig. 3,
(1 − α) is ≈ $10^{-1}$, so the efficiency decreases to 0.5 at $10^{1}$
processors. This is why Amdahl made his very reasonable conclusion: “the
effort expended on achieving high parallel processing rates is wasted
unless it is accompanied by achievements in sequential processing rates of
very nearly the same magnitude”, Amdahl, G. M. (1967).
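The half-efficiency estimate follows from Eq. (7): when k(1 − α) = 1 the efficiency is 1/(1 + α), which is close to 0.5 for α near unity. A minimal Python sketch, with an illustrative function name, evaluates Eq. (7) for the two orders of magnitude mentioned above.

```python
def efficiency(alpha: float, k: int) -> float:
    """Efficiency S/k from Eq. (7): 1 / (k * (1 - alpha) + alpha)."""
    return 1.0 / (k * (1.0 - alpha) + alpha)


if __name__ == "__main__":
    for seq_fraction in (1e-3, 1e-1):
        alpha = 1.0 - seq_fraction
        for k in (1, 10, 100, 1000):
            print(f"1-alpha={seq_fraction:g} k={k:5d} "
                  f"S/k={efficiency(alpha, k):.3f}")
```

For (1 − α) = 10⁻³ the efficiency at k = 1000 is 1/1.999, and for (1 − α) = 10⁻¹ the efficiency at k = 10 is 1/1.9, both close to one half.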
4. Conclusions
With the spread of multi-core architectures using different
parallelization solutions (like different networking or reconfigurable
connection of cores, etc.), and with formerly sequential code being
parallelized either through programmers’ effort or by parallelizing
compilers like the one by Sheng et al. (2014), it is becoming more and more
important to characterize the performance of the parallelization
quantitatively. By inverting the formula known as Amdahl’s law, and
re-interpreting the quantities it comprises, such a figure of merit was
derived. This experimental quantity correctly describes the performance of
parallelization. It allows one to characterize the performance of
programmers or parallelizing compilers (see Fig. 3), different
architectural solutions with many processors (see Fig. 4), and different
algorithms as a function of the number of processors (see Fig. 5), as well
as to describe the performance of the network connection while running the
task, or to quantify the synchronization method used between the computing
units. The introduced merit seems to be an adequate measure of the
performance of the technology used for parallelization, unlike the formerly
used quantity (speedup divided by the number of computing units).
References
Agarwal, V., Hrishikesh, M., Keckler, S., Burger, D., 2000. Clock Rate ver-
sus IPC: The End of the Road for Conventional Microarchitectures. In:
Proceedings of the 27th Annual International Symposium on Computer
Architecture.
Amdahl, G. M., 1967. Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities. In: AFIPS Conference Proceedings.
Vol. 30. pp. 483–485.
Chandy, J. A., Singaraju, J., 1999. Hardware parallelism vs. software
parallelism. In: USENIX Summer Conference.
URL http://static.usenix.org/event/hotpar09/tech/full_papers/chandy/chandy_html
David, T., Guerraoui, R., Trigonakis, V., 2013. Everything you always
wanted to know about synchronization but were afraid to ask. In: Pro-
ceedings of the Twenty-Fourth ACM Symposium on Operating Systems
Principles (SOSP ’13). pp. 33–48.
de Macedo Mourelle, L., Nedjah, N., Pessanha, F. G., 2016. Reconfigurable
and Adaptive Computing: Theory and Applications. CRC press, Ch.
Chapter 5: Interprocess Communication via Crossbar for Shared Mem-
ory Systems-on-chip.
Fuller, S. H., Millett, L. I., 2011. Computing Perfor-
mance: Game Over or Next Level? Computer 44, 31–38,
http://download.nap.edu/cart/download.cgi?&record_id=12980.
Hill, M. D., Marty, M. R., 2008. Amdahl’s Law in the Multicore Era. IEEE
Computer 41 (7), 33–38.
Karp, A. H., Flatt, H. P., May 1990. Measuring parallel processor perfor-
mance. Commun. ACM 33 (5), 539–543.
URL http://doi.acm.org/10.1145/78607.78614
Pingali, K., Nguyen, D., Kulkarni, M., Burtscher, M., Hassaan, M. A.,
Kaleem, R., Lee, T.-H., Lenharth, A., Manevich, R., Méndez-Lojo, M.,
Prountzos, D., Sui, X., Jun. 2011. The Tao of Parallelism in Algorithms.
SIGPLAN Not. 46 (6), 12–25.
URL http://doi.acm.org/10.1145/1993316.1993501
Sheng, W., Schürmans, S., Odendahl, M., Bertsch, M., Volevach, V.,
Leupers, R., Ascheid, G., 2014. A compiler infrastructure for embedded
heterogeneous MPSoCs. Parallel Computing 40, 51–68.
S(o)OS project, 2010. Resource-independent ex-
ecution support on exa-scale systems.
http://www.soos-project.eu/index.php/related-initiatives.
Yang, J., Cui, H., Wu, J., Tang, Y., Hu, G., 2014. Making Parallel Programs
Reliable with Stable Multithreading. Communications of the ACM 57 (3),
58–69.
Yavits, L., Morad, A., Ginosar, R., 2014. The effect of communication and
synchronization on Amdahl’s law in multicore systems. Parallel Computing
40 (1), 1–16.
