Unified Identification of Multiple Forms of Parallelism in Embedded Applications by Aguilar, Miguel Angel & Leupers, Rainer
Unified Identification of Multiple Forms of
Parallelism in Embedded Applications
Miguel Angel Aguilar, Rainer Leupers
Institute for Communication Technologies and Embedded Systems
RWTH Aachen University, Germany
{aguilar, leupers}@ice.rwth-aachen.de
Abstract—The use of Multiprocessor Systems on Chip (MP-
SoCs) is a common practice in the design of state-of-the-art
embedded devices, as MPSoCs provide a good trade-off between
performance, energy and cost. However, programming MPSoCs
is a challenging task, which currently involves multiple manual
steps. Although several research efforts have addressed this
challenge, there is not yet a widely accepted solution. In this
work, we describe an approach to automatically extract multiple
forms of parallelism from sequential embedded applications in
a unified manner. We evaluate the applicability of our work by
parallelizing multiple embedded applications on two commercial
platforms.
I. MOTIVATION
Multicore systems have emerged as a response to the
demands of applications in the embedded industry such as
performance, power and cost. In order to enable the full
potential of these systems, applications have to be properly
parallelized. However, current programming practices for describing
applications in a parallel paradigm involve multiple manual steps
that result in low software productivity: identifying
computationally intensive code sections, finding data
dependencies, identifying parallelism patterns and extracting
the most profitable ones. This challenge has motivated research
efforts in two complementary directions: paradigms for par-
allel programming and tools for parallelization. The approach
presented in this work falls into the latter category. The aim is to
provide a parallelization toolflow that fulfills the needs of a
realistic industrial environment.
II. BACKGROUND
Several programming paradigms have been proposed to
express parallelism, such as OpenMP [1] and MPI [2]. In
the embedded domain, Dataflow Models of Computation
(MoCs) have gained acceptance [3]. Despite these efforts, the
developer still has the cumbersome task of parallelizing the
code manually. Therefore, academic and industrial research
efforts have focused on tools for parallelization such as the
MPSoC Application Programming Studio (MAPS) [4]. How-
ever, there is not yet a widely accepted solution and many
issues remain open: i) multiple patterns of parallelism have to
be explored [5]; ii) frameworks based on static analysis often
fail to extract parallelism due to the use of pointers [6]; iii)
static analysis is not accurate enough to perform a cost-benefit
analysis of parallelization opportunities [6]; iv) parallelizing
frameworks typically do not consider the characteristics of the
underlying embedded platform [3].
Fig. 2: Parallelization Toolflow
III. PARALLELIZATION APPROACH
In order to address the challenges described previously,
we have extended the parallelization toolflow of the MAPS
framework, which identifies multiple forms of parallelism in
a unified way. Fig. 2 shows the main components of our
approach. It starts with a sequential C application, a model of
the MPSoC and constraints provided by the developer. Then
a program model that contains both dependency and perfor-
mance information is built in a hybrid fashion by combining
static (at compile-time) and dynamic (at run-time) analyses.
Performance information helps to identify computationally
intensive functions (called here parallelization candidates) to
focus the analysis on them, and thus reduce the problem
space. The parallelization candidates are identified based on a
user-defined threshold, which defines the minimum workload
required to consider a given function as a candidate.
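The threshold-based selection of parallelization candidates can be sketched as follows. This is a minimal illustration, not the MAPS implementation: the `FunctionProfile` record, the `select_candidates` helper, the function names and the profile values are all hypothetical, and the workload is assumed here to be each function's share of the total profiled execution time.

```python
from dataclasses import dataclass


@dataclass
class FunctionProfile:
    """Hypothetical per-function entry of the program model."""
    name: str
    workload: float  # assumed: share of total execution time, in [0, 1]


def select_candidates(profiles, threshold):
    """Keep functions whose workload meets the user-defined threshold,
    heaviest first."""
    selected = [p for p in profiles if p.workload >= threshold]
    return sorted(selected, key=lambda p: p.workload, reverse=True)


# Made-up profile of a small decoder, for illustration only.
profiles = [
    FunctionProfile("parse_header", 0.02),
    FunctionProfile("inverse_dct", 0.55),
    FunctionProfile("color_convert", 0.30),
    FunctionProfile("write_output", 0.13),
]

candidates = select_candidates(profiles, threshold=0.25)
print([p.name for p in candidates])  # ['inverse_dct', 'color_convert']
```

Raising the threshold shrinks the set of functions that the later analyses have to inspect, which is the problem-space reduction described above.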
Once the program model is built, it is analyzed by algo-
rithms that try to expose multiple forms of parallelism [3],
[7]. Fig. 3 shows the forms of parallelism considered in this
work. In Task Level Parallelism (TLP), a computation is split
into multiple parallel tasks that operate on different data sets,
as Fig. 3a shows. Data Level Parallelism (DLP) is a form
of parallelism typically found in scientific and multimedia
applications, in which the iteration space of a given loop is
split into multiple tasks as long as there are no loop-carried
dependencies. As Fig. 3b shows, in this form a given loop
computation is replicated into multiple parallel tasks that
operate on different data sets. Finally, in Pipeline Level
Parallelism (PLP) a computation within a loop is broken into
a sequence of processes (called pipeline stages), which can
be executed in parallel, as Fig. 3c illustrates. The resulting
parallelization information is presented to the developers in
the form of source-level annotations, in order to guide the
process of deriving a parallel representation of the application
in the selected parallel paradigm.
Fig. 3: Parallelism Patterns. (a) TLP, (b) DLP, (c) PLP. DS: Data Set, T: Task, R: Result
Fig. 1: Speedup Results. (a) Nexus Tablet, Quad-Core ARM MPSoC; (b) TI Keystone Platform, Eight-core DSP MPSoC. Vertical axis: speedup; horizontal axis: benchmarks and their average.
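The three patterns can be illustrated with a small sketch. The toolflow itself analyzes sequential C code; Python is used here only for brevity, and the functions `tlp`, `dlp` and `plp` as well as the toy computations are assumptions made for this example. In CPython, threads mainly show the structure of each pattern rather than deliver the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from threading import Thread


def tlp(data_a, data_b):
    """TLP: two different tasks run concurrently on different data sets."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        total = pool.submit(sum, data_a)
        largest = pool.submit(max, data_b)
        return total.result(), largest.result()


def dlp(data, workers=4):
    """DLP: the iteration space of `for x in data: out.append(x * x)` is
    split into chunks. This is legal because no iteration reads a value
    produced by another one (no loop-carried dependency)."""
    chunk = max(1, (len(data) + workers - 1) // workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda part: [x * x for x in part], parts))
    return [y for part in results for y in part]


def plp(data):
    """PLP: the loop body `out.append((x + 1) * 2)` is broken into two
    pipeline stages connected by queues, so stage 2 can handle one item
    while stage 1 already works on the next."""
    done = object()  # sentinel marking the end of the stream
    q1, q2 = Queue(), Queue()

    def stage1():
        for x in data:
            q1.put(x + 1)
        q1.put(done)

    def stage2():
        for item in iter(q1.get, done):
            q2.put(item * 2)
        q2.put(done)

    threads = [Thread(target=stage1), Thread(target=stage2)]
    for t in threads:
        t.start()
    out = list(iter(q2.get, done))
    for t in threads:
        t.join()
    return out


print(tlp([1, 2, 3], [4, 9, 5]))  # (6, 9)
print(dlp(list(range(6))))        # [0, 1, 4, 9, 16, 25]
print(plp([1, 2, 3]))             # [4, 6, 8]
```

A loop such as `acc[i] = acc[i - 1] + x[i]` would not qualify for the `dlp` treatment, since each iteration depends on the previous one; a pipeline decomposition may still apply when the loop body splits into stages.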
IV. EXPERIMENTAL EVALUATION
The toolflow was implemented on top of the LLVM compiler
framework and integrated into the Eclipse GUI. In order to
evaluate the performance gains of our approach, we analyzed
representative embedded applications on two commercial de-
vices: i) the Android Nexus 7 tablet [8], which is based
on a quad-core MPSoC; ii) the Keystone multicore DSP
platform from Texas Instruments, which is an eight-core DSP
MPSoC [9]. Using the parallelization hints of our toolflow,
we derived a parallel representation of each benchmark using
a commercial language called C for Process Networks (CPN),
which is provided by Silexica Software Solutions GmbH [10].
Table I presents the characteristics of the benchmarks in
terms of the number of lines of code, total number of functions,
number of parallelization candidates, and forms of parallelism
exploited. We can observe from this table that in large
benchmarks (e.g. PNG and WebP) the number of parallelization
candidates is small compared to the total number of functions.
This observation supports the idea that our approach is scalable,
as we only focus on computationally intensive functions.
Fig. 1 shows the speedup results, which were obtained by
computing the ratio between the sequential execution time
and the parallel execution time on each platform.
We can observe from this figure that the benchmarks that
exploit DLP scale better than the others. The reason for this
is that loops that present DLP can be well distributed across
all the available cores in the platforms, which leads to a
high parallelization efficiency. On the Nexus Tablet
we obtained an average speedup of 2.3x, while on the
multicore DSP platform we obtained an average speedup
of 4.5x.
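The speedup metric, and a derived per-core efficiency, can be worked through as a short sketch. The efficiency definition (speedup divided by core count) is the common textbook one and is an assumption here, since the text does not define it; the execution times in the first call are made up, and the efficiency figures assume that all cores of each platform were used.

```python
def speedup(t_sequential, t_parallel):
    """Ratio between sequential and parallel execution time."""
    return t_sequential / t_parallel


def efficiency(speedup_value, cores):
    """Assumed definition: speedup normalized by the number of cores."""
    return speedup_value / cores


# Hypothetical timings: 9.0 s sequential, 2.0 s parallel.
print(speedup(9.0, 2.0))             # 4.5

# Reported average speedups and the core counts of the two platforms.
print(round(efficiency(2.3, 4), 4))  # 0.575  (quad-core Nexus tablet)
print(round(efficiency(4.5, 8), 4))  # 0.5625 (eight-core Keystone DSP)
```

Under these assumptions both platforms reach a similar average efficiency of roughly 56 to 58 percent, even though the absolute speedups differ.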
TABLE I: Characteristics of the Benchmarks
Benchmark        LOC   Functions   Candidates   Parallelism
Beamformer       2K    7           2            DLP
Blowfish         1K    4           2            DLP
Edge Detection   1K    4           2            TLP, DLP
JPEG Decoder     2K    32          3            PLP
LTE              4K    39          1            DLP
PNG Decoder      27K   170         8            PLP
Trellis          1K    14          1            DLP
WebP Decoder     23K   149         6            PLP
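The scalability observation can be checked with a few lines of arithmetic over Table I. The sketch below copies the function and candidate counts from the table; the `candidate_share` helper is a name introduced only for this example.

```python
# (benchmark, total functions, parallelization candidates) from Table I
TABLE_I = [
    ("Beamformer", 7, 2),
    ("Blowfish", 4, 2),
    ("Edge Detection", 4, 2),
    ("JPEG Decoder", 32, 3),
    ("LTE", 39, 1),
    ("PNG Decoder", 170, 8),
    ("Trellis", 14, 1),
    ("WebP Decoder", 149, 6),
]


def candidate_share(functions, candidates):
    """Percentage of a benchmark's functions that are candidates."""
    return 100.0 * candidates / functions


shares = {name: candidate_share(f, c) for name, f, c in TABLE_I}
for name, share in shares.items():
    print(f"{name:<15} {share:5.1f}%")
```

For the two largest benchmarks, PNG Decoder and WebP Decoder, fewer than 5 percent of the functions are candidates, so the detailed analysis touches only a small part of the code base.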
V. SUMMARY
In this work we described an approach that identifies
multiple forms of parallelism from embedded applications.
The approach is based on a program model that is built
using static information and dynamic information. We reduced
the problem space by analyzing computationally intensive
functions only. The applicability was evaluated by parallelizing
embedded benchmarks on two commercial platforms. On the
Nexus Tablet we obtained an average speedup gain of 2.3x,
while on the Keystone platform we obtained an average
speedup gain of 4.5x. In the future, we are going to extend
the toolflow to heterogeneous MPSoCs. Moreover, we are
planning to consider energy as a new optimization goal.
REFERENCES
[1] “The OpenMP API Specification for Parallel Programming,” [Online]
Available http://www.openmp.org (accessed 01/2015).
[2] Message Passing Interface Forum, “MPI: A message-passing interface standard,” Knoxville,
TN, USA, Tech. Rep., 1994.
[3] J. Castrillon et al., Programming Heterogeneous MPSoCs: Tool Flows
to Close the Software Productivity Gap. Springer, 2014.
[4] W. Sheng et al., “A compiler infrastructure for embedded heterogeneous
MPSoCs,” Parallel Comput., vol. 40, no. 2, pp. 51–68, Feb. 2014.
[5] M. Islam, “On the limitations of compilers to exploit thread-level
parallelism in embedded applications,” in Computer and Information
Science, 6th IEEE/ACIS International Conference on, July 2007, pp.
60–66.
[6] G. Tournavitis, “Profile-driven parallelization of sequential programs,”
Ph.D. dissertation, University of Edinburgh, 2011.
[7] M. A. Aguilar et al., “Parallelism extraction in embedded software for
Android devices,” in Proceedings of the XV International Conference on
Embedded Computer Systems: Architectures, Modeling and Simulation,
ser. SAMOS XV, Jul. 2015.
[8] “Nexus 7 (2013),” [Online] Available http://www.asus.com/Tablets
Mobile/Nexus 7 2013/ (accessed 01/2015).
[9] “Keystone Multicore Devices,” [Online] Available
http://processors.wiki.ti.com/index.php/Multicore (accessed 10/2015).
[10] “Silexica Software Solutions GmbH,” [Online] Available
http://www.silexica.com (accessed 4/2015).
