Algorithm Parallelization using Software Design Patterns by Vincke, Robbie et al.
ANNUAL JOURNAL OF ELECTRONICS,  2013,  ISSN 1314-0078 
Algorithm Parallelization using Software Design 
Patterns 
 
Robbie Vincke, Nico De Witte, Sille Van Landschoot, Eric Steegmans and 
Jeroen Boydens 
 
Abstract – Multi-core systems are becoming mainstream. 
However, it is still a big challenge to develop concurrent 
software. Parallel Design Patterns can help in the migration 
process from legacy sequential to high-performing parallel 
code. Therefore we propose a layered model of parallel design 
patterns. When going through the layered model in a top-
down approach, the developer is guided through the 
transition from sequential to parallel code. The value of the 
layered model is shown using a cycle/chain-detection 
algorithm. Two different design pattern approaches are used 
in order to compare performance impact. First, an 
application is implemented with a minimum of algorithm 
modifications using the Map-Reduce design pattern. Next, the 
same algorithm is rewritten using graph theory. 
 
Keywords – Multi-core, Parallel Design Patterns, Algorithm 
Parallelization 
 
I. INTRODUCTION 
 
 The demand for high-performance embedded systems has 
increased significantly in recent years. Moore’s Law is 
commonly stated as a doubling of the number of transistors 
on a single chip every 18 months [1]. Due to limiting 
physical factors, it is no longer possible to keep increasing 
the number of transistors on a single processor core. Chip 
manufacturers have therefore moved to multi-core architectures. 
At first, the shift from single core to multi-core was mainly 
visible in high-performance and desktop computing. 
Today, even manufacturers of embedded systems are 
shifting to multi-core architectures for a wide variety of 
applications. Moreover, these architectures are needed to 
meet increasing performance requirements. 
  
 However, writing good and efficient parallel software is 
still a challenge. Incorrect use of locking mechanisms can 
introduce deadlocks and unsynchronized access, which in 
turn can lead to non-deterministic results. Since such 
low-level locking mechanisms should be avoided, a higher 
abstraction level is needed. This is where parallel design 
patterns prove their value. 
 
 Section II briefly describes the use of parallel design 
patterns for multi-core embedded software. Section III 
describes a case study algorithm, followed by an overview 
of the parallel implementations of the algorithm in 
Section IV. Section V summarizes the performance 
measurements of the different implementations. The testing 
strategy for migrating from sequential software to parallel 
software is described in Section VI. Section VII provides 
pointers for future work and Section VIII concludes the paper. 
 
II. PARALLEL DESIGN PATTERNS 
 
 A design pattern is the description of a general solution 
to a recurring problem [2]. Parallel design patterns differ 
from regular design patterns, as described in Gamma et al. 
[2], in that they are structured in abstraction levels, 
ranging from design patterns for low-level implementation 
techniques to design patterns with high-level architectural 
models. In [3] we proposed a layered structure for parallel 
embedded software based on Our Pattern Language 
(OPL) [4]. OPL is used for high-performance computing 
applications. The proposed structure consists of multiple 
layers, as shown in Figure 1, which should be applied in a 
top-down approach. 
 
[Figure 1 shows the layered model between SEQUENTIAL SOFTWARE and 
PARALLEL SOFTWARE. Labels in the figure: Finding Concurrency Design 
Space, Architectural Patterns, Structural Patterns, Algorithm Strategy 
Patterns, Implementation Strategy Patterns, Supporting Structures, 
Parallel Execution Design Space, Optimization Design Space, New 
Concept, Interface, Tests.] 
 
FIGURE 1. LAYERED MODEL OF PARALLEL DESIGN PATTERNS 
 
III. CASE STUDY 
 
 The case study is based on an algorithm that processes 
input for a laser cutter. The algorithm used for 
parallelization and optimization is based on coupling 
vectors to detect chains of elements. A list of elements, 
restricted to lines and arcs, is the input for the algorithm. 
The output must be a list of cycles and chains, where each 
consists of one or more elements. The direction of vectors 
and arcs does not matter, since the resulting cut path is the 
same. An example is shown in Figure 2.  
 
 
{Robbie.Vincke, Nico.DeWitte, Sille.VanLandschoot}@khbo.be 
are scientific staff members at KHBO funded by IWT-110174. 
Eric.Steegmans@cs.kuleuven.be is a professor and faculty member 
at KU Leuven, dept. CS, research group DISTRINET. 
Jeroen.Boydens@cs.kuleuven.be is a professor at KHBO and an 
affiliated researcher at KU Leuven, dept. CS, research group 
DISTRINET. 
 
  
 
 
FIGURE 2. (A) ILLUSTRATES AN INPUT LIST OF VECTORS. (B) 
ILLUSTRATES THE CORRESPONDING OUTPUT CYCLE. 
 
 Subsection A briefly describes the platform used. 
Subsection B explains the original sequential 
implementation of the algorithm. 
 
LISTING 1: PSEUDOCODE OF THE ALGORITHM 
 1  while(inputList != empty){ 
 2    if(currentElement has no subsequent 
 3       element in both directions){ 
 4      currentChain = new Chain(); 
 5      currentChain.Add(inputList[0]); 
 6      inputList.remove(0); 
 7      currentElement = currentChain[0]; 
 8    } else { 
 9      e = getSubsequentElement(); 
10      currentChain.Add(e); 
11      inputList.remove(e); 
12      currentElement = e;}} 
 
 
A. Platform 
 
 All experiments are conducted on an Intel Core i7-
2720QM (2.2 GHz) quad-core processor with support for 
Hyper-Threading. 
 
B. Reference Implementation 
 
 The original sequential implementation is a loop 
(Listing 1, line 1) that checks every element for a 
subsequent element in both directions, that is, one 
preceding or following the current element (Listing 1, 
line 2). If such an element is found, it is added to the 
current chain. After iterating over the complete list of 
elements, the current chain is finished and a new one is 
created. 
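
 The reference loop can be sketched in Python as follows. This is a minimal illustration, not the original implementation: it assumes that every line or arc is reduced to its two end points and that two elements connect when they share an identical end point (the paper matches within a certain precision). The names detect_chains, get_subsequent_element, Point and Element are chosen here for illustration, and the single loop of Listing 1 is written as two nested loops for readability.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]
Element = Tuple[Point, Point]  # a line or arc reduced to its two end points


def get_subsequent_element(ends: List[Point],
                           remaining: List[Element]) -> Optional[Element]:
    """Linear search for an element that touches either open end of the chain."""
    for element in remaining:
        if element[0] in ends or element[1] in ends:
            return element
    return None


def detect_chains(elements: List[Element]) -> List[List[Element]]:
    """Group elements into chains (or cycles) of connected elements."""
    remaining = list(elements)          # work on a copy of the input list
    chains = []
    while remaining:
        chain = [remaining.pop(0)]      # start a new chain
        ends = list(chain[0])           # its two open end points
        while True:
            nxt = get_subsequent_element(ends, remaining)
            if nxt is None:             # nothing connects: chain is finished
                break
            remaining.remove(nxt)
            chain.append(nxt)
            # The matched end point is replaced by the element's other end.
            matched, other = (nxt[0], nxt[1]) if nxt[0] in ends else (nxt[1], nxt[0])
            ends[ends.index(matched)] = other
        chains.append(chain)
    return chains
```

 Because only end points are compared, the direction in which an element was drawn does not influence the result, which matches the requirement stated in Section III.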
 
IV. PARALLEL IMPLEMENTATION 
 
 The first step is to determine, using the layered model 
as reference, which parts of the algorithm require the most 
execution time. Parallelization attempts then focus on 
these processing-intensive parts. 
 
 Profiling tools from Visual Studio showed that functions 
equivalent to getSubsequentElement() in Listing 1 required 
more than 90% of CPU time. These functions are search 
routines, which are currently implemented as a basic linear 
search. 
 
 This search was re-implemented as parallel code without 
changing the search algorithm itself, so it is still a 
linear search. Changing the algorithm completely would 
cause a speedup due to algorithm optimization instead of 
parallelization, whereas parallel speedup is what this 
paper aims for. We implemented the algorithm in two 
different ways: one using Map/Reduce as a high-level 
structural pattern, which is explained in Section IV.A, 
and one using graph theory at a high level, which is 
described in Section IV.B. 
 
A. Map/Reduce Based Implementation 
 
 The parallel implementation using the Map/Reduce 
Design Pattern [5] is similar to the reference 
implementation. The algorithm itself has hardly changed. 
In the sequential implementation the input list was fully 
iterated by one processor core. In the parallel 
implementation the original list of elements is divided into 
sub lists. Each sub list is mapped to a specific processor 
core or thread. This is done using the Task Parallel Library 
(TPL) [6], available since .NET Framework 4. 
 
 When a subsequent shape is found, the parallel task 
returns a value that contains it. Since more than one 
subsequent element can be found, an additional reduce step 
is required. To merge the results of the parallel tasks, 
the sub results of all tasks are stored in an array. Next, 
the array is iterated to add all subsequent elements to the 
current chain. The Map/Reduce design pattern is illustrated 
in Figure 3. 
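
 The map and reduce steps described above can be sketched as follows. The original implementation uses the .NET TPL; this sketch uses Python's concurrent.futures as a stand-in, and the names search_sublist and find_subsequent_parallel, the workers parameter and the end-point representation of elements are assumptions made for illustration. In CPython, threads do not run Python bytecode in parallel, so the sketch shows the structure of the pattern rather than its speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def search_sublist(current, sublist):
    """Map step: linear search of one sub list for elements touching `current`."""
    return [e for e in sublist if e != current and set(e) & set(current)]


def find_subsequent_parallel(current, elements, workers=4):
    """Split the list into sub lists, search them concurrently, merge the results."""
    size = max(1, -(-len(elements) // workers))      # ceiling division
    sublists = [elements[i:i + size] for i in range(0, len(elements), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One sub result per sub list, returned in sub list order.
        sub_results = list(pool.map(lambda s: search_sublist(current, s), sublists))
    merged = []                                      # reduce step
    for sub_result in sub_results:
        merged.extend(sub_result)
    return merged
```

 Each task only reads the shared input and writes its own sub result, so the tasks need no locks. The merge runs sequentially after all tasks have finished.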
 
 On a lower level TPL uses the thread pool design 
pattern [5]. This pattern initially creates a set of worker 
threads. Next, a shared queue is filled with tasks. Each 
worker thread pops a task from the shared queue for 
execution. This limits the overhead of thread creation to a 
minimum since the set of worker threads is only created 
once for all tasks. 
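
 The thread pool pattern itself can be illustrated with a short sketch. It is not how TPL is implemented internally. It only mirrors the steps named above: create the workers once, fill a shared queue, and let each worker take tasks until none are left. The function name run_thread_pool and the use of None as a stop marker are assumptions of this sketch.

```python
import queue
import threading


def run_thread_pool(tasks, num_workers=4):
    """Run zero-argument callables on a fixed set of worker threads."""
    task_queue = queue.Queue()
    results = [None] * len(tasks)

    def worker():
        while True:
            item = task_queue.get()
            if item is None:            # stop marker: no more work
                break
            index, task = item
            results[index] = task()     # each slot is written by one worker only

    # The worker threads are created once, before any task is queued.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))
    for _ in threads:
        task_queue.put(None)            # one stop marker per worker
    for thread in threads:
        thread.join()
    return results
```

 For example, run_thread_pool([lambda i=i: i * i for i in range(10)]) runs ten small tasks on four threads and returns their results in task order.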
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
[Figure 3: a DATASET is split into Data A, Data B, Data C and Data D, 
each part is processed by its own CPU core, and the outputs are 
combined into one Result.] 
FIGURE 3. MAP/REDUCE DESIGN PATTERN 
B. Graph Theory Implementation 
 
 Connecting elements to each other in order to detect 
cycles (or chains of elements) maps naturally onto graph 
theory [7]. Each element (regardless of whether it is a line 
or an arc) can be seen as an edge. In graph theory, edges 
connect nodes or vertices to each other. In our algorithm a 
node is represented as the area (within a certain 
precision) where two elements interconnect. 
 
 Representing the list of elements as a graph allows us to 
use the mathematical properties of graph theory. Since the 
direction of the vectors is unimportant (for example in 
Figure 2.B) and cycles occur in the list, we can use the 
mathematical properties of “cyclic undirected graphs”.  
 
 As illustrated in Figure 4, a graph can be represented by 
an adjacency matrix A. An input list of length n can be 
represented as an n x n square matrix in which each element 
corresponds to one row and one column. Since the graph is 
undirected, the adjacency matrix is symmetric [7]. The value 
at position A_ij of the adjacency matrix is ‘1’ if vector i 
and vector j are adjacent to each other, otherwise it is ‘0’. 
 
FIGURE 4. GENERATING THE ADJACENCY MATRIX OF AN 
UNDIRECTED GRAPH 
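
 A minimal sequential sketch of the matrix generation is given below. It assumes that each element is a pair of end points and that two elements are adjacent when they share an identical end point, and it sets the diagonal to 0, which the text does not specify. The name build_adjacency_matrix is chosen for illustration.

```python
def build_adjacency_matrix(elements):
    """Return the n x n matrix A with A[i][j] == 1 if elements i and j touch."""
    n = len(elements)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: an element is not adjacent to itself.
            if i != j and set(elements[i]) & set(elements[j]):
                matrix[i][j] = 1
    return matrix
```

 The adjacency test does not depend on the order of its two arguments, so the resulting matrix is symmetric, as expected for an undirected graph.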
 
 In order to find cycles and chains, the adjacency matrix 
is traversed. This can be done with a depth-first or a 
breadth-first search algorithm. In a depth-first search, a 
root element of the graph is selected and each branch is 
followed as far as possible, until a complete cycle or 
chain is found. In a breadth-first search, by contrast, a 
root element is selected first and all of its neighboring 
nodes are visited next. Since the breadth-first algorithm 
could provide non-deterministic results, we do not use it. 
 
 The software implementation of the depth-first algorithm 
acts in two stages. The first stage generates the adjacency 
matrix of the input list. The second stage visits the 
adjacency matrix in order to find cycles or chains. Note 
that the creation of this adjacency matrix is an additional 
calculation step in comparison to the previously used 
algorithm. The two-stage process is shown in Figure 5. 
  
 
FIGURE 5. (A) REPRESENTS THE INPUT LIST. (B) REPRESENTS THE 
GENERATED ADJACENCY MATRIX. (C) REPRESENTS THE FOUND 
CYCLE. 
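
 The second stage can be sketched as an iterative depth-first traversal of the adjacency matrix. The sketch assumes the matrix is given as a list of rows of 0 and 1 values and returns each chain or cycle as a list of element indices. The name find_chains and the choice to visit the lowest-numbered neighbor first are assumptions of this illustration.

```python
def find_chains(matrix):
    """Depth-first traversal: one list of element indices per chain or cycle."""
    n = len(matrix)
    visited = [False] * n
    chains = []
    for root in range(n):
        if visited[root]:
            continue                    # element already belongs to a chain
        chain = []
        stack = [root]
        while stack:
            i = stack.pop()
            if visited[i]:
                continue
            visited[i] = True
            chain.append(i)
            # Push neighbors in reverse order so the lowest index is visited first.
            for j in range(n - 1, -1, -1):
                if matrix[i][j] == 1 and not visited[j]:
                    stack.append(j)
        chains.append(chain)
    return chains
```

 For a square of four elements plus one unconnected element, the traversal yields one chain containing the four square elements and one chain containing the single remaining element.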
 
 In the first stage, the generation of the adjacency matrix, 
a nested loop iterates over all elements and checks 
whether any other element is adjacent to the current 
element. The nested loops are easily parallelizable using 
the Parallel Loop design pattern [5], as illustrated in 
Figure 6. Iterations are spread over the available number of 
cores or threads. The operations inside these iterations are 
independent of each other, so no synchronization is needed. 
 
[Figure 6: the iterations of For(i=0;i<8;i++){…} are spread over four 
cores: CPU0 runs i=0 and i=4, CPU1 runs i=1 and i=5, CPU2 runs i=2 and 
i=6, CPU3 runs i=3 and i=7.] 
 
 
FIGURE 6. PARALLEL LOOP DESIGN PATTERN 
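
 As a sketch of the Parallel Loop pattern applied to the first stage, the outer loop over rows can be handed to a pool of workers. The original code uses the .NET TPL. Python's concurrent.futures is used here as a stand-in, and adjacency_row, build_adjacency_matrix_parallel and the end-point representation of elements are assumptions. In CPython the threads illustrate the structure only, since they do not execute Python bytecode in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def adjacency_row(i, elements):
    """Compute row i of the adjacency matrix; independent of all other rows."""
    return [1 if i != j and set(elements[i]) & set(elements[j]) else 0
            for j in range(len(elements))]


def build_adjacency_matrix_parallel(elements, workers=4):
    """Spread the iterations of the outer loop over a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the rows in index order, whichever worker ran them.
        return list(pool.map(lambda i: adjacency_row(i, elements),
                             range(len(elements))))
```

 Every iteration reads the shared input list and produces its own row, which is why the iterations need no synchronization between them.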
 
  
V. MEASUREMENTS 
 
 Measurements are made using different input element 
lists. The original sequential implementation, as described 
in Section III, is used as the reference implementation 
for computing the parallel speedup of the Map/Reduce 
based implementation. A sequential variant of the Graph 
Theory implementation is used as the reference 
implementation for computing the speedup of the parallel 
Graph Theory algorithm. The relative speedups are shown 
in Figure 7. 
 
 Using the Map/Reduce strategy there is no speedup for 
small input datasets. This is expected, since the overhead 
of creating threads is large compared to the workload of 
each thread [8]. From approximately 1000 input elements 
onward, it becomes useful to use a multithreaded 
environment for the Map/Reduce implementation. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 With the Graph Theory implementation, by contrast, 
parallel speedup is achieved even for small input datasets. 
Moreover, the parallel Graph Theory implementation 
scales better to a quad-core environment, since its speedup 
curve flattens at close to 3.5x, whereas that of the 
Map/Reduce implementation flattens at close to 2.5x. 
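
 The two plateau values can be related to the number of cores through the usual definitions of speedup and parallel efficiency. The symbols S, E and p are introduced here for illustration and are not used in the measurements themselves. The calculation assumes p = 4 physical cores and ignores Hyper-Threading.

```latex
\[
  S = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}}, \qquad E = \frac{S}{p}
\]
\[
  E_{\mathrm{graph}} \approx \frac{3.5}{4} \approx 0.88, \qquad
  E_{\mathrm{map/reduce}} \approx \frac{2.5}{4} \approx 0.63
\]
```

 Here T_seq is the execution time of the respective sequential reference and T_par that of the parallel version, so the two efficiencies are measured against different baselines.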
   
 
 
 Figure 8 shows the absolute execution times for both 
algorithms in seconds. It is clear that the Graph Theory 
implementation is generally slower (except for very small 
input datasets). 

VI. TESTING 
 
 The original sequential algorithm was developed using a 
test-driven strategy [9]. The unit tests used for the 
sequential algorithm are reused when migrating to parallel 
software. If a function named f2() is being parallelized, a 
new unit test named test_parallel_f2() is added to the test 
suite. This test has exactly the same implementation as the 
original unit test test_f2(). Next, we implemented just 
enough functionality in the parallel function to make 
test_parallel_f2() pass. All unit tests for functions 
relying on f2() should still pass with the parallel version 
of f2(). No changes should be made to the unit tests 
themselves. 
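
 The reuse of a unit test for the parallel variant can be sketched as follows. The body of f2() is not given in the text, so a squaring function is used here purely as a placeholder. The helper check_f2 and the name parallel_f2 are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def f2(values):
    """Sequential reference (placeholder body: squares every value)."""
    return [v * v for v in values]


def parallel_f2(values):
    """Parallel variant that must satisfy exactly the same test."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda v: v * v, values))


def check_f2(implementation):
    """The shared test body; it is not changed during the migration."""
    assert implementation([1, 2, 3]) == [1, 4, 9]
    assert implementation([]) == []


def test_f2():
    check_f2(f2)


def test_parallel_f2():
    check_f2(parallel_f2)
```

 Keeping the assertions in one shared body ensures that the sequential and the parallel test cannot drift apart.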
 
 The complete unit-test suite is retained when migrating 
from the sequential to the parallel Map/Reduce and the 
graph theory versions. For the latter, additional unit tests 
were needed to check the correct generation of the 
adjacency matrix. 
 
 
VII. FUTURE WORK 
 
 First, in order to test the correct execution of all 
implementations, back-to-back tests should be 
implemented. The output file of the original algorithm 
should be compared automatically to the output files of the 
parallel implementations. This would improve the 
portability and testability of future implementations. 
 
 Next, different search algorithms could provide an even 
more notable speedup. 
 
VIII. CONCLUSION 
 
 We have shown that a notable performance speedup is 
achievable when applying the layered top-down model to 
parallelize an existing sequential implementation. We have 
also shown that different design patterns can differ in 
terms of parallel speedup and execution time. The Graph 
Theory implementation scales very well for almost all sizes 
of input datasets, but in terms of absolute execution time 
it is still slower than the Map/Reduce equivalent. This can 
be explained by the additional stage needed to generate the 
adjacency matrix. 
 
 As previously mentioned, the goal of the layered top-
down model was to avoid low-level locking mechanisms. In 
both parallel implementations no locking or 
synchronization primitives were needed. 

REFERENCES 
[1] G. E. Moore, “Cramming more components onto integrated 
circuits,” Electronics Magazine, vol. 38, no. 8, p. 4, 1965. 
[2] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design 
Patterns: Elements of Reusable Object-Oriented Software. 
Addison-Wesley, 1995. 
[3] R. Vincke, S. Van Landschoot, E. Steegmans, and J. Boydens, 
“Refactoring Sequential Embedded Software for Concurrent 
Execution Using Design Patterns,” in Annual Journal of 
Electronics, 2012. 
[4] K. Keutzer and T. G. Mattson, “Our Pattern Language (OPL): 
A Design Pattern Language for Engineering (Parallel) Software.” 
2009. 
[5] T. G. Mattson, B. A. Sanders, and B. Massingill, Patterns for 
parallel programming. Addison-Wesley, 2005. 
[6] Microsoft, “Task Parallel Library (TPL).” [Online]. Available: 
http://msdn.microsoft.com/en-us/library/dd460717.aspx. 
[7] C. Godsil and G. F. Royle, Algebraic Graph Theory 
(Graduate Texts in Mathematics). Springer, 2001, p. 462. 
[8] J. Dean and S. Ghemawat, “MapReduce: simplified data 
processing on large clusters,” Communications of the ACM, 
vol. 51, no. 1, pp. 107-113, 2008. 
[9] K. Beck, Test Driven Development (The Addison-Wesley 
Signature Series). Addison Wesley, 2002, p. 240.  
 
FIGURE 8. ABSOLUTE EXECUTION TIME 
FIGURE 7. RELATIVE PARALLEL SPEEDUP 
