A comparative study of the effects of parallelization on ARM and Intel based platforms by Fellows, Kurt
A COMPARATIVE STUDY OF THE EFFECTS OF PARALLELIZATION ON ARM AND INTEL BASED PLATFORMS
BY KURT M. FELLOWS
THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2014
Urbana, Illinois
Advisers: Professor Josep Torrellas, Professor Sayan Mitra
Abstract 
 
With the enormous growth in popularity of mobile devices in the past decade, there has been a 
large push in industry for chip designers and manufacturers to develop powerful yet energy 
efficient processors. Increasing the parallelism available in the hardware has proven to be a great 
way to maintain and even improve performance while sustaining a manageable power budget. 
Specialized hardware such as graphics processing units, multicore systems and vector units are 
some of the hardware that has allowed the goal of improving performance while maintaining 
energy efficiency to be realized. These examples of specialized hardware are able to provide 
great benefits to applications that have computationally intensive algorithms. 
 
Algorithms such as video stabilization, object detection and 3D gaming, to name a few, are 
excellent candidates for making use of this hardware. Also, applications like these are just a few 
among the many computationally intensive applications found on mobile devices today. This 
work examines the effects of optimizations using some of the previously mentioned hardware on 
two different platforms. The first is an ARM based development board and the second an Intel 
based Ultrabook. Similar optimizations are applied to two computer vision applications. These 
optimizations are applied on two different levels. First, optimizations were made on a thread 
level and included utilizing vector units and manipulating control flow to more effectively use 
the cache. The second set of optimizations was made on a processor level and involved making 
use of the multiple cores on a chip with OpenMP and Threading Building Blocks.  
 
We based the performance of the platforms on three metrics: throughput, energy per frame and 
throughput per energy, a metric similar to that of the energy-delay product. After performing 
varying combinations of the optimizations, we ultimately found the Intel based Ultrabook to be 
the better choice of platform. On the more memory bound vision application, the best 
configuration on the Ultrabook had a throughput of almost 4x that of the ARM development 
board with 2x the energy efficiency. The results for the more compute bound application were 
closer, with the Ultrabook’s best configuration having a throughput of less than 3x that of the 
development board and only about 1.5x as energy efficient. 
 
 
 
 
 
 
 
 
 
Acknowledgments 
 
First and foremost, I would like to thank Professor Maria Jesus Garzaran for her patience and 
helpfulness in guiding me through the difficult process that is graduate research work. I would 
also like to thank Professor Josep Torrellas and Professor Sayan Mitra for being my advisors and 
allowing me the opportunity to pursue research that I found interesting and challenging. I want to 
thank all those who helped me in any way with this thesis, whether it was doing the ground work 
before I was attached to the project, helping me develop the code or reviewing this report as it 
developed. Finally, I would like to thank all of my family and friends for their love, support and 
encouragement during my time in graduate school. 
 
 
 
 
 
 
 
 
 
 
 
 
 
Table of Contents 
Chapter 1. Introduction
Chapter 2. Environmental Setup and Literature Review
2.1 ARM Platform
2.2 Intel Platform
2.3 TBB
2.4 OpenMP
2.5 OpenCL
2.6 Intel’s PCM
2.7 Odroid Energy Reading
2.8 Power Meter
2.9 Clock
2.10 Literature Review
Chapter 3. ViVid
3.1 ViVid Overview
3.2 Blockwise Distance
3.3 Histogram
3.4 Pairwise Distance
Chapter 4. SRAD
4.1 SRAD Overview
4.2 SRAD Motivation
4.3 SRAD Kernels
Chapter 5. Methodology
5.1 Serial Optimizations
5.2 Threadwise Optimizations
5.3 Other Points of Interest
Chapter 6. Results
6.1 Performance Metrics
6.2 Effects of Optimizations
Chapter 7. Conclusion
7.1 Serial Optimization Comparison
7.2 Threadwise Optimization Comparison
7.3 Platform Comparison
Chapter 8. Future Work
References
Appendix A. Throughput per Energy Tables
 
 
 
 
 
 
 
Chapter 1. Introduction 
 
In the past 15 years, there has been a significant shift in the types of processors developed and 
sold. Computers that sat at a workstation were common and required the user to bring the work 
to them. Consumers now want the computing capabilities of a desktop computer, but also want 
the compactness, mobility and convenience afforded by hand-held devices. With the demand for 
desktop computers waning, the industry has put greater emphasis on developing processors for 
the mobile era.  
 
A common tradeoff in the early development of processors was increased performance for 
decreased energy efficiency. Seeing as computers were immobile and had a constant power 
supply, this tradeoff was largely considered acceptable until two trends began to emerge.  
 
The first trend was more of a physical limitation of the hardware. With every generation of 
processors, about every 18 months to two years, the number of transistors on a chip doubled. 
This increase in the number of transistors and subsequent higher clock speeds brought with it a 
higher core temperature that threatened to damage the chip. Techniques were developed to detect 
and reduce the temperature of single core, complex, high clock speed chips. Despite these 
techniques, core temperatures continued to rise. To avoid these higher temperatures while 
maintaining the increased complexity and performance of the processors, more than one core 
was added to a chip, allowing for potentially the same or increased computing performance, as 
well as the reduction of clock speeds. With more cores to do the work, a single processor would 
not need to work as fast or, subsequently, as hot. 
The second trend was more economic. The advent of mobile phones, hand-held gaming 
devices, digital cameras, tablets, etc. urged industry to focus on chips that could not only pack a 
computational punch, but could do so for extended periods of time, while running solely off of a 
battery. To satisfy this “have your cake and eat it too” mindset of the consumers, engineers 
shifted their focus away from simply trying to decrease latency to also trying to improve energy 
efficiency. To accomplish this, one approach taken has been to add more heterogeneity to 
systems to address these two consumer demands. Multi-core architectures, graphics processing 
units, digital signal processing units and other similar accelerators have been used to provide 
more specialized hardware to both alleviate the workload on the cores as well as to use energy 
more efficiently. Using the extra transistors, chip designers have found ways to provide the 
compute power that consumers want, while not overexerting the battery of the device.  
 
In the days when processors focused more on increasing the clock speed, applications were 
largely written serially; decreasing latency was the main concern. However, today, throughput is 
also a metric that is considered when assessing the quality of applications. In today’s world of 
streaming media, being able to consistently process data is often just as important as how quickly 
an individual item can be processed. Typical data streaming applications for mobile devices 
often deal with media such as images, audio and video.  
 
The work presented in this report compares an ARM based development board with an Intel 
based Ultrabook. To compare them, we applied various optimizations to two applications that 
process video, ViVid and Speckle Reducing Anisotropic Diffusion (SRAD), and compared the 
performance results. We applied two main forms of optimizations. The first were those that 
could be made to code running with one thread on a single core, such as vectorization and 
memory tiling. The second utilized the underlying multicore architecture to achieve parallelism.  
 
We found that the Intel based Ultrabook performed better overall in both applications. 
We compared the results from multiple configurations with different levels of compiler 
optimizations and, aside from a few instances of the Odroid performing better in terms of energy 
efficiency, the Ultrabook was clearly the better overall device for processing these applications 
in terms of our metrics. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Chapter 2. Environmental Setup and Literature Review 
 
In this chapter, we discuss the environmental setup of this report. We begin by discussing the 
two platforms on which the experiments were conducted. We then discuss OpenMP and Threading 
Building Blocks, the infrastructures used to create parallelism over the cores of the platforms. 
We also discuss the tools used to take measurements while running the applications. Finally, a 
review of previous and related works is presented. 
 
2.1 ARM Platform 
The results in this report were obtained on two platforms. The first platform is an ODROID-XU 
development board hosting ARM’s big.LITTLE architecture. This platform has 4 “big” A15 
cores, 4 “LITTLE” A7 cores and an integrated GPU. For the remainder of this report, we will 
refer to this board as the Odroid. The specifics of the Odroid can be found in Table 1.  On the 
Odroid board, we used GNU’s GCC compiler, version 4.8.1-10ubuntu8 [1]. There were four 
important compiler flags that we used on the Odroid. The first three were “-mcpu=cortex-a15”,   
“-mfpu=neon” and  “-mfloat-abi=softfp”, which specify the name of the target ARM processor, 
the available floating point hardware and finally the floating-point application binary interface. 
These three flags were always included in each compilation on the Odroid. The fourth compiler 
flag specified the level of compiler optimization and was either “-O0” or “-O2”. An important 
consideration is that for the Odroid, execution is done on either the big or the LITTLE cores, but never 
on both simultaneously. This gives us a maximum of 4 threads (1 per core). All of the computation 
reported here will be done on the larger A15 cores since they are more similar in computing 
power and clock speed to our second platform.  
Table 1: Platform Specifications 
 
Platform                    Odroid                                    Ultrabook
Processor Type              Samsung Exynos5 Octa                      Intel Core i5-3337U
# of Cores                  4 x A15, 4 x A7 (only A15 used)           2
Base Clock Speed            1.6 GHz                                   1.8 GHz
L3 Cache size (shared)      none                                      3072 KB
L2 Cache size (shared)      2048 KB                                   2 x 256 KB
L1 Cache size (per core)    4 x 32 KB Instruction, 4 x 32 KB Data     2 x 32 KB Instruction, 2 x 32 KB Data
Lithography                 28 nm                                     22 nm
 
2.2 Intel Platform 
The second platform is an Intel based Ultrabook. It contains a dual-core Ivy Bridge Architecture 
also with an available GPU. This platform will be referred to as the Ultrabook from here on. The 
specifics of the Ultrabook can also be found in Table 1.  The compiler that was used on the 
Ultrabook was Intel’s ICC compiler, version 13.0.1.119 [2]. When examining the effects of the 
parallelization techniques, the -O0 compiler flag was used on both the Odroid and the Ultrabook, 
which disables the compiler optimizations for speed and code size. This was done in order to 
isolate the effects of the specific optimizations we made from those of the compiler. 
There were three important compiler flags 
used in this work. For specifying the level of compiler optimization, we used either /O0 or /O2. 
For specifying the type of vector instruction, we used either “/arch:IA32”, “/arch:SSE” or 
“/arch:AVX”. Finally, to specify the level of floating-point precision, we used the compiler flag 
“/fp:precise”. Note that while the Ultrabook only has two cores, it too has up to four total threads 
(two per core) available through hyperthreading. This allows us to compare the effects of having 
up to four threads on both the Ultrabook and the Odroid, although it is important to keep in mind 
that no hyperthreading is done on the Odroid. 
In the next two sections, we describe the two interfaces, OpenMP and Threading Building Blocks, 
which allowed us to run our applications on the underlying hardware. Both OpenMP and Threading 
Building Blocks were applied to both applications on both platforms. 
 
2.3 TBB 
Threading Building Blocks (TBB) is a C++ library developed by Intel to allow applications to take 
advantage of multi-core systems [3]. While TBB is developed at Intel, we used a version that 
worked on ARM based processors as well. TBB consists of functions and data structures to 
simplify and automate the creation, execution, and releasing of threads across different cores. 
Common parallel algorithms, such as scan, reduce, parallelizing for loops, as well as more 
complex algorithms such as pipeline creation and parallel sorts, are available using TBB.  
 
2.3.1 TBB Parallel_for 
TBB provides an algorithm for allowing the parallelization of loops called parallel_for. This 
allows the iterations of loops that have no inter-iteration dependencies to execute simultaneously on multiple 
threads and cores. For the remainder of this report, when we use TBB’s parallel_for algorithm, 
we will refer to it as TBB parallel_for or as simply parallel_for. In order to attain a more 
balanced workload and have better core utilization, parallel_for employs a task stealing 
approach. This approach is based on a “task graph”, which often resembles a binary tree and 
evolves over time. Each task, or node, spawns children tasks at the next level. As this graph is 
constructed and changed, each thread maintains a list of tasks that are ready to run, with each set 
of tasks corresponding to a different level in the task graph. Naturally, at each new level in the 
graph, the number of ready tasks at a node decreases. At execution time, each thread uses a Last-In 
First-Out strategy of deciding which tasks to complete next. As execution progresses, each 
thread pulls the groups of tasks from the level most recently added to their node in the task 
graph. Once a thread completes all of its tasks from this level in the task graph, the thread looks 
to other executing threads and attempts to “steal” tasks from this same level and execute them. 
Once all tasks from the current level have been executed, all threads begin execution on the next 
level of tasks yet to be completed.  
 
An important consideration is the size of the groups of tasks. There is a non-negligible amount of 
overhead that is involved in the creation and scheduling of threads. Thus, in order to attain the 
highest level of performance, a balance must be found between certain factors. If tasks are 
broken down into groups that are very small, the workload can be very well balanced among the 
available threads; however, the overhead incurred will significantly affect performance. On the 
other hand, if groups that are too large are used, the overhead will be minimized, but the 
workload can become very unbalanced also resulting in negative performance results.  
 
In many cases, the computation space and number of cores and threads are known a priori, 
possibly making the partitioning of tasks fairly straightforward. However, the actual duration of 
one of these tasks may not always be known beforehand due to unpredictable factors such as 
memory traffic, I/O with the real world and other such factors. This being the case, TBB 
provides a more dynamic approach to its task scheduling than merely dividing the workload 
evenly among the available threads. TBB provides three different partitioners: simple, auto and 
affinity. Alongside the partitioners is another parameter called grain size. The grain size gives 
the partitioner the absolute minimum size to make one of the task groups. The simple partitioner 
makes the most use of the grain size since, in accordance with its name, the available workload 
of tasks is recursively split until the group size reaches or is under the given grain size. The auto 
partitioner is more complex than the simple. This partitioner focuses more on balancing load, but 
will not necessarily continue splitting until the provided grain size is reached depending on the 
overall amount of tasks completed. The affinity partitioner adds even more complexity than auto. 
This partitioner is similar to auto but its main goal is to improve cache affinity by more 
intelligently assigning ranges of tasks to threads.  For all parallel_for and parallel_reduce 
instances in this report, the affinity partitioner is used.  
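
The interaction between the grain size and the simple partitioner can be illustrated without TBB itself. The following is a minimal sketch in plain C++, not TBB's implementation: the names Range and split_simple are hypothetical, and the rule shown (halve a range until it holds no more than the grain size) is a simplification of the behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical half-open iteration range [begin, end); not a TBB type.
struct Range {
    std::size_t begin;
    std::size_t end;
    std::size_t size() const { return end - begin; }
};

// Recursively halve a range until each piece holds at most `grain`
// iterations, mimicking the splitting rule described for the simple
// partitioner. The resulting pieces are appended to `out` in order.
void split_simple(Range r, std::size_t grain, std::vector<Range>& out) {
    assert(grain > 0);
    if (r.size() <= grain) {
        out.push_back(r);  // small enough: this becomes one task group
        return;
    }
    const std::size_t mid = r.begin + r.size() / 2;
    split_simple(Range{r.begin, mid}, grain, out);
    split_simple(Range{mid, r.end}, grain, out);
}
```

With 100 iterations and a grain size of 30, this sketch yields four groups of 25 iterations each. A smaller grain size yields more, smaller groups, which improves balance at the cost of more scheduling overhead.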
 
2.3.2 TBB Parallel_reduce 
TBB’s parallel_for is a useful tool for when there are no dependencies between loop iterations or side 
effects. However, not all loops in this report meet these criteria. One of the loops is a reduction 
of an array of variables. This loop does not readily translate into the use of a parallel_for. 
Fortunately, TBB provides an algorithm called parallel_reduce that can compute the necessary 
reductions without allowing for data races. While convenient, parallel_reduce does require 
additional code in order for the reduction to be realized.  
 
The task partitioning done in a reduction is similar to that done in a parallel_for. The same 
partitioners and grain size parameters are available to the programmer. The total tasks are broken 
down into a treelike structure until the total number of tasks in a group is at the limit specified by 
the partitioner. In addition to this, a C-style struct must be defined with two main helper 
functions, operator() and join. The operator() function defines the reduction on a range of tasks that 
meets the task group size determined by the partitioner. In this range, one thread is performing 
the computation and there are no data races. Once the thread finishes, the join function is called. 
This function performs the reduction across task groups and is always done with the reduced task 
group from which it was originally split. This same pattern is recursively executed up the task 
graph until the last two task groups are combined into one and the reduction finishes. 
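
The operator/join pattern described above can be sketched in plain C++. This is an illustration under stated assumptions rather than the code used in this work: SumBody and reduce_tree are hypothetical names, and reduce_tree is a serial stand-in for the scheduler. In TBB itself the body also needs a splitting constructor, and its operator() receives a range object instead of two indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reduction body modeled on the operator()/join pattern.
struct SumBody {
    const std::vector<int>* data;
    long sum;

    explicit SumBody(const std::vector<int>& d) : data(&d), sum(0) {}

    // Reduce one task group [begin, end). A single thread runs this on a
    // given body, so no synchronization is needed here.
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) sum += (*data)[i];
    }

    // Fold the partial result of the sibling task group into this one.
    void join(const SumBody& rhs) { sum += rhs.sum; }
};

// Serial stand-in for the scheduler: split down to `grain`, reduce the
// leaves, then join partial results pairwise on the way back up the tree.
void reduce_tree(SumBody& body, std::size_t begin, std::size_t end,
                 std::size_t grain) {
    assert(grain > 0 && begin <= end);
    if (end - begin <= grain) {
        body(begin, end);
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    SumBody right(*body.data);  // fresh body for the split-off half
    reduce_tree(body, begin, mid, grain);
    reduce_tree(right, mid, end, grain);
    body.join(right);  // combine with the group it was split from
}
```

Because each leaf accumulates into its own body and partial results meet only in join, no two threads would ever write to the same accumulator, which is what rules out data races in the parallel version.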
 
2.3.3 TBB Pipeline 
TBB also provides functionality for creating a pipeline. In this pipeline, filters are created which 
act as stages in the pipeline. In our application, these filters correspond to the various kernels. 
These pipeline filters maintain a strict ordering. The pipeline will assign a frame to a particular 
thread, which will, in turn, execute all filters associated with that pipeline without the assistance 
of any other thread. Thus there can be as many elements being processed as there are available 
threads. One of the main benefits of the TBB pipeline is the smaller amount of synchronization 
that is necessary. Since each thread is computing its own work element, and since each work 
element is independent, no thread is required to wait for any other threads to complete the 
processing of their element before moving on. However, when simply parallelizing a loop with 
parallel_for or with OpenMP, between each filter, all threads must wait for the slowest thread to 
reach the barrier before proceeding. The pipeline’s independence is due to the fact that, in both 
applications chosen for this work, frames can be produced independently. However, the stages 
taken to compute a frame must be sequential. A potential drawback of using a pipeline is that 
since threads in a TBB pipeline are not forced to synchronize with other threads, they will put a 
strain on resources such as functional units and memory since each thread will want as many 
resources as needed to progress in computation. This has the possibility, if memory reuse is low 
in a filter, of putting much higher stress on the memory system and can thus produce a 
bottleneck. For the remainder of this report, TBB’s pipeline will be referred to as either the TBB 
pipeline or simply as the pipeline. 
 
2.4 OpenMP 
Open Multi-Processing (OpenMP) is an API maintained by the OpenMP Architecture Review Board that supports 
multi-core processing [4]. Like TBB, OpenMP was used on both the Odroid and the Ultrabook. 
The features of OpenMP that this report uses consist mostly of compiler directives. OpenMP 
incorporates the notion of fork-join to achieve its parallelism by having a master thread fork a 
specific number of slave threads upon reaching the appropriate places in the code.  
 
While TBB abstracts away the notion of being able to interact with specific threads, OpenMP 
allows threads to have a unique identification number that can allow for more specifically tuned 
execution of code.  
 
The general form for an OpenMP directive in C/C++ is as follows: 
 
#pragma omp construct [clause [clause]…] 
 
where construct is substituted with one or more of a list of possible constructs and clause is 
substituted with zero or more possible clauses. Following the OpenMP directive is a code block 
consisting of either a single line of code or a section of code enclosed by curly braces. The 
construct and subsequent clauses are only applied in the single code block following the 
OpenMP directive. There are three constructs and four clauses that were used in this report in 
order to properly achieve accurate results and performance. These constructs are parallel, for and 
critical. The clauses are private, firstprivate, reduction and nowait. They are discussed in more 
detail below. 
 
2.4.1 OpenMP Parallel 
The constructs parallel and for are closely related and are, in fact, often used in the same 
directive to improve code clarity. Our implementation uses them both together and separately. 
Whenever a thread encounters the parallel construct, a group of threads is created to perform the 
parallel section. The number of threads created can be determined by a clause in that directive. 
Environment variables can also be used to determine specifics such as the number of available 
threads. Our implementation does not use this approach but rather uses the function 
omp_set_num_threads( int ) to specify the number of threads that are to be used when setting up 
the experiments.  
 
2.4.2 OpenMP For 
The for construct allows parallelization of loops across multiple threads in C/C++.  The directive 
containing the for construct must be immediately followed by a for loop; otherwise a compilation 
error will be thrown. The number of iterations must be known by the time the for loop begins 
execution, but not necessarily at compile time. Before execution of the for loop begins, the total 
number of iterations of the for loop is divided amongst the available threads. Our approach does 
not do any load balancing as the execution progresses, but rather simply statically divides the 
work between the threads. For the remainder of this report, when an OpenMP parallel for is used, 
we will refer to it as either an OpenMP parallel for or as simply OpenMP. For any other use of 
OpenMP’s parallel or for constructs, we will refer to them by their appropriate names. 
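
As an illustration of a statically divided loop, the sketch below applies a combined parallel for directive to a loop with independent iterations. It is an assumption-laden example, not code from the applications: scale_frame and static_chunk are hypothetical names, and the explicit schedule(static) clause is our choice for making the static division visible. The helper shows one possible even division of iterations; the exact split an OpenMP runtime chooses is implementation defined.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Scale every element of a frame. Iterations are independent, so the
// loop can be divided among the threads.
void scale_frame(std::vector<float>& frame, float gain) {
    const long n = static_cast<long>(frame.size());
    // schedule(static): iterations are divided once, before the loop runs,
    // with no load balancing during execution.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        frame[i] *= gain;
    }
}

// Hypothetical helper: one even way to assign n iterations to `threads`
// threads. Thread t receives the half-open range [begin, end).
void static_chunk(long n, int threads, int t, long& begin, long& end) {
    assert(threads > 0 && t >= 0 && t < threads);
    const long base = n / threads;  // iterations every thread receives
    const long rem = n % threads;   // first `rem` threads receive one extra
    begin = t * base + std::min<long>(t, rem);
    end = begin + base + (t < rem ? 1 : 0);
}
```

For 10 iterations on 4 threads, this helper assigns ranges of 3, 3, 2 and 2 iterations. If the code is compiled without OpenMP support, the directive is ignored and the loop simply runs serially with the same result.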
 
2.4.3 OpenMP Critical 
The critical construct is a synchronization construct in OpenMP that serializes the execution of 
the code within the block following the directive. Before entering a critical section, a thread must 
check whether another thread is currently executing inside that section. If another thread is 
currently executing, then the thread attempting to enter is blocked until the executing thread 
leaves the critical section. Due to some restrictions when using a similar construct called atomic, 
we chose to use the critical construct, even though the atomic construct often incurs less 
overhead when executing. 
 
2.4.4 OpenMP private and firstprivate 
The private clause provides each task with a local copy of each variable in a given list. Each 
variable in this list must be a variable that has been declared previously. The lifetime of this 
private variable lasts until the thread exits the code block in which the private variable is created, 
at which time the original variable will once again be used. This new variable is created as if it 
had been declared normally without any initialization, meaning that its contents cannot be 
guaranteed to hold any specific value. The firstprivate clause is similar to the private clause in 
that a local variable is created that is visible only to one thread. firstprivate differs in that private 
variables are assumed to be uninitialized whereas firstprivate variables retain their original value 
upon entering the code block. The firstprivate clause was only used when using a vector of 
constants in a loop that we did not want to have to reinitialize multiple times. firstprivate was not 
a necessary component for correctness, but rather used to improve performance. 
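
The difference between the two clauses can be seen in a short sketch. The function apply_coeffs and its variables are hypothetical and are not taken from the applications; the example only assumes a small array of constants that is set up once before the loop, as described above.

```cpp
#include <cassert>
#include <vector>

// Compute out[i] = coeff[0] * in[i] + coeff[1] for every element.
void apply_coeffs(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    const long n = static_cast<long>(in.size());
    float coeff[2] = {2.0f, 1.0f};  // constants initialized once, before the loop
    float tmp;                      // scratch value with no meaningful initial value

    // tmp is private: each thread gets its own uninitialized copy.
    // coeff is firstprivate: each thread's copy starts with the values above.
    #pragma omp parallel for private(tmp) firstprivate(coeff)
    for (long i = 0; i < n; ++i) {
        tmp = coeff[0] * in[i];
        out[i] = tmp + coeff[1];
    }
}
```

Here tmp is always written before it is read inside the loop, so an uninitialized private copy is sufficient, whereas coeff must carry its values into the parallel region.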
2.4.5 OpenMP reduction 
OpenMP provides a clause to perform a reduction in a code block with a parallel or for 
construct. The reduction clause follows the following form: 
 
#pragma omp construct reduction ( operator : variable-list) 
 
where construct may be omitted if the directive is already within another OpenMP code block. If 
there is no construct specified and the directive is not within another OpenMP code block, the 
code block in the reduction section will be performed serially on one core. The variable-list is to 
be composed of one or more expressions with scalar type and the operator is to be chosen from a 
predefined set of non-overloaded operators. OpenMP creates a private copy of each of the 
variables in the variable-list and executes the code in the code block. Once the parallel region 
ends, a reduction is done over each thread’s local copy of the variables with the given operator. 
The version of OpenMP that we used did not support SIMD instructions. Despite this, there were 
several ways that a reduction could have been realized using SIMD instructions, and we chose 
the approach that showed the best performance benefits, which is discussed later in this report. 
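
A scalar reduction with this clause might look like the following minimal sketch. The function mean_pixel is a hypothetical example and does not use SIMD instructions; it only shows the clause combining per-thread partial sums.

```cpp
#include <cassert>
#include <vector>

// Compute the mean of all elements of a non-empty frame.
double mean_pixel(const std::vector<int>& frame) {
    assert(!frame.empty());
    const long n = static_cast<long>(frame.size());
    long total = 0;
    // Each thread accumulates into its own private copy of total, which
    // starts at 0 for the + operator. The copies are combined with + when
    // the loop ends.
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += frame[i];
    }
    return static_cast<double>(total) / static_cast<double>(n);
}
```

An integer accumulator is used here so that the result does not depend on the order in which the partial sums are combined; with floating-point accumulators, the rounding can vary with the number of threads.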
 
2.4.6 OpenMP nowait 
The nowait clause is used only with the four OpenMP work-sharing constructs, for being one of 
them. The nowait clause removes the implicit barrier that blocks threads at the end of a 
work-sharing region such as a for loop. nowait allows threads to continue execution as soon as they reach the end of 
the code block. In our experiments, nowait was used in code blocks that were immediately 
followed by critical sections. This was done because the critical section already included a 
synchronization component. So rather than blocking all threads only to have the majority wait to 
enter a critical section, nowait was used to allow threads to attempt to enter the critical section as 
soon as possible. 
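
The combination of nowait and a critical section can be sketched as follows. This is a hypothetical example, assuming a histogram in which each thread first builds a partial result and then merges it into a shared one; the function name histogram and its parameters are ours and are not taken from the applications.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count how many entries fall into each bin. Every value in `bin_of` is
// assumed to lie in the range [0, num_bins).
std::vector<int> histogram(const std::vector<int>& bin_of, int num_bins) {
    assert(num_bins > 0);
    std::vector<int> hist(num_bins, 0);
    const long n = static_cast<long>(bin_of.size());

    #pragma omp parallel
    {
        std::vector<int> local(num_bins, 0);  // per-thread partial histogram

        // nowait: a thread that finishes its share of the loop does not
        // wait at the end of the loop for the other threads.
        #pragma omp for nowait
        for (long i = 0; i < n; ++i) {
            ++local[bin_of[i]];
        }

        // Only one thread at a time merges its partial result.
        #pragma omp critical
        {
            for (int b = 0; b < num_bins; ++b) hist[b] += local[b];
        }
    }
    return hist;
}
```

Dropping the barrier is safe here because each thread only merges its own partial histogram, which is complete as soon as that thread leaves the loop.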
 
2.5 OpenCL 
Open Computing Language (OpenCL) is a framework for programming systems that include 
multiple compute units such as multicore CPUs, GPUs, DSPs and other accelerators [5]. This 
report explores various parallelization techniques for multicore processors, but the applications 
presented here could also find performance benefits from using other accelerators, namely GPUs. 
Totoni et al. [6] explored various parallelizing techniques for ViVid using both the multicore 
CPU and the GPU on the Ultrabook platform. Unfortunately, the necessary drivers were not 
available on the Odroid to provide us with access to programming the GPU. Thus, exploring the 
performance gains by utilizing the GPU for SRAD on the Ultrabook and for both ViVid and 
SRAD on the Odroid is left as future work. 
 
The following three sections deal with techniques that we used to measure the energy consumed 
by both platforms. Section 2.6 discusses the technique we used to measure energy on the 
Ultrabook and Section 2.7 relates how we measured energy on the Odroid. Section 2.8 discusses 
a method that was used on both platforms to double check the measurements recorded from 
sections 2.6 and 2.7. 
 
2.6 Intel’s PCM 
In order to better measure and improve performance, many of Intel’s processors have been fitted 
with performance monitoring units (PMUs), which dynamically obtain data during a program’s 
execution. Intel has also developed a library of functions, datatypes and algorithms entitled 
Performance Counter Monitor (PCM), which interacts with the PMUs on a chip [7]. PCM 
provides an interface that enables the user to measure many specifics about the performance of 
the chip including L2 and L3 cache hit ratios, number of instructions retired over a given amount 
of time, and energy consumed by the processor, memory and total chip, to name a few. In order 
to report these measurements, the following scheme is used. PCM makes a function call, which 
retrieves the data accumulated in the PMUs at a specific point in time and then stores this data into 
various structs. After the data has been stored, execution of a program resumes for a time and the 
PMUs continue to gather data. At some time in the future, the data is read again from the PMUs 
and stored into a new set of structs. At this point, given both the before and after readings, 
functions provided by PCM can supply the user with the requested data. The overhead of reading 
the PMUs is negligible: reading them both before and after takes 80 µs on average. 
 
2.7 Odroid Energy Reading 
The Odroid board hosts several sensors that we used to measure certain performance factors. 
Four main components of the Odroid had sensors for them, namely the four A15 cores, the four 
A7 cores, the GPU and the memory system. Sensors for each of these areas of the chip could 
provide readings for instantaneous voltage, current and power. Other sensors were available for 
measurements such as fan speed, current clock frequency per group of cores, clock min and max 
frequencies, core temperatures and more. In order to take these readings, the Linux system call 
“open” was applied to a specific path to give us the file descriptor of the sensor [8]. Reading from that 
descriptor would simply return the instantaneous reading for that sensor as a floating-point 
number. This value would then be multiplied by the time that passed since the last sensor reading 
so we could convert to energy. This value was then added to the running total for energy. We 
wrote an energy_meter library that would provide us with the functionality that we needed to 
measure energy in the system. This library gave us the ability to take readings from the power 
sensors and provide the approximated total energy consumed. An important point to note is that 
the sensors only provided instantaneous power. Thus, in order to produce an exact calculation of 
the energy consumed, these sensors would theoretically need to be sampled constantly. This 
however was not feasible, especially since the sensors were only refreshed periodically. The 
approach that we used was to sample the sensors at a rate of 10Hz. This sampling involved a 
system call from within the C/C++ code. Reading the power sensors took 
20µs on average.  
 
2.8 Power Meter 
As a sanity check against the power readings obtained from PCM and the Odroid’s power 
sensors, we used an external power meter, which we connected between the Odroid or 
Ultrabook and the wall outlet. The specific meter that we used was the WattsUp? PRO from 
WattsUpMeters. We recorded the idle power used for both the Odroid and Ultrabook and 
compared them against the power used when the applications were running. The results 
measured on the Ultrabook were on average within 10% of the expected power based on the 
readings from PCM. The results measured on the Odroid were on average within 35%. We 
attributed the extra power to cooling systems, memory controllers and other such 
peripherals that were not powered on during the idle measurements. 
2.9 Clock 
When dealing with the duration of certain events in our experiments, we had two main 
approaches. The first was for measuring time so that we could periodically sample the sensors on 
the Odroid at a consistent frequency. For this, we used the C function usleep(int), which made 
the background thread that we used for reading the sensors sleep for 0.1 seconds before waking 
up and sampling the sensors [9]. The second approach was used to actually measure the duration 
of events. For this, we used the tick_count class provided by TBB. Measuring time with 
this approach is straightforward: the tick_count class has a function now(), 
which allows for the computation of wall-clock times. Much like the other measurement features 
mentioned before, a tick_count timestamp was recorded both before and after an event and the 
difference between them was recorded as the duration of that event. 
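
The before-and-after timestamp pattern can be sketched as follows. To keep the sketch free of external dependencies, it uses `std::chrono::steady_clock` from the C++ standard library as a stand-in for TBB's tick_count; the helper name `time_event_seconds` is hypothetical.

```cpp
#include <chrono>

// Measures the wall-clock duration of an event in seconds. steady_clock is
// used here as a stand-in for tbb::tick_count so that the sketch needs no
// external library; the helper name is hypothetical.
template <typename Event>
double time_event_seconds(Event&& event) {
    const auto before = std::chrono::steady_clock::now();  // timestamp before
    event();                                               // the timed event
    const auto after = std::chrono::steady_clock::now();   // timestamp after
    return std::chrono::duration<double>(after - before).count();
}
```

With TBB, the same pattern would record `tbb::tick_count::now()` before and after the event and take `(after - before).seconds()` as the duration.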
 
2.10 Literature Review 
In this section, we will discuss similar work that uses SIMD and/or OpenMP and TBB to 
parallelize applications to compare performance. We will also discuss two works that address the 
possibility of mobile System-on-Chips being added to the High Performance Computing (HPC) domain and 
possibly even replacing current HPC processors. 
 
This work is an extension of [6]. In this work, Totoni et al. did the groundwork in optimizing the 
ViVid code on the Ultrabook. They make use of the on-chip GPU as another option for 
offloading the processing of different kernels. Although modifications were made to the existing 
code base, [6] provided the starting point for the remainder of the work reported below. 
 
Many tools have been developed to provide programmers access to underlying multicore 
hardware. Sanchez et al. [10] perform a comparative study of the effects of various 
infrastructures for accelerating applications over shared-memory parallel architectures. 
Michailidis and Margaritis [11] also compare tools, such as TBB and OpenMP, by implementing 
standard scientific algorithms such as matrix-vector multiplication and Gaussian Elimination and 
then analyzing the performance. 
 
There are numerous possible applications for using SIMD operations to improve performance 
and energy efficiency. Lacassagne et al. [12] perform optimizations similar to those 
found in this work, using SIMD instructions to optimize a corner detection 
algorithm on a 512-bit SIMD Xeon Phi. Mitra et al. [13] compare the results of 
hand-tuned and compiler auto-vectorized NEON and SSE2 code executing Gaussian 
Blur, Sobel filter, edge detection and other kernels. Wan et al. [14] use ARM’s NEON to accelerate 
AVS video decoding. Heinecke et al. [15] use SIMD operations and Intel’s Array 
Building Blocks to accelerate data mining from sparse graphs. 
 
Finally, Rajovic et al. [16] experiment with the notion that mobile processors, similar to those 
used in this report, could execute HPC applications with high energy efficiency. In [17], Rajovic 
et al. extend this work. They do so by noting the shrinking gap in performance between mobile 
processors and x86 processors used in current HPC applications. They liken this trend to the one 
that occurred in the 1990s, where x86 processors gradually replaced vector architectures in HPC 
as the performance disparity between them shrank.  
 
Chapter 3. ViVid 
 
In this chapter, we discuss the application ViVid, the first of the two vision applications used in 
this report. We first give an overview of the application followed by a breakdown of the 
algorithms of the three kernels.  
 
We chose vision applications as the basis for our experiments because, in general, they possess 
characteristics that are interesting to explore. The large amount of computation as well as data 
parallelism displays the performance capabilities of the platforms. Also, many vision 
applications, ViVid and SRAD included, are algorithms that consist of a series of filters applied 
to images. These filters have various dependencies that afforded us interesting opportunities for 
applying OpenMP and TBB. We will refer to these filters as stages or kernels from here on. 
 
3.1 ViVid Overview 
ViVid is an object detection algorithm that employs a “sliding window” technique to detect the 
existence of objects with high probability. As the name suggests, a window of a certain size is 
overlaid onto the image. Calculations are made to determine whether or not a desired object lies 
within that window and then the window will shift or “slide” to the next location in the image. 
This process is repeated until the given window has been placed in every possible location on the 
image, yielding the detection of each occurrence of the object in the image. An important 
characteristic of ViVid is that the algorithm consists of a relatively simple series of stages. ViVid 
consists of only three different kernels, each of which must be executed for every frame and 
must be executed in sequence. These three kernels are the Filter kernel, the Histogram kernel and the 
Classifier kernel. When processing multiple frames, say in a video sequence, each frame is 
independent of all others and thus processing of frames can be started and finished in any order 
with respect to other frames. Note that this is not always the case with vision applications, but 
rather is simply a characteristic of ViVid. We also added two additional stages that were not 
computationally intensive and will only be discussed briefly here. The first stage was added at 
the beginning of the series of kernels. This kernel handles the creation of the Item object, 
managing which frame is being processed and other bookkeeping information. The second 
kernel that we added was put at the end of the series and handled the freeing of all allocated 
memory and deleting the Item object instantiated in the first kernel. This second kernel is also 
responsible for reorganizing the frames. While the frames may be processed out of order, in 
order to be played back for the user in a discernible manner, they must be reorganized back into 
their original sequence. 
 
We felt that vision applications had certain characteristics that made them attractive options for 
applying and analyzing these various parallelism techniques. The high computational intensity of 
the workload, the multi-staged computation phases, the independence of frames and of elements 
within a frame were all features that allowed us the flexibility to parallelize without having to be 
concerned with sequential inconsistencies. 
 
In terms of workload, the middle kernel, the Histogram, is relatively lightweight in both time 
and computation. Because few of our optimizations apply to it, 
the analysis of this stage is not an interesting one. The remaining two kernels, Filter and 
Classifier, are relatively similar in terms of computational load; thus, unlike other vision 
applications, no one stage is a major bottleneck in the application. With a small number 
of relatively balanced stages, we feel that ViVid is a good candidate as a simpler benchmark for 
analysis. 
 
3.2 Blockwise Distance 
The first kernel in the ViVid application, Filter, applies the sliding window technique discussed 
above. The input to this kernel is the image data and a set of 100 3x3 pixel filters. The algorithm 
for this kernel is found in Algorithm 1. Each of the filters is applied to a pixel by overlaying the 
filter on top of the image, multiplying the corresponding elements, and summing up the products. 
This process is repeated for each of the 100 filters and the result for this pixel is the index of the 
filter with the highest total weight and the total weight itself. This same process is repeated for 
each pixel in the image. To simplify the calculation and the code, this process was applied only 
to pixels where every filter element would have a corresponding image pixel when the filter was 
overlaid, thus some of the edge pixels in the image are not included in the computation. 
 
for each 3x3 image section centered around the desired pixel do 
 max_result = -INF; 
 max_index = -1; 
 for each of the 100 filters do 
  result = 0; 
  for each of the 9 elements i in the filter do  
   result += filter[i] * image_section[i]; 
  end 
  if result > max_result then 
   max_result = result; 
   max_index = index of current filter; 
  end 
 end 
 store max_index and max_result for this pixel; 
end  
Algorithm 1: Filter Kernel 
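
As an illustration, the scalar Filter kernel can be written in C++ as below. This is a minimal sketch under stated assumptions, not the ViVid source: the names `FilterResult` and `filter_kernel`, the row-major image layout, and the storage of each filter as nine consecutive floats are all assumptions made for the example.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Result for one pixel: index of the best filter and its response.
struct FilterResult {
    int index;
    float weight;
};

// Applies num_filters 3x3 filters to every interior pixel of a row-major
// image. filters holds 9 consecutive floats per filter. Edge pixels are
// skipped, so the output has (height-2) x (width-2) entries.
std::vector<FilterResult> filter_kernel(const std::vector<float>& image,
                                        int height, int width,
                                        const std::vector<float>& filters,
                                        int num_filters) {
    std::vector<FilterResult> out;
    if (height < 3 || width < 3) return out;
    out.reserve(static_cast<std::size_t>(height - 2) * (width - 2));
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            // The running maximum is reset for every pixel.
            FilterResult best{-1, -std::numeric_limits<float>::infinity()};
            for (int f = 0; f < num_filters; ++f) {
                float result = 0.0f;
                for (int i = 0; i < 9; ++i) {
                    const int iy = y + i / 3 - 1;  // row of window element i
                    const int ix = x + i % 3 - 1;  // column of window element i
                    result += filters[f * 9 + i] * image[iy * width + ix];
                }
                if (result > best.weight) best = {f, result};
            }
            out.push_back(best);
        }
    }
    return out;
}
```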
In [6], this filter was optimized for the Ultrabook. We used similar techniques when optimizing 
this filter on the Odroid. The optimization techniques that we applied will be discussed in the 
Methodology chapter. 
 
3.3 Histogram 
The second kernel will be referred to as the Histogram kernel. In this kernel, a histogram of size 
100 is computed, one entry for each filter. The Histogram kernel iterates over the output from the 
Filter kernel and builds a histogram using the filter index array. This histogram, however, differs 
slightly from the standard histogram in that an overall count of occurrences is not kept within the 
histogram entries; rather, the sum of the weights is kept for filters that ended up with the 
maximum weight value. Because consecutive entries in the filter index array could 
reference non-sequential elements of the histogram, no vectorization was used in the parallelizing of 
this kernel. Only threadwise optimizations were made, and they will be discussed in the 
Methodology chapter. 
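
The weighted histogram can be sketched in C++ as follows. The function name `histogram_kernel` and the use of two parallel arrays for the Filter kernel output are assumptions for illustration, and any partitioning of the image into separate windows is ignored here.

```cpp
#include <cstddef>
#include <vector>

// Builds a weighted histogram: bin f accumulates the winning weights of all
// pixels whose best-matching filter was f, rather than counting occurrences.
std::vector<float> histogram_kernel(const std::vector<int>& filter_index,
                                    const std::vector<float>& weight,
                                    int num_filters) {
    std::vector<float> hist(num_filters, 0.0f);
    for (std::size_t p = 0; p < filter_index.size(); ++p) {
        // Consecutive pixels may hit arbitrary bins, so this update is
        // data-dependent and does not map to a contiguous vector store.
        hist[filter_index[p]] += weight[p];
    }
    return hist;
}
```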
 
3.4 Pairwise Distance 
The third and final kernel of ViVid is the Classifier kernel. This kernel calculates the Euclidean 
distance between the histogram results and the weighted results of the template for the object we 
are looking for. The pattern of the kernel iterating across the data is the same as that for a 
standard matrix multiplication. Unlike a matrix multiplication, instead of computing the dot 
product between two vectors, the Classifier uses the formula in Equation 1 for finding the 
distance between two vectors: 
$$d(p, q)^2 = d(q, p)^2 = \sum_{i=1}^{n} (p_i - q_i)^2 \qquad (1)$$
Note that the result is actually the distance squared, not the actual distance. This simplification 
was used to reduce unnecessary computation. If the distance computed is within a given 
threshold, it outputs a response asserting that the object has been detected within that region. 
 
The non-vectorized algorithm for the Classifier is given in Algorithm 2 and gives an example for 
performing the Classifier kernel on a JxK matrix and a KxL matrix to compute a JxL matrix 
result. 
 
for j is 0 to J-1 do 
 for l is 0 to L-1 do 
  sum = 0; 
  for k is 0 to K-1 do 
   diff = A[j,k]-B[k,l]; 
   sum += diff * diff; 
  end 
  C[j,l] = sum; 
 end 
end 
Algorithm 2: Classifier Kernel 
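
A direct C++ rendering of Algorithm 2 is sketched below. The function name `classifier_kernel` and the row-major storage of A, B and C in flat vectors are assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Computes C[j][l] = sum over k of (A[j][k] - B[k][l])^2 for row-major
// A (J x K), B (K x L) and C (J x L). The result is the squared distance.
std::vector<float> classifier_kernel(const std::vector<float>& A,
                                     const std::vector<float>& B,
                                     int J, int K, int L) {
    std::vector<float> C(static_cast<std::size_t>(J) * L, 0.0f);
    for (int j = 0; j < J; ++j) {
        for (int l = 0; l < L; ++l) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {
                const float diff = A[j * K + k] - B[k * L + l];
                sum += diff * diff;
            }
            C[j * L + l] = sum;
        }
    }
    return C;
}
```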
 
 
 
 
 
 
 
 
 
Chapter 4. SRAD 
 
This chapter discusses the SRAD application. We begin our discussion with an overview of the 
algorithm as well as our motivation for choosing it as one of our applications. We then discuss 
the functionality implemented in each of the six kernels. 
 
4.1 SRAD Overview 
The second benchmark that we used was another image processing application called Speckle 
Reducing Anisotropic Diffusion (SRAD), and is one of the benchmarks from the Rodinia suite 
developed at the University of Virginia. SRAD is designed for removing noise from images 
using a series of filter stages while maintaining important image features. In total, there are six 
kernels, each of which operates on the entire image and will be discussed in further detail below. 
Similar to ViVid, we also added two extra kernels, one at each end of the original six, to handle 
some of the bookkeeping and reorganizing of frames. 
 
4.2 SRAD Motivation 
In ViVid, the data loaded from memory had the potential to be used multiple times before 
results were finally computed and additional data needed to be loaded from memory, making 
the ViVid application largely compute bound. In order to demonstrate the effect of 
the various parallelizing optimizations on a different kind of workload, we chose SRAD in part because this application has little 
data reuse and therefore is much more memory bound. In fact, as will be shown below, some 
kernels do not even use the data loaded from memory for computation but rather load and directly 
store it elsewhere in memory so that additional copies are available in future stages. This puts 
much greater pressure on the memory system since there is little to no time when the memory is not 
being fully stressed. Another motivating factor was that the SRAD computation pipeline has twice as many stages 
as ViVid, and we wanted to see the effect of this longer pipeline. 
 
4.3 SRAD Kernels 
Not including the additional input and output stages, the SRAD implementation that we used had 
a total of six stages. Each stage operated on the entire image and had to be executed in sequence. 
However, individual frames could be processed independently, which allowed us to use the 
pipelining feature offered by TBB. As stated before, SRAD is a more memory bound application 
and thus, each of the stages consists of three or fewer nested for loops with generally only a few 
lines of computation being done in the innermost loop. 
 
4.3.1 Kernels 1 and 6 
These two kernels are very similar in structure. They iterate over each of the pixels in the image 
and perform two operations on them. Kernel 1 takes an array of floats representing each pixel in 
the image and performs a division and a natural exponentiation. Kernel 6 takes a similarly sized 
array and finds the natural log of each element and also performs a multiplication. Kernels 1 and 6 
will be referred to from now on as the Extraction kernel and Compression kernel, respectively. 
 
4.3.2 Kernel 2 
The second kernel displays the memory intensity of SRAD. The algorithm for the non-vectorized 
version of the code is found in Algorithm 3. This kernel copies each pixel in the original image 
into a new array called sums. This kernel also populates a second array called sums2, which is 
the result of the square of each pixel from the original image. Kernel 2 will be referred to from 
now on as the Preparation kernel. 
 
for i is 0 to NumberOfImageElements-1 do 
 sums[i] = image[i]; 
 sums2[i] = image[i] * image[i]; 
end 
Algorithm 3: Preparation Kernel 
 
4.3.3 Kernel 3 
The third kernel is also very simple in structure. It simply performs a reduction of both the sums 
array and the sums2 array that were created in the Preparation kernel. The results of the 
reductions are stored in the variables total and total2. With a naïve serial implementation, the 
body of the for loop would contain 2 memory reads, 2 memory writes and 2 additions. Kernel 3 
will now be referred to as the Reduction kernel. 
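
The Preparation and Reduction kernels together can be sketched in C++ as follows. The function names and the `Totals` struct are hypothetical; the array and variable names sums, sums2, total and total2 follow the text.

```cpp
#include <cstddef>
#include <vector>

struct Totals {
    float total;   // sum of all pixels
    float total2;  // sum of all squared pixels
};

// Preparation kernel: copy each pixel into sums and its square into sums2.
void preparation_kernel(const std::vector<float>& image,
                        std::vector<float>& sums,
                        std::vector<float>& sums2) {
    sums.resize(image.size());
    sums2.resize(image.size());
    for (std::size_t i = 0; i < image.size(); ++i) {
        sums[i] = image[i];
        sums2[i] = image[i] * image[i];
    }
}

// Reduction kernel: accumulate both arrays into scalar totals. Each
// iteration reads one element from each array and performs two additions.
Totals reduction_kernel(const std::vector<float>& sums,
                        const std::vector<float>& sums2) {
    Totals t{0.0f, 0.0f};
    for (std::size_t i = 0; i < sums.size(); ++i) {
        t.total += sums[i];
        t.total2 += sums2[i];
    }
    return t;
}
```

Both loops do very little arithmetic per element loaded, which is consistent with the memory-bound behavior described above.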
 
4.3.4 Kernels 4 and 5 
Kernels 4 and 5 are the most computationally intensive of the six kernels. Kernel 4 uses 
differential equations to find the diffusion coefficients of each image element and stores them in 
an array. Kernel 5 then updates the original image using the array of diffusion 
coefficients and the directional derivative in each cardinal direction. From now on we will refer 
to Kernels 4 and 5 as Computation Kernel 1 and Computation Kernel 2, respectively. 
 
 
Chapter 5. Methodology 
 
In this chapter, we will discuss the various optimizations that we applied to the serial version of 
both ViVid and SRAD in our attempts to parallelize the code.  In order to more explicitly 
quantify the effects that these optimizations had on the code, for both ViVid and SRAD, we used 
the -O0 compiler flag on the Odroid and the /Od flag when compiling on the Ultrabook, which 
tells the compiler not to optimize the code in any way.  
 
There were two main forms of optimizations made to these benchmarks. The first set of 
optimizations will be referred to as serial optimizations since they are made on the serial code 
with no respect to multiple threads or multiple cores. The second set of optimizations deals with 
parallelizing the code by utilizing both the multicore architectures and the hyperthreading models 
that are available on the platforms. These optimizations will be referred to as threadwise 
optimizations.  
 
5.1 Serial Optimizations 
The major serial optimization that was made was the inclusion of vector intrinsics. Vision 
applications have the potential for high data parallelism, and vector operations exploit this 
well. Vector instructions can significantly increase performance since they apply the same 
operation to multiple pieces of data, hence the abbreviation SIMD, which stands for Single 
Instruction, Multiple Data. The main pieces of hardware that enable vector instructions are 
vector functional units and wide hardware registers. Typically, the number of elements and the 
number of bits per element in a vector register can change depending on the particular 
instruction. For this report, however, all elements in the vector registers were 32-bit floats. 
ARM’s SIMD implementation is called NEON and supports vector registers that are 128 bits 
wide, holding four 32-bit floats. Intel supports two different implementations. Streaming 
SIMD Extensions (SSE) supports registers that are also 128 bits wide, while Advanced Vector 
Extensions (AVX) supports registers that are 256 bits wide, holding eight 32-bit floats. Each SSE vector intrinsic that we 
used in both applications had a direct AVX counterpart. For the log and exp functions, which were not available as intrinsics, we 
simply used intrinsics to extract each 128-bit half from the AVX register, called log or exp on 
that extracted data, and inserted the result back into the AVX register. This modification was 
essentially the only difference between the SSE and AVX intrinsics code. 
 
5.1.1 ViVid Serial Optimizations 
In the next few sections, we will discuss the serial optimizations that were specific to the ViVid 
application. We will begin by discussing the memory layout reorganization done in the Filter 
kernel followed by the memory tiling optimization done in the Classifier kernel. We will 
conclude the ViVid serial optimizations section with a brief discussion on loop unrolling. As was 
mentioned previously, in [6], many of the optimizations listed in this chapter were applied to the 
ViVid code on the Ultrabook. These were loop unrolling, vectorization, reorganizing the 
memory layout, and applying the three types of threadwise optimizations. The work done in this 
report improved upon that work by changing the TBB partitioning scheme, replacing slower 
AVX code with non-vectorized code in edge conditions, and adding memory tiling to the 
Classifier kernel. 
 
 
Filter kernel optimizations 
The innermost loop of the Filter kernel is very computationally intensive and was a good 
candidate for vectorization. However, with vectors that could compute four or eight operations at 
a time, the code as it is in Algorithm 1 does not translate well. The approach that we took in 
vectorizing this code was to compute the result for multiple filters at once rather than just one at 
a time. The vectorized algorithm is shown in Algorithm 4. In order to facilitate this new 
approach, the bank of filter coefficients was reorganized to make it more accessible for vector 
operations. In the original version of the kernel, a filter’s nine elements were stored in sequential 
memory locations as in Figure 1(a). In the AVX version of the code, for example, we 
compute eight filters at the same time and so we need the first element of eight filters 
in sequential memory locations, followed by the second element of those filters and so on, as 
shown in Figure 1(b). In essence, we are transposing a matrix of size 9x8 to an 8x9 matrix. 
Figure 1: Memory layout before reorganization (a), where the 9 floats of each filter are contiguous, 
and memory layout after reorganization (b), where the same element of 8 filters is contiguous 
 
The first of the two innermost for loops in Algorithm 4 computes a result vector that holds 
V_WIDTH results, one for each of V_WIDTH filters. Once these results are computed, they are 
checked to see if any is a new maximum. The offset into the filter bank is also updated to point 
to the next group of V_WIDTH filters. In the case of AVX, 8 does not evenly divide 100 
and so the final four filters were done sequentially as in Algorithm 1. This filter bank 
memory layout optimization is discussed in further detail below. 
 
V_WIDTH = 4 if SSE or NEON, 8 if AVX 
for each 3x3 image section centered around an image pixel do 
 max_result = -INF; 
 max_index = -1; 
 f_offset = 0; 
 for filter = 0 to (100/V_WIDTH)-1 do 
  all V_WIDTH elements of result_vector = 0; 
  for i from 0 to 8 do 
   all V_WIDTH elements of image_vector = image_section[i]; 
   filter_vector = load V_WIDTH elements at filter_bank[f_offset+i*V_WIDTH]; 
   result_vector += image_vector * filter_vector; 
  end 
  f_offset += 9*V_WIDTH; 
  for i from 0 to V_WIDTH-1 do 
   if result_vector[i] > max_result then 
    max_result = result_vector[i]; 
    max_index = filter*V_WIDTH + i; 
   end 
  end 
 end 
end 
Algorithm 4: Vectorized Filter Kernel 
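
The reorganized layout and the lane-wise computation of Algorithm 4 can be emulated in portable C++, as sketched below. This is not the intrinsics code used in this work: the inner loops over `lane` stand in for single vector instructions, and the names `reorganize_filters`, `best_filter` and `kElems` are hypothetical. The sketch assumes that the number of filters is a multiple of the vector width, so the sequential handling of leftover filters is omitted.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

constexpr int kElems = 9;  // elements per 3x3 filter

// Reorders a filter bank from [filter][element] to [group][element][lane],
// so that element i of v_width consecutive filters is contiguous in memory.
// Assumes num_filters % v_width == 0.
std::vector<float> reorganize_filters(const std::vector<float>& filters,
                                      int num_filters, int v_width) {
    std::vector<float> out(filters.size());
    for (int f = 0; f < num_filters; ++f) {
        const int group = f / v_width;
        const int lane = f % v_width;
        for (int i = 0; i < kElems; ++i)
            out[(group * kElems + i) * v_width + lane] = filters[f * kElems + i];
    }
    return out;
}

// Finds the best filter for one 3x3 image section (9 floats) using the
// reorganized bank. Each loop over lane emulates one vector instruction.
int best_filter(const float* section, const std::vector<float>& bank,
                int num_filters, int v_width, float* best_weight) {
    float max_result = -std::numeric_limits<float>::infinity();
    int max_index = -1;
    std::vector<float> result(v_width);
    for (int group = 0; group < num_filters / v_width; ++group) {
        for (int lane = 0; lane < v_width; ++lane) result[lane] = 0.0f;
        const float* base =
            &bank[static_cast<std::size_t>(group) * kElems * v_width];
        for (int i = 0; i < kElems; ++i)
            for (int lane = 0; lane < v_width; ++lane)  // one vector multiply-add
                result[lane] += section[i] * base[i * v_width + lane];
        for (int lane = 0; lane < v_width; ++lane) {
            if (result[lane] > max_result) {
                max_result = result[lane];
                max_index = group * v_width + lane;  // index in original order
            }
        }
    }
    *best_weight = max_result;
    return max_index;
}
```

Because every lane of a vector now holds a different filter, no lane is wasted and no horizontal reduction within a vector is needed to form a dot product.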
 
As previously stated, not all code can be directly translated into vectorized code and still produce 
performance benefits or even correct results. Often, a different algorithmic approach or a 
reorganization of data is necessary to make use of the vector instructions. As was 
discussed in the Filter kernel description, the 100 3x3 filters needed to be reorganized in order to 
utilize the full potential of the vector instructions. Had the original layout been maintained, a 
significant amount of computation would have been wasted each iteration. To illustrate this 
point, let us discuss a simple example. In Figure 2, the light gray area represents a potential 6x6 
pixel image. The dark gray area represents the 3x3 filter location centered around the desired 
pixel. The red outline shows the placement of where vector instructions would be called to 
execute the partial dot product for the first row in the filter. We see that since the filter is only 
3x3, one quarter of each four-wide vector instruction would do computation that would not be included in the final 
result of the dot product. More importantly, 25% of the data loaded and stored would be wasted, 
resulting in an increased demand on the memory system. The cache hierarchy would help 
mitigate this, but the effect would still be significant. Another note is that none of the vector 
extensions used had an intrinsic for doing a reduction of the elements in a vector. In order to 
total the valid elements in a vector, we would need further instructions to rotate 
and sum the partial dot products per row, and then additional instructions to sum all 
three rows to form the final overall dot product. With our approach, we were able to reorganize 
the filters in such a way that four dot products (eight with AVX) could be computed simultaneously, which 
removed the wasted computation and unnecessary loading of data from memory. 
Figure 2: Vector instruction using original memory layout 
 
Classifier kernel optimizations 
Another optimization that we attempted was memory tiling in the Classifier kernel. Recall that 
the Classifier kernel took two matrices and computed a third following the same pattern as a 
standard matrix multiplication. This pattern involves much data reuse and with the large sizes of 
our matrices, we knew that our lower level caches would quickly fill up and be overwritten 
before any data reuse could occur. In order to compute a single element in the result matrix, an 
entire row from an input matrix and an entire column from the other input matrix would need to 
be loaded from memory. Depending on the size of the cache and its associativity, the data that is 
loaded and used could be overwritten before having an opportunity to be used again. Thus, in 
order to attempt to keep reusable data in the cache and avoid longer latency operations to the 
main memory, we computed the final matrix in tiles. Figure 3 illustrates our general approach. 
Here we show three square matrices of dimensions NxN that have been divided into a 4x4 grid 
of tiles of size TILE_WIDTH x TILE_WIDTH. We will refer to the matrix with the blue tile as matrix 
A, the matrix with the yellow tile as matrix B and the matrix with the green tile as matrix C. In 
this example, we are attempting to perform A*B=C. By choosing an appropriate TILE_WIDTH, we 
can have three total tiles (two operand tiles and one partial result tile) fitting inside our lowest 
level cache. In the figure, the two black tiles from A and B have already been loaded into the 
cache, used to compute a partial result in the green tile and have then been replaced in the cache 
by the blue and yellow tiles once their data was no longer needed. Thus, instead of having to 
potentially reload each element from A and B a total of N times, we would only need to load that 
same data N/TILE_WIDTH times. This approach decreases the pressure on the memory system 
by a factor of TILE_WIDTH.  
 
The implementation of memory tiling was fairly straightforward. Algorithm 2 shows the serial, 
non-memory tiled approach used in Classifier. In order to use the memory tiling optimization, 
that algorithm was repeated inside a set of three nested for loops. See Algorithm 5 for the 
pseudocode for the memory tiled approach. The additional for loops facilitated the pattern  
illustrated in Figure 3, with loop indexes jj and ll iterating over the tiles in the final result and 
with kk iterating over the tiles of the input matrices. The non-memory tiled algorithm was largely 
unchanged when adding this optimization. The only difference is that the terminating conditions for 
the three innermost for loops were changed to terminate at either the end of the current tile or 
the end of the current row or column of the matrix, whichever came first. 
 
for jj is 0 to J-1, jj += TILE_WIDTH do 
 for ll is 0 to L-1, ll += TILE_WIDTH do 
  for kk is 0 to K-1, kk += TILE_WIDTH do 
   //Non-memory tiled algorithm here 
  end 
 end 
end 
Algorithm 5: Memory tiled Classifier Kernel 
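
A complete tiled version of the Classifier loop nest can be sketched in C++ as follows. The function name `classifier_tiled`, the row-major flat storage and the `tile` parameter (playing the role of TILE_WIDTH) are assumptions for illustration. In this sketch the inner loops also start at the beginning of the current tile, and the partial sum for each output element is carried across kk tiles.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tiled version of C[j][l] = sum over k of (A[j][k] - B[k][l])^2 for
// row-major A (J x K), B (K x L) and C (J x L).
std::vector<float> classifier_tiled(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int J, int K, int L, int tile) {
    std::vector<float> C(static_cast<std::size_t>(J) * L, 0.0f);
    for (int jj = 0; jj < J; jj += tile) {
        for (int ll = 0; ll < L; ll += tile) {
            for (int kk = 0; kk < K; kk += tile) {
                // Inner loops stop at the tile edge or the matrix edge,
                // whichever comes first.
                const int j_end = std::min(jj + tile, J);
                const int l_end = std::min(ll + tile, L);
                const int k_end = std::min(kk + tile, K);
                for (int j = jj; j < j_end; ++j) {
                    for (int l = ll; l < l_end; ++l) {
                        float sum = C[j * L + l];  // partial result so far
                        for (int k = kk; k < k_end; ++k) {
                            const float diff = A[j * K + k] - B[k * L + l];
                            sum += diff * diff;
                        }
                        C[j * L + l] = sum;
                    }
                }
            }
        }
    }
    return C;
}
```

The tile width would be chosen so that two operand tiles and one result tile fit in the targeted cache level; a suitable value depends on the platform and is not fixed by this sketch.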
 
Figure 3: Memory tiling pattern for NxN matrices divided into tiles of width TILE_WIDTH 

An issue that we came across when developing the AVX version of the code was the fact that K 
was not a multiple of 8. In other instances where the width of a vector did not evenly divide 
the size of the data, we simply serialized any extra work. In this case, however, the issue came 
with the alignment of the data. Each aligned AVX memory access needs to be 32-byte aligned. However, 
since the length of a row of K floats was 16 bytes more than a multiple of 32, every other row was originally 16 
bytes away from the necessary alignment. In order to deal with this issue without having to 
complicate the code unnecessarily, an extra 16 bytes of padding was added to the end of 
each row. Thus, the final iteration of the innermost loop would compute half of a vector that was 
not used, but the beginning of each row was now 32-byte aligned. This padding 
resulted in around a 4% memory usage overhead. 
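
The padding arithmetic can be made concrete with a small helper. The name `padded_row_floats` is hypothetical, and the value K = 100 used in the note below is an assumption chosen for illustration; the exact value of K is not stated in this section.

```cpp
#include <cstddef>

// Rounds a row length (in floats) up to a multiple of the vector width, so
// that every row of a contiguous buffer whose start is 32-byte aligned also
// begins on a 32-byte boundary (8 floats of 4 bytes each for AVX).
constexpr std::size_t padded_row_floats(std::size_t k, std::size_t v_width = 8) {
    return (k + v_width - 1) / v_width * v_width;
}
```

Assuming K = 100, a row occupies 400 bytes, which is 16 bytes more than a multiple of 32. Padding to `padded_row_floats(100)`, that is 104 floats, adds 16 bytes per row, an overhead of 4%, which is consistent with the figures reported above.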
 
A significant drawback to using vector intrinsics is programmability. While these intrinsics have 
the look and feel of C-style functions, they all translate to single assembly instructions. 
Programming at a lower level like assembly does give the programmer more direct access into 
what is going on, but it is usually at the cost of development time, code clarity and overall 
performance since compilers can typically do a better job of achieving high performance. Yet 
another downside is that not all code is easily able to benefit from using vector instructions. Even 
in vision applications that are typically computationally intense, there are still instances where 
vector instructions are not always a good fit. For example, in the Histogram kernel from the 
ViVid application, memory access patterns were both unpredictable and non-sequential, making 
this kernel a poor candidate for vectorizing. On this same note, not all applications need four or 
eight of the same operation done at once. In the Classifier kernel in ViVid, we saw that 
when parallelizing with AVX, we ran into the problem of the data not evenly mapping to a 
multiple of eight, resulting in either wasted computation or the use of an alternative approach on the 
halo of the image. This problem was compounded by the fact that aligned AVX memory loads 
and stores need to be 32-byte aligned. This resulted in padding the end of each row with extra 
bytes to make our AVX memory accesses aligned, which increased memory utilization. 
Loop unrolling 
In many instances in the two applications, there were opportunities for unrolling loops that 
would not result in excessively large code size. An example of such a place is in the computation 
of the 3x3 dot product for each of the 100 filters in ViVid’s Filter kernel. Another was for 
performing an addition reduction of a vector whether it was four or eight floats wide. 
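
The two unrolling opportunities mentioned above can be illustrated with scalar C++. These helpers are hypothetical and are not taken from the ViVid or SRAD sources; they only show what writing the loop body out by hand looks like.

```cpp
// Fully unrolled 9-element dot product between one 3x3 filter and one
// 3x3 image section; the loop over i = 0..8 is written out by hand.
inline float dot9_unrolled(const float* f, const float* s) {
    return f[0] * s[0] + f[1] * s[1] + f[2] * s[2] +
           f[3] * s[3] + f[4] * s[4] + f[5] * s[5] +
           f[6] * s[6] + f[7] * s[7] + f[8] * s[8];
}

// Unrolled addition reduction of a four-float vector stored in memory.
inline float sum4_unrolled(const float* v) {
    return (v[0] + v[1]) + (v[2] + v[3]);
}
```

Since the trip counts (9, 4 or 8) are small and fixed, unrolling removes the loop counter and branch without a noticeable increase in code size.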
 
5.1.2 SRAD Serial Optimizations 
We will now discuss the optimizations specific to the SRAD application. The first section 
discusses the Extraction and Compression kernels followed by a few comments about memory 
accesses in Computation Kernels 1 and 2. 
 
Extraction and Compression kernel optimizations 
The most interesting aspect of the Extraction and Compression kernels was realized when 
attempting to parallelize them using vector instructions. Unfortunately, in the version of the 
instruction set extensions that we used, no instructions were available to perform a direct 
exponentiation or logarithm. Fortunately, we were able to find an open source implementation of 
log and exp that strictly used SSE or NEON instructions [18]. The implementation used the same 
algorithm for both the ARM and x86 ISAs and we felt that the amount of computation would be 
comparable. We were unable to find an equivalent version that used AVX instructions. In order 
to still achieve the same functionality in AVX, we used the intrinsic _mm256_extractf128_ps 
to separate the current vector into two SSE vector-sized halves, performed the SSE exp or log on each, 
and then used the AVX intrinsic _mm256_insertf128_ps to combine the two pieces back into a 
256-bit vector. The actual code for this process is given in Code Listing 1.  
 
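avxVal  = _mm256_load_ps(&item->d_I[i]);               // load eight floats (32-byte aligned)
avxVal  = _mm256_div_ps(avxVal, avx255);               // divide each element by avx255
sseVal1 = log_ps(_mm256_extractf128_ps(avxVal, 0));    // SSE log of the low 128 bits
sseVal2 = log_ps(_mm256_extractf128_ps(avxVal, 1));    // SSE log of the high 128 bits
avxVal  = _mm256_insertf128_ps(avxVal, sseVal1, 0);    // write the low half back
avxVal  = _mm256_insertf128_ps(avxVal, sseVal2, 1);    // write the high half back
_mm256_store_ps(&item->d_I[i], avxVal);                // store eight results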
Code Listing 1: log_ps using AVX code 
Computation Kernels 
An interesting aspect of the Computation kernels was the fact that there were indirect memory 
accesses where elements in a particular array provided the index for another array. Regardless of 
the vectorization techniques used, these memory accesses were always done sequentially. While 
vector loads and stores can provide improvements when accessing memory, we could not use 
them since we were not able to rely on the fact that the indexes returned from the innermost array 
access would be sequential. Thus, we were not able to vectorize the innermost array accesses or 
the final array accesses in any case.  
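
The access pattern described above can be sketched as follows. The array and function names are hypothetical and do not reproduce the SRAD data structures.

```c
/* out[i] = data[index[i]]: the address of each load depends on a value that
   is itself loaded from memory. Consecutive iterations may touch unrelated
   locations, so the elements are fetched one at a time rather than with a
   single contiguous vector load. */
static void gather_scalar(const float *data, const int *index,
                          float *out, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = data[index[i]];
    }
}
```

A contiguous vector load could replace this loop only if index[i + 1] == index[i] + 1 were guaranteed for every element, which is exactly the guarantee the text says was not available.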
 
5.2 Threadwise Optimizations 
We will now discuss the threadwise optimizations made in this report. We will begin by 
explaining the various parallelizing techniques in general. Then we will discuss any specific 
changes that needed to be made or points of interest that were specific to the two applications. 
ViVid will be discussed first, followed by SRAD. 
 
5.2.1 General Threadwise Parallelization Techniques 
With the abundance of available accelerators, such as GPUs, DSPs, FPGAs etc., there are 
numerous ways to distribute an application across different pieces of hardware. In this report, 
since the only hardware that was utilized was a multicore CPU, not as many options of where 
and how to distribute the applications were available. Figure 4 shows the typical execution for an 
application with three stages, being executed by one thread on one core. The first frame 
processed is Frame 1. When processing Frame 1, the sole thread executes the three stages in 
order, moving on to the next stage as soon as the previous one is completed. No synchronization 
is necessary since only one thread is executing. Once Frame 1 is complete, the thread 
immediately begins processing Frame 2 in the same manner.  
 
Overall, two approaches were taken to parallelize our applications using multiple threads. The 
first approach involved dividing the computation of a single stage over the available threads. 
This was done mainly using OpenMP’s parallel for constructs and TBB’s parallel_for algorithm. 
OpenMP’s reduction and TBB’s parallel_reduce were also used to achieve the same granularity 
of threadwise optimization. Figure 5 illustrates this approach. The processing pattern follows that 
in Figure 4. Frames are processed in order as are the stages for each frame. In Figure 5, however, 
each stage is divided into four sections, representing four cores. Each core is given a portion of 
the stage to execute. Once that portion is completed, the thread waits in order to synchronize at 
the barrier between stages. Once all threads reach the barrier, execution proceeds to the next 
stage.  
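
A minimal sketch of this stage-level approach, using OpenMP and two hypothetical per-element stages (stage1 and stage2 are placeholders, not ViVid or SRAD kernels):

```c
/* Hypothetical per-element stage bodies; each output element depends only on
   its own input element. */
static float stage1(float x) { return x + 1.0f; }
static float stage2(float x) { return x * 2.0f; }

/* Process one frame in place. Each stage's loop is divided among the
   threads, and the implicit barrier at the end of each `parallel for` keeps
   stage 2 from starting before every element of stage 1 is finished. */
static void process_frame_stagewise(float *frame, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        frame[i] = stage1(frame[i]);
    /* implicit barrier: all threads synchronize here */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        frame[i] = stage2(frame[i]);
}
```

When compiled without OpenMP support the pragmas are ignored and the function runs serially with the same result.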

Figure 4: Single threaded execution of a three stage application
 
The second approach was to assign the computation of an entire frame to a single thread. This 
was accomplished with TBB’s pipeline and the general pattern is shown in Figure 6. Here we see 
that four cores are executing frames simultaneously, but independent of each other. Each of the 
four cores is assigned an entire frame to process, and since each frame can be processed 
independently of other frames, there are no barriers either between stages or frames. Thus we can 
see that frames may take different amounts of time to execute. Whenever a thread finishes 
processing a frame, however, that thread immediately moves on to the next frame to process.  
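
The frame-level pattern can be sketched as follows. Note that this sketch substitutes an OpenMP loop with dynamic scheduling for TBB's pipeline in order to stay in plain C; it only illustrates the idea of one thread owning a whole frame, and the stage bodies and function names are hypothetical.

```c
/* Hypothetical per-element stage bodies. */
static float fl_stage1(float x) { return x + 1.0f; }
static float fl_stage2(float x) { return x * 2.0f; }

/* Run every stage of one frame on the calling thread, in order. */
static void process_whole_frame(float *frame, int n)
{
    for (int i = 0; i < n; i++) frame[i] = fl_stage1(frame[i]);
    for (int i = 0; i < n; i++) frame[i] = fl_stage2(frame[i]);
}

/* `frames` holds num_frames frames of n floats each, stored back to back.
   Each loop iteration (one frame) is handed to whichever thread is free
   next, and there are no barriers between stages or between frames. */
static void process_frames_framewise(float *frames, int num_frames, int n)
{
    #pragma omp parallel for schedule(dynamic, 1)
    for (int f = 0; f < num_frames; f++)
        process_whole_frame(frames + (long)f * n, n);
}
```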
 
Both of these parallelizing approaches were possible because of the various levels of 
independence within the applications. The stage-level parallelism is available due to the fact 
that each element within each stage is computed independently. This fact also allows for 
vectorization of the code. The frame-level parallelism is available since frames do not need to be 
computed in sequence. Due to this fact, different threads in a pipeline could be computing 
different stages at the same time. Thus, the time and energy attributed to these stages can 
overlap, and per-stage measurements would be incorrect. For this reason, the time per stage 
and energy per stage for all filters were not measured when running TBB pipeline with more 
than a single thread. In these cases, only the total energy and total time were measured, from 
which the average time and energy per stage can be estimated. It must be remembered, though, 
that the pipeline is not meant to speed up the computation of a particular stage. It merely 
facilitates overlapping its execution with the execution of other stages. 

Figure 5: Multithreaded execution of individual stages in a three stage application
 
Something important to note is that while elements within a stage and frames themselves can be 
computed independently, stages for a specific frame must be computed in sequence. A drawback 
of this is that when performing a stage-level parallelization, there will be a point at the end of 
the stage when all but one of the threads are waiting to synchronize. At this point, only one 
thread is making forward progress and the parallelism in the application has been reduced. With 
frame-level parallelization, there are fewer synchronization points and in fact the only one that 
actually happens comes when the waiting threads have no other work that can be done before the 
application completes. This, however, could come with a price, since when there are no waiting 
threads, each thread is using up resources such as functional units and memory bandwidth. The 
effects of each of the optimizations will be discussed further in the Results section. 

Figure 6: Multithreaded pipelined execution in a three stage application
 
5.2.2 ViVid Threadwise Optimizations 
As was mentioned earlier, both OpenMP and TBB provide options for parallelizing loops using 
multiple threads. For both the Filter and Classifier kernels in ViVid, each thread is responsible 
for computing a specific section of the final output for that stage. However, in the Histogram 
kernel, each thread can potentially affect the entire output, which leads to the issue of data races. 
We had several different options for dealing with this issue. We could have used locks or 
atomics to serialize the accesses into the resulting histogram or we could have each thread read 
through the entire input, but only modify a certain range of the output. The approach we took 
was simply to allow the data races to occur. With a small number of threads and such a large 
number of entries in the histogram, the probability of a data race leading to incorrect results 
was low. Data races could and did happen, yet the shape of the histogram was not drastically 
changed and the results were still within an acceptable range. Thus, we traded a small amount 
of accuracy for performance since no locking or other serialization techniques were used. 
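
For comparison, the race-free alternative mentioned above, in which every thread reads the whole input but modifies only its own range of the output, could look roughly like this. It is a sketch of an option that was not used in this work, and the names (histogram_by_bin_range, bin_of, parts) are hypothetical.

```c
/* Race-free histogram: the bins are split into `parts` disjoint ranges.
   Each range is handled by one loop iteration, which scans the entire input
   but increments only the bins in its own half-open range [lo, hi).
   `bin_of[i]` is the (precomputed) bin of input element i. */
static void histogram_by_bin_range(const int *bin_of, int n,
                                   unsigned *hist, int num_bins, int parts)
{
    #pragma omp parallel for
    for (int t = 0; t < parts; t++) {
        int lo = (int)((long)num_bins * t / parts);
        int hi = (int)((long)num_bins * (t + 1) / parts);
        for (int i = 0; i < n; i++) {
            int b = bin_of[i];
            if (b >= lo && b < hi)
                hist[b]++;
        }
    }
}
```

The cost of this scheme is that the input is read once per range, which is the kind of extra work the approach taken in this work avoids.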
5.2.3 SRAD Threadwise Optimizations 
The most interesting part of the Reduction kernel arose when attempting to parallelize this kernel 
with vector instructions together with OpenMP or parallel_for. The overall idea of the reduction 
is to take multiple elements and perform an operation between them to reduce the total number of 
elements to one. However, with multiple threads all trying to update this same final element, we 
run into the problem of data races. To solve this issue, we needed to take a different approach 
for each of OpenMP and parallel_for. OpenMP already provides a simple solution for reductions 
in the form of the reduction clause, which is discussed above. This clause requires each variable 
in the variable-list to be a scalar, which our vectors are not. Also, OpenMP requires the operator 
to be one of a predefined list of operators. While we are, in fact, computing a summation, SSE, 
AVX and NEON use intrinsic functions and not the common ‘+’ operator to perform the addition, 
and thus the standard OpenMP reduction clause could not be used with vectorized code. The code 
we developed to perform this reduction is given in Code Listing 2. 
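
For contrast, the scalar case that the reduction clause does handle can be sketched as below. The function name is hypothetical; the point is only that the accumulator is a plain float combined with the built-in ‘+’ operator.

```c
/* Scalar sum using OpenMP's reduction clause: each thread receives a
   private copy of `total` initialised to zero, and the private copies are
   combined with `+` when the loop ends. */
static float sum_with_reduction_clause(const float *x, int n)
{
    float total = 0.0f;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}
```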
 
Our implementation computes the reduction by using a vector with each element containing a 
partial sum. This partial sum vector is computed in parallel and is then summed up sequentially 
and stored to the appropriate location. Thus, modifying the shared partial sum vector is the only 
place where data races can occur, since total and total2 are each modified once by only one 
thread. To create the partial sum vector, we placed the reduction code inside an OpenMP parallel 
region and created local variables to store each thread’s private partial sum. Once each thread 
had computed its partial sum for both the sums and the sums2 arrays, we placed the code that 
modified the shared partial sum vector inside a critical section, which forced the potentially 
racy part of the code to become serial. After the parallel region ended, we used a single thread 
to sum the elements from the partial reduction vectors and update total and total2. 
 
For the parallel_for version, we ended up not using the parallel_for algorithm at all. As 
mentioned above, TBB provides the parallel_reduce algorithm, which does not place as many 
limitations on datatypes or operators as OpenMP does. Using parallel_reduce required defining 
a reduction struct and an operator function, but the code used for the operator was almost 
identical to existing code.  
 
Another point of interest is the level of precision attained in the computation of this kernel. Due 
to the nature of floating-point operations, the order in which the calculations are done can have a 
noticeable impact on the final result. This fact was especially apparent in our reduction code, 
which added hundreds of thousands of floating-point numbers with values generally less than 
ten. With the reduction split into four partial sums with SSE and NEON (or eight in the case of 
AVX), the results were much more accurate, since each partial sum stayed closer in magnitude to 
the elements being added for longer. The results were even more accurate with multiple threads 
since there were more partial sums spread out over the threads, and so by the time the partial 
sums themselves were joined, they were of similar magnitude and precision. 
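
The effect of summation order can be explored with a small scalar experiment such as the one below. It is a sketch under stated assumptions: the four interleaved accumulators only mimic the association order of a 4-wide SIMD register, the function names are hypothetical, and n is assumed to be a multiple of four.

```c
/* Sum n floats with a single running accumulator. */
static float sum_single(const float *x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sum n floats (n a multiple of 4) with four interleaved partial sums, the
   same association order a 4-wide SIMD accumulator would use. */
static float sum_four_lanes(const float *x, int n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4)
        for (int k = 0; k < 4; k++)
            lane[k] += x[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}

/* Reference sum accumulated in double precision, used only to measure the
   error of the two single-precision versions. */
static double sum_reference(const float *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (double)x[i];
    return s;
}
```

Filling x with a large number of equal small values, for example one million copies of 0.1f, and comparing both single-precision results against sum_reference is a simple way to observe the behaviour described above.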
 
 
 
 
 
 
__m128 sseVal, sseVal2;
sseVal  = _mm_setzero_ps();
sseVal2 = _mm_setzero_ps();

#pragma omp parallel
{
    // private partial sums for each thread
    __m128 sseVal_private, sseVal2_private, sseTemp;
    sseVal_private  = _mm_setzero_ps();
    sseVal2_private = _mm_setzero_ps();

    // the loop steps by four floats, so NumE is assumed to be a multiple of 4
    #pragma omp for nowait
    for (int i = 0; i < NumE; i += 4) {
        sseTemp = _mm_load_ps(&item->d_sums[i]);
        sseVal_private = _mm_add_ps(sseVal_private, sseTemp);
        sseTemp = _mm_load_ps(&item->d_sums2[i]);
        sseVal2_private = _mm_add_ps(sseVal2_private, sseTemp);
    }

    // serialize the updates of the shared partial sum vectors
    #pragma omp critical (VAL_critical)
    sseVal = _mm_add_ps(sseVal, sseVal_private);

    #pragma omp critical (VAL2_critical)
    sseVal2 = _mm_add_ps(sseVal2, sseVal2_private);
}

//reduce the vectors with a single thread
float* tempArr  = (float*) _mm_malloc(sizeof(float)*4, 512);
float* tempArr2 = (float*) _mm_malloc(sizeof(float)*4, 512);
float tempVal = 0.0f, tempVal2 = 0.0f;

_mm_store_ps(tempArr, sseVal);
_mm_store_ps(tempArr2, sseVal2);

for (int i = 0; i < 4; i++) {
    tempVal  += tempArr[i];
    tempVal2 += tempArr2[i];
}

item->total  = tempVal;
item->total2 = tempVal2;
// memory from _mm_malloc is released with the matching _mm_free
_mm_free(tempArr);
_mm_free(tempArr2);
Code Listing 2: Reduction kernel using OpenMP and SSE code
5.3 Other Points of Interest 
There are several points of interest that are important to mention before discussing the results. 
The first point is that, despite our best efforts, we were not always able to have the exact same 
code running on both platforms. Probably the most obvious distinction between the two code 
bases was the differences in reading and reporting energy consumed. This problem was 
impossible to avoid since the architectures of the platforms are inherently different and sensors 
and performance monitors that exist on one platform do not exist on the other.  
 
Another difference is the different intrinsics available between Intel’s SSE and AVX and ARM’s 
NEON. The first difference to mention is that with the version of SSE that we used, there is no 
fused multiply add. This intrinsic is most useful in the Filter kernel in ViVid when a filter is 
applied over an area in the image. To make up for the lack of this intrinsic, we used an intrinsic 
multiply as an argument to an intrinsic add. The same functionality was achieved in this case, but 
since there are tens of millions of fused multiply-adds per low definition frame, even a slight 
speedup from having a single intrinsic could provide significant benefits. 
 
A second difference in the SIMD code was due to the fact that the version of NEON that we used 
had no divide intrinsic. While this was surprising and potentially problematic, we worked around 
it by using the reciprocal intrinsic followed by a multiply. Since there is no divide against 
which to compare, it is difficult to say whether a divide intrinsic could provide a performance 
boost. Fortunately, there were not nearly as many divides in our applications as fused 
multiply-adds, and so the performance benefit would not have been as substantial. However, 
accuracy, in this case, is an important consideration, since a floating-point reciprocal followed 
by a floating-point multiplication could produce different results than a single floating-point 
divide. Our code saw no differences between our NEON implementation and the serial version 
of the code, but the possibility for error was present. 
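
The accuracy concern can be illustrated in scalar C. In this sketch an exactly rounded 1.0f / b stands in for the NEON reciprocal intrinsic, which is an assumption made for portability; a hardware reciprocal estimate may be less precise than this stand-in. The function name is hypothetical.

```c
/* Division expressed as reciprocal-then-multiply. The reciprocal is rounded
   once and the product is rounded again, so the result can differ from a
   single correctly rounded a / b in the last bits. */
static float div_via_reciprocal(float a, float b)
{
    float r = 1.0f / b;   /* stand-in for a hardware reciprocal */
    return a * r;
}
```

When b is a power of two the reciprocal is exact and both forms agree; for other divisors the two roundings may or may not cancel, which is why a comparison against the serial version, as described above, is a sensible check.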
 
The final major difference between the two SIMD code bases lies in the open source log_ps and 
exp_ps functions. The bulk of the implementation is the same on both platforms; however, where 
the serial form of the algorithm uses an if-then-else statement, the SIMD versions resort to 
some SIMD “trickery” with compares and bitwise operators, and in these cases the two 
implementations took somewhat different approaches. The running times of these functions 
differed between the Ultrabook and Odroid platforms, but whether or not they were exactly 
proportional to the rest of the code was difficult to ascertain. Any resulting difference in 
performance, however, was very minor, if noticeable at all. 
 
 
 
 
 
 
 
 
 
 
 
Chapter 6. Results 
In this section, we discuss the results obtained when conducting our experiments. We begin with 
a description and derivation of our three performance metrics. We then discuss the effects that 
the serial and threadwise optimizations had on both of the applications while running on both of 
the platforms. 
 
6.1 Performance Metrics 
Until the mid-1990s, performance was the single most important feature of a microprocessor and 
performance at the expense of energy consumption was a common tradeoff. However, with the 
advent of mobile devices, a much more concentrated effort has been made to improve the energy 
efficiency of microprocessors. In this section, we will use three metrics when detailing the results 
from our experiments. The first metric, throughput, deals strictly with performance and is 
measured in terms of number of frames processed per second.  Its formula is given in Equation 2. 
The second metric deals with the amount of energy that is consumed in order to process an item. 
This metric will be in terms of Joules per frame and the formula can be found in Equation 3.  
 
\[
\text{Throughput} = \frac{\text{Number of Frames processed}}{\text{Total time (ms)}} \tag{2}
\]

\[
\text{Energy per frame} = \frac{\text{Total Energy Consumed (J)}}{\text{Number of Frames processed}} \tag{3}
\]
 
In the mobile device world, consumers want high performance, but not at the expense of a 
shortened battery life. This, in turn, leads us to the third and final metric used in this study. This 
metric relates performance to energy consumption and is an important one to consider since the 
two are often inversely related. A similar metric to the one used in this work is the Energy 
Delay Product [19], which directly captures the tradeoff between performance and energy. 
However, vision applications are often applied to streaming input and thus the total number of 
items to be processed as well as the ending time are not known beforehand. This fact makes this 
third metric a more accurate representation of the tradeoff between performance and energy for 
these applications. The metric used here is a measure of throughput per energy, computed as 
throughput divided by energy per frame, and will be referred to as λ in the derivation.  
 
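\[
\lambda = \frac{\dfrac{\text{Number of Frames processed}}{\text{Total time}}}{\dfrac{\text{Total Energy Consumed}}{\text{Number of Frames processed}}} \tag{4}
\]

\[
\lambda = \frac{\text{Number of Frames processed}}{\text{Total time}} \cdot \frac{\text{Number of Frames processed}}{\text{Total Energy Consumed}} \tag{5}
\]

\[
\lambda = \frac{(\text{Number of Frames processed})^{2}}{\text{Total time} \cdot \text{Total Energy Consumed}} \tag{6}
\]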
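
The three metrics can be computed directly from the measured totals, as in the sketch below. The function names are hypothetical, and the sketch assumes the total time is given in seconds so that throughput comes out in frames per second.

```c
/* Equation 2: frames processed per unit of time. */
static double throughput(double frames, double total_time)
{
    return frames / total_time;
}

/* Equation 3: energy consumed per processed frame. */
static double energy_per_frame(double total_energy, double frames)
{
    return total_energy / frames;
}

/* Equation 6: throughput per energy, i.e. throughput divided by energy per
   frame, simplified to frames^2 / (time * energy). */
static double throughput_per_energy(double frames, double total_time,
                                    double total_energy)
{
    return (frames * frames) / (total_time * total_energy);
}
```

With made-up totals of 100 frames, 50 seconds and 400 Joules, the throughput is 2 fps, the energy per frame is 4 Joules, and λ is 0.5 fps per Joule, which matches dividing the first value by the second.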
 
 
6.2 Effects of Optimizations 
This work is largely an extension of the work found in [6]. In that work, the ViVid application 
was parallelized using many of the techniques described in the Methodology section, namely 
vectorizing, reorganizing the data in the Filter kernel, unrolling loops and applying the three 
types of threadwise optimizations. This work improved upon the work done there by changing 
the TBB partitioning scheme, replacing slower AVX code with non-vectorized code in edge 
conditions and by adding memory tiling. The work done on the Odroid platform for both 
applications as well as the work on the SRAD application on the Ultrabook represents entirely 
novel work. The ViVid code was run on 100 frames and the SRAD code was run on 500 frames. 
 
6.2.1 ViVid on Odroid 
We will begin the Results section by discussing the ViVid application on the Odroid. Figure 7(a) 
presents the normalized speedup of the total application and all three kernels with respect to the 
serial implementation of ViVid. Figure 7(b) shows the total energy usage of all four 
configurations. 
 
The configuration with NEON and no memory tiling had the best speedup at 2.0x overall. The 
configuration with the least energy consumption was NEON with memory tiling. The results in 
each of the four groups in Figure 7(a) are normalized with respect to the non-vectorized version 
without memory tiling. Since the Filter kernel is not affected by memory tiling, there are only 
results for Serial and NEON. Also the Histogram kernel is not affected by memory tiling nor 
does it use NEON, and so only one bar is shown. Another point to note is that while NEON 
allows for four similar operations to be completed simultaneously, the speedup gained by using 
NEON over the serial version is only about 2x for the total application. We can see that the Filter 
kernel achieves speedups of 2.3x with NEON while the Classifier is only around 1.75x. We can 
attribute this to the fact that NEON is able to improve computation but memory accesses can still 
prove to be a bottleneck. In the Filter kernel, the grid of 100 3x3 filters will reside in the lowest 
level of the cache hierarchy. This allows for memory bandwidth to not be as significant an issue 
and thus the effect of the vectorization is more easily felt. An unexpected result was that 
memory tiling actually hurt performance here. Multiple tile sizes were tried in an attempt to fit 
the data into different levels of cache, but they yielded fairly similar performance results. We 
attribute the performance loss here to the additional instructions, especially branches, executed 
as a result of wrapping three extra for loops around the non-memory tiled code.  
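
To make the overhead of the extra loops concrete, the sketch below tiles a generic three-level loop nest. It uses a small matrix multiplication purely as a stand-in, not the ViVid Classifier kernel, and all names are hypothetical.

```c
/* C += A * B for n x n row-major matrices, untiled. */
static void matmul_plain(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Same computation with three extra tile loops (ii, jj, kk) wrapped around
   the original nest. Each tile reuses a small block of A, B and C, but the
   extra loops add counters, bound computations and branches. */
static void matmul_tiled(const float *A, const float *B, float *C,
                         int n, int tile)
{
    for (int ii = 0; ii < n; ii += tile)
        for (int jj = 0; jj < n; jj += tile)
            for (int kk = 0; kk < n; kk += tile)
                for (int i = ii; i < min_int(ii + tile, n); i++)
                    for (int j = jj; j < min_int(jj + tile, n); j++)
                        for (int k = kk; k < min_int(kk + tile, n); k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Both functions produce the same output; whether the tiled version is faster depends on whether the cache reuse outweighs the added loop overhead.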
Figure 7: ViVid single threaded performance on Odroid (a) 
ViVid single threaded energy consumption on Odroid (b) 
 
To view the effects of the single-threaded optimizations in terms of energy consumption, refer to 
Figure 7(b). This figure gives the total energy consumed for the entire chip, including all cores as 
well as the memory system. As expected, the total energy consumed when using the vector 
instructions was about half that of the code running without them. This shows that vector 
instructions bring not only performance benefits but also energy savings. A rather unexpected 
result was that, despite the fact that the memory tiling code experienced performance losses and 
ran longer, the total energy consumed was less. In fact, the serial, non-memory tiled code used 
around 7% more energy than its memory tiled counterpart while the vectorized, non-memory 
tiled code used 5% more energy than the version with memory tiling.  
 
Now let us look at the performance results for the threadwise optimizations of ViVid on the 
Odroid. Each of the results mentioned in this section was normalized against the single-threaded, 
non-vectorized, non-memory tiled version of the code. As was mentioned above, the 
performance and energy breakdowns per kernel using TBB’s pipeline are not accurate 
measurements due to overlapping computation; thus we will present only the throughput and 
energy consumed for the total application. The performance results are presented in Figure 8. 
TBB’s pipeline experienced the best overall performance gains when run with four 
threads and used in conjunction with NEON but no memory tiling. The throughput with this 
configuration was 1.26 fps. The configuration with the worst throughput was TBB’s parallel_for 
running non-vectorized, non-memory tiled code on one thread. This configuration’s throughput 
was 0.11 fps. The performance gained from increasing the number of threads remained relatively 
consistent with the single-threaded results. The vectorized code with OpenMP experienced a 
3.66x gain when using four threads over just one. TBB’s parallel_for experienced a 3.84x 
speedup and TBB’s pipeline had a speedup of 3.96x when using four threads as opposed to just 
one. A significant reason for the pipeline’s almost linear improvement is that it has far fewer 
synchronization points than OpenMP or parallel_for, since the pipeline threads never need to 
wait for the other threads to complete execution. TBB’s parallel_for experienced 
the smallest improvements in terms of both baseline performance and improvements gained with 
any subsequent increases in thread count.  
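
As a quick check on how close these four-thread speedups are to ideal, one can divide each by the thread count. This parallel efficiency figure is a standard derived quantity and is not one of the metrics defined in this work:

```latex
\[
E_{\text{OpenMP}} = \frac{3.66}{4} \approx 0.92, \qquad
E_{\text{parallel\_for}} = \frac{3.84}{4} = 0.96, \qquad
E_{\text{pipeline}} = \frac{3.96}{4} = 0.99
\]
```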
  
The energy consumption over the various threadwise optimizations was also fairly consistent 
with the single-threaded results from Figure 7(b). The energy consumption results are shown in 
Figure 9. Overall, the best performer with regard to energy consumption was the pipeline with 
NEON and memory tiling on two threads. This configuration consumed 5.05 Joules per frame. 
The worst performer was TBB’s parallel_for without NEON or memory tiling on one thread at 
15.8 Joules per frame. The versions of the code that used the NEON extensions on average 
consumed around half as much energy as their non-vectorized counterparts. As was also seen in 
the single-threaded results, the memory-tiled versions of the code typically used less energy, 
although the difference was more significant with non-vectorized code.  
 
An interesting result was the curve in energy consumption when increasing the number of 
threads. In every configuration, OpenMP or TBB, vectorized or not, with or without memory 
tiling, increasing the number of threads from one to two brought a decrease in the total energy 
consumed. Increasing from two to three threads saw relatively small change in total energy 
consumed. However, increasing again from three to four threads saw an increase in energy 
consumption, although not as drastic as the drop from one to two threads. This bowl shaped 
curve can be attributed to several factors. First, regardless of the total number of threads, both 
OpenMP and TBB have overheads with regard to scheduling threads on cores and partitioning 
the workload. With one thread, only the energy overheads are experienced, but none of the 
benefits from the multicore architecture is felt since there is only one thread to be assigned work. 
By adding threads, the scheduling and partitioning overhead is put to use since the workload is 
now available to be run across multiple cores. The extra energy used when going to four cores is 
largely due to the fact that in the case of the Odroid, each core is turned on and using energy. The 
energy consumption of the scheduling overhead has been distributed across the cores, but the 
fact that now every single core is powered on gives us that extra energy used. 

Figure 8: OpenMP speedup of ViVid on Ultrabook (a) 
TBB Parallel_for speedup of ViVid on Ultrabook (b) 
TBB Pipeline speedup of ViVid on Ultrabook (c) 
 
The final point of analysis on the ViVid application on the Odroid is the throughput per energy 
metric discussed above. Table 2 displays the three highest and lowest performing configurations 
according to throughput per energy for ViVid on the Odroid. The best configuration was with 
OpenMP, NEON without memory tiling while running on four threads. The worst configuration 
was with Parallel_for, Serial code with no memory tiling running on one thread. 
 
Figure 9: OpenMP speedup of ViVid on Ultrabook (a) 
TBB Parallel_for speedup of ViVid on Ultrabook (b) 
TBB Pipeline speedup of ViVid on Ultrabook (c) 

Table 2: Throughput per Energy results for ViVid on Odroid 

Configuration                                               | Throughput per energy 
------------------------------------------------------------|---------------------- 
OpenMP parallel for, NEON, no memory tiling, 4 threads      | 0.230 fps per Joule 
TBB Pipeline, NEON, no memory tiling, 4 threads             | 0.228 fps per Joule 
OpenMP parallel for, NEON, with memory tiling, 4 threads    | 0.210 fps per Joule 
…                                                           | … 
OpenMP parallel for, Serial, no memory tiling, 1 thread     | 0.0109 fps per Joule 
TBB Parallel_for, Serial, with memory tiling, 1 thread      | 0.0073 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 1 thread        | 0.0067 fps per Joule 
 
In terms of overall performance in this metric, using NEON with four threads and without 
memory tiling produced the best results. OpenMP and TBB’s pipeline were very similar in 
outcome, with both almost 10% higher than the third best configuration. Also, with 
the exception of TBB’s pipeline with NEON and memory tiling, each increase in the number of 
threads also brought with it an increase in throughput per energy. This single exception was due 
to the fact that there was a large jump in energy consumption going from three to four threads 
without a proportional jump in performance, which can be seen from Figures 8 and 9. The full 
results of throughput per energy are in Table 8 in Appendix A. 
 
An interesting note in all three threadwise optimization performance results is the average slope 
of the lines between serial and NEON. The fact that the NEON performance improves more with 
the increase of threads reaffirms the fact that this application is largely computation bound. Had 
memory been an issue, regardless of vectorization or number of threads, the average slopes of 
the two lines would have been more consistent. Discussed below is the effect of having even 
wider vector registers on this computation bound application. 
 
 
6.2.2 ViVid on Ultrabook 
Now we will discuss the results from running the ViVid application on the Ultrabook. We 
applied the same optimizations to ViVid on the Ultrabook that were applied on the Odroid 
platform with the addition of AVX. As was mentioned above, AVX registers are 256 bits wide and 
allow for eight simultaneous single-precision floating point operations. Aside from the few 
exceptions mentioned previously, the SSE code and the NEON code, which have the same width 
of vector registers, were almost identical. The same holds for AVX aside from a few minor 
differences such as the extra data padding, the exp and log functions and, naturally, the reduced 
number of loop iterations since twice as much work could be done with the same number of 
instructions. In this section, we will discuss the improvements from the optimizations only with 
respect to ViVid on the Ultrabook with a few references to the Odroid for comparison. In the 
Conclusion section, an overall comparison will be made between both platforms using the data 
from both applications. Figure 10 shows the normalized performance speedups from all six 
possible combinations of vectorizing modes and memory tiling. Similar to the Odroid platform, 
only one configuration from the Histogram kernel is shown since there are no serial 
optimizations made. Also, in the Filter kernel, no results with memory tiling are shown as 
memory tiling does not affect performance in this kernel. 
 
Overall, the best performer in terms of speedup was the AVX code without memory tiling. This
configuration achieved a total speedup of 2.84x. The worst configuration, serial code with
memory tiling, experienced a 5% slowdown from the baseline code. There are many interesting
points to consider from these results. The first is the speedups seen in the Filter kernel. Recall
that on the Odroid, vectorizing the Filter kernel only led to around a 2.3x speedup. On the
Ultrabook, however, SSE shows around a 3.42x speedup using registers of the same width as
NEON. On the same note, notice that the AVX version of Filter only experiences a 4.1x speedup
although its registers are twice as wide as those of SSE. Another point to mention is the
difference in speedup in the Histogram kernel. When compiling the serial, SSE or AVX version,
we used the compiler flags /arch:IA32, /arch:SSE and /arch:AVX respectively. By default, SSE2
code is generated, and thus we used the /arch:IA32 compiler flag, which disables the generation
of any vector code. While not shown in the figure, it is interesting to note that the Histogram
kernel runs the exact same code regardless of the level of vectorization used, yet when the
/arch:AVX compiler flag was used, the Histogram kernel ran on average 30% slower.
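
One way to read the Filter kernel numbers above is as a fraction of the ideal speedup, which
would equal the number of lanes if every operation were vectorized perfectly. The helper below
is a hypothetical illustration of that arithmetic and is not part of the measured code.

```cpp
#include <cassert>

// Fraction of the ideal (lane-count) speedup that a vectorized kernel achieved.
double vector_efficiency(double measured_speedup, int lanes) {
    assert(lanes > 0);
    return measured_speedup / lanes;
}
```

With the figures quoted above, NEON reaches 2.3/4, or about 58% of the ideal, SSE reaches
3.42/4, or about 86%, and AVX reaches 4.1/8, or about 51%.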
 
Figure 10: ViVid single threaded performance on Ultrabook

Similar to the Odroid platform, vector instructions and memory tiling have a large impact on the
energy consumed on the Ultrabook, the results of which are shown in Figure 11. The most
energy efficient configuration was AVX code with memory tiling, consuming only 683 Joules.
The least energy efficient configuration was the serial version, also with memory tiling,
consuming 2,169 Joules. Unlike on the Odroid platform, however, memory tiling did not have a
positive overall effect on energy consumption. Without vectorized code, memory tiling increased
energy consumption by almost 8%, and with SSE code, energy consumption increased by just
under 10%. AVX was the only form of vectorization that experienced energy savings with
memory tiling, although they were negligible. The final point to make is the improvement of the
vectorized code over the serial version. The savings on the Ultrabook were much more
significant than on the Odroid. The SSE code without memory tiling used only 39% of the
energy of the serial version, while AVX used a mere 34%.
 
Figure 11: ViVid single threaded energy consumption on Ultrabook

The performance results from the threadwise optimizations are presented in Figure 12. In terms
of throughput, parallel_for running serial code with memory tiling was the worst performing
configuration for ViVid on the Ultrabook, with a throughput of 0.43 fps. Surprisingly, the best
performing configuration also used parallel_for: AVX code with memory tiling running on four
threads. It obtained a speedup of 6.1x over the baseline and achieved a throughput of 2.69 fps.
Overall, the optimizations followed the same general pattern as on the Odroid. Recall that a key
difference between the two platforms is that the Odroid has four A15 cores, each of which
executes at most one thread, whereas the Ultrabook has only two cores but supports
hyperthreading, which allows up to two threads to run per core. This means that when threads
three and four are added on the Ultrabook, hyperthreading is occurring and both cores are already
powered on. In the case of OpenMP and TBB’s parallel_for, AVX with and without memory
tiling experienced the largest performance gains. AVX on OpenMP averaged a speedup around
0.5x larger than SSE, which is roughly on par with the results from Figure 10. AVX with TBB’s
parallel_for experienced an average speedup 0.98x larger than SSE. Something that was rather
surprising was the poor performance of the non-vectorized versions of the code on OpenMP and
TBB’s parallel_for. The best speedup seen there was a mere 2.3x when OpenMP ran on four
threads. With parallel_for running on one thread, ViVid took almost 50% longer to execute than
without parallel_for. The slope of the curve tracking performance improvement for the pipeline
was relatively constant regardless of the serial optimizations. This can be attributed to the fact
that the pipeline has no synchronization between threads until the end of the application.
 
Figure 12: OpenMP speedup of ViVid on Ultrabook (a), TBB Parallel_for speedup of ViVid on
Ultrabook (b), TBB Pipeline speedup of ViVid on Ultrabook (c)

Unlike performance, the energy efficiency on the Ultrabook was quite different from that on the
Odroid. The results are shown in Figure 13. The best configuration with regard to energy
consumption was TBB’s parallel_for running AVX code with memory tiling on four threads. This
configuration averaged 4.15 Joules per frame. The worst performer for energy consumption was
also TBB’s parallel_for, running non-vectorized, memory-tiled code on one thread. This
configuration consumed about 39.23 Joules per frame. The trend of the energy consumption with
the increase in the number of threads was much closer to what was anticipated. Each
configuration consumed the most energy when running with a single thread, and the greatest drop
in energy usage came from increasing the number of threads to two. The energy consumption of
memory-tiled code was always greater than that of its non-memory-tiled counterpart when using
non-vectorized instructions. When using vector instructions, this trend was also seen, although
not consistently. The energy efficiency of the two types of vectorized code varied greatly across
the OpenMP and TBB parallelizations. With OpenMP, SSE averaged an energy consumption of
around 65 Joules more than the code written in AVX. With TBB’s parallel_for, this difference was
even greater, with SSE averaging more than 160 Joules more than AVX. With the pipeline,
however, AVX surprisingly consumed on average around 51 Joules more than the SSE version.
 
Figure 13: OpenMP energy consumption of ViVid on Ultrabook (a), TBB Parallel_for energy
consumption of ViVid on Ultrabook (b), TBB Pipeline energy consumption of ViVid on
Ultrabook (c)

The final metric to discuss for ViVid is throughput per energy. The top configuration was TBB’s
parallel_for running AVX code with memory tiling on four threads, while the worst configuration
was also parallel_for, but running serial code with memory tiling on one thread. The top three
and bottom three configurations for this metric are listed in Table 3. The full table of results can
be found in Table 9 in Appendix A. The throughput per energy tests yielded interesting results.
TBB’s parallel_for has five of the top ten configurations, but also has six of the ten worst
performers. This disparity shows the great effect, positive or negative, that the threadwise
optimizations can have on throughput per energy. A surprising result was the poor performance
of TBB’s pipeline. The pipeline has only two of the top ten entries and has the lowest average
throughput per energy over all its configurations. Overall, OpenMP had the highest average
throughput per energy, which was almost 8% more than that of TBB’s pipeline.
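
The values reported for this metric appear consistent with dividing the throughput in frames per
second by the energy in Joules per frame. That definition is an assumption inferred from the
reported numbers, and the function name below is hypothetical.

```cpp
#include <cassert>

// Throughput per energy under the assumed definition:
// frames per second divided by Joules per frame.
double throughput_per_energy(double frames_per_second, double joules_per_frame) {
    assert(joules_per_frame > 0.0);
    return frames_per_second / joules_per_frame;
}
```

For TBB’s parallel_for running AVX code with memory tiling on four threads, 2.69 fps and 4.15
Joules per frame give roughly 0.648, in line with the top entry of Table 3.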
 
Table 3: Throughput per energy results for ViVid on Ultrabook

Configuration                                             Throughput per energy
TBB Parallel_for, AVX, with memory tiling, 4 threads      0.6482 fps per Joule
TBB Parallel_for, AVX, no memory tiling, 4 threads        0.6240 fps per Joule
OpenMP parallel for, AVX, no memory tiling, 4 threads     0.5954 fps per Joule
…                                                         …
TBB Parallel_for, Serial, with memory tiling, 2 threads   0.0140 fps per Joule
TBB Parallel_for, Serial, no memory tiling, 1 thread      0.0069 fps per Joule
TBB Parallel_for, Serial, with memory tiling, 1 thread    0.0059 fps per Joule

6.2.3 SRAD on Odroid
We will begin our analysis of the SRAD application with a discussion of the results from the
Odroid platform. Refer to Figure 14(a) and (b), which display the total running times from the
two single threaded versions of SRAD.

The total running time of SRAD was reduced by 67% when vector code was included. However,
as the breakdown by kernel shows, each kernel was not reduced by the same amount. In fact, the
majority of the reduction of execution time came from two kernels, Extraction and Compression.
While very simple in structure, each of these kernels uses a function that requires a significant
amount of time to compute. This is clear from the fact that the Compression kernel has only one
memory read, one memory write, one floating-point multiplication and one logarithm, and yet
takes almost 80 seconds. In contrast, the Preparation kernel has two memory reads, two
memory writes and one floating-point multiplication, but no logarithm, and takes only 4.6
seconds. We are able to see a significant speedup in these two kernels when using NEON for two
main reasons. The first is simply the fact that with NEON, four computations are being carried
out simultaneously. The second is that the open source NEON function log_ps, which computes
the logarithm, follows a simpler and less accurate approximation of the operation. The situation
is the same for the Compression kernel. This loss of precision, however, was minimal and
was within an acceptable threshold of accuracy. Aside from these two kernels, each of the other
four kernels experienced a 20-30% average drop in execution time when converting to NEON
code.
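
A percentage reduction in running time can be restated as a speedup factor, which makes it easier
to compare with the speedups quoted for ViVid. The helper below is a hypothetical illustration of
that conversion.

```cpp
#include <cassert>

// Converts a fractional reduction in running time (0.67 for 67%) into a speedup factor.
double speedup_from_time_reduction(double fraction_saved) {
    assert(fraction_saved >= 0.0 && fraction_saved < 1.0);
    return 1.0 / (1.0 - fraction_saved);
}
```

A 67% reduction corresponds to a speedup of about 3.0x, while the 20-30% reductions seen in the
other four kernels correspond to speedups of about 1.25x to 1.43x.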
 
 
Figure 14: SRAD total single threaded performance on Odroid (a) SRAD single threaded
performance on Odroid per kernel (b)

The energy consumption results, shown in Figure 15, are similar to the performance speedup
results. The total energy usage dropped by 63% when converting to vector code. The Extraction
and Compression kernels again improved significantly more than the average. The
Extraction kernel experienced energy savings of around 82%, whereas Compression was just
below 70%. The Extraction results may be attributed to the fact that ARM’s NEON does not
have a divide intrinsic and floating-point divides are relatively costly in terms of energy and
performance. However, NEON does have a reciprocal intrinsic, which was used in conjunction
with a multiply to recreate a divide. This combination of reciprocal and multiply may have added
to the energy savings seen here. None of the other kernels experienced any significant change
other than perhaps the Reduction kernel, which actually saw a slight increase in energy
consumption.
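
The reciprocal-and-multiply idea can be sketched in scalar C++ as follows. This is an
illustration under stated assumptions, not the thesis code: the function names are hypothetical,
the coarse estimate is simulated in software, and whether the original kernels refined the
estimate is not stated. On NEON, the estimate would come from vrecpeq_f32 and the refinement
factor from vrecpsq_f32.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for a hardware reciprocal estimate: the true reciprocal with its mantissa
// truncated to 8 bits. The divide here exists only to simulate a coarse estimate.
float coarse_reciprocal(float d) {
    float r = 1.0f / d;
    std::uint32_t bits;
    std::memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;  // keep sign, exponent and the top 8 mantissa bits
    std::memcpy(&r, &bits, sizeof r);
    return r;
}

// One Newton-Raphson step: x' = x * (2 - d * x).
float refine_reciprocal(float d, float x) {
    return x * (2.0f - d * x);
}

// a / d rebuilt from a reciprocal estimate, two refinement steps and one multiply.
float divide_by_reciprocal(float a, float d) {
    assert(d != 0.0f);
    float x = coarse_reciprocal(d);
    x = refine_reciprocal(d, x);
    x = refine_reciprocal(d, x);
    return a * x;
}
```

Each refinement step roughly doubles the number of correct bits, so two steps bring an 8-bit
estimate close to single precision. Fewer steps trade accuracy for speed.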
 
Figure 15: SRAD total single threaded energy consumption on Odroid (a) SRAD single threaded
energy consumption on Odroid per kernel (b)

Refer to Figure 16(a) and (b) for the results of the threadwise optimizations on SRAD. TBB’s
pipeline has, yet again, proven to be the top overall threadwise optimization with respect to
performance. The pipeline had the best overall configuration, reaching 23.7 fps when running
NEON code on four threads. The worst overall configuration was TBB’s parallel_for, with a
throughput of 1.98 fps when running non-vectorized code on one thread. The first interesting
thing to point out is the linearity of the slope by which the performance increases when adding
threads. This suggests that the work in this application is computation bound. If the
application were memory bound, no improvement would be experienced regardless of how many
threads were trying to access the memory. However, if we look at Figure 16(a), we see
that running on four threads improves performance by almost 4x. Another fact that is closely
related to this one is the similarity of the improvements regardless of the threadwise approach.
Again referring to the serial code in Figure 16(a), the three approaches yield essentially the same
performance. The NEON version of the code is not quite as close, but there is only about a 6%
difference between the two most distant points.
 
Figure 16: SRAD multithreaded serial code performance on Odroid (a) SRAD multithreaded
NEON code performance on Odroid (b)

We will next discuss the total energy consumption of the Odroid platform by threadwise
optimization. Refer to Figures 17(a) and (b) for the results of the energy consumption for the
total application. Overall, the best performer with regard to energy consumption was TBB’s
pipeline running NEON code on three threads. This configuration had an average energy
consumption of 0.264 Joules per frame, whereas the worst configuration, TBB’s parallel_for
running the non-vectorized code on one thread, had an energy consumption of 0.8 Joules per
frame. The shape of the curves in these figures closely resembles the bowl-shaped curves from
Figure 9. The energy consumption similarly drops significantly when increasing the number of
threads from one to two in all configurations. The average energy consumption for the NEON
code was 142.5 J and the average for the non-vectorized code was 369 J. These figures closely
resemble the amounts from Figure 15, with the NEON results dipping somewhat below the
average found earlier.
 
Figure 17: SRAD multithreaded serial code energy consumption on Odroid (a) SRAD
multithreaded NEON code energy consumption on Odroid (b)

We will finish our discussion of the results from the Odroid board by discussing the throughput
per energy results presented in Table 4. The top configuration was TBB’s pipeline running
NEON code on four threads, and the worst configuration was again parallel_for with serial code
on one thread. Unlike the throughput per energy results that we have seen so far, these results
were much closer across optimizations. Table 4 shows the top three and
bottom three performers in terms of throughput per energy. Just from these six entries, we can
see a pattern that held true over almost all 24 entries in the full breakdown. Any version of
NEON code always outperformed serial code. Within that breakdown, more threads always
provided better throughput per energy. Finally, with the same vectorization and number of
threads, TBB’s pipeline always performed better than TBB’s parallel_for and, in all but one
instance, performed better than OpenMP. While the pattern was generally true, that is not to say
that the results were always close. As can be seen in the difference between the top two
configurations, there was as much as a 14.4 fps per Joule difference between two configurations.
However, the difference was never significant enough to make a drastic change to the pattern.
The full table of results is found in Table 10 in Appendix A.
 
Table 4: Throughput per energy results for SRAD on Odroid

Configuration                           Throughput per energy
TBB Pipeline, NEON, 4 threads           87.43 fps per Joule
OpenMP parallel for, NEON, 4 threads    72.99 fps per Joule
TBB Parallel_for, NEON, 4 threads       71.61 fps per Joule
…                                       …
TBB Pipeline, Serial, 1 thread          2.64 fps per Joule
OpenMP parallel for, Serial, 1 thread   2.58 fps per Joule
TBB Parallel_for, Serial, 1 thread      2.47 fps per Joule

6.2.4 SRAD on Ultrabook
We will conclude this section of the discussion of the results with an analysis of running the
SRAD application on the Ultrabook. We will first discuss the effect of the single-threaded
optimizations on performance. These results are displayed in Figures 18(a) and (b). Not
surprisingly, the best single-threaded configuration is the AVX code. However, an
important point to notice is the small improvement that was achieved by adding vector
instructions. In terms of total application execution time, SSE and AVX only provided a 1.2x and
1.27x speedup, respectively. This small speedup is largely due to the short running
time of the serial version on the Ultrabook, and thus the amount of speedup gained with
vectorization alone does not tell the entire story. When discussing the performance speedup by
kernel, we see generally the same pattern that we saw on the Odroid, with two main exceptions:
the Extraction and Compression kernels. Recall that on the Odroid, the
running time for these two kernels improved drastically with the addition of vector instructions,
which was the cause of much of the reduction in execution time. On the Ultrabook,
the opposite effect was experienced. In the Extraction kernel, the SSE and AVX code took 17%
and 25% longer than the serial version, respectively. The Compression kernel took on average
42% longer for both SSE and AVX. The cause of this seems to be the opposite of what happened
on the Odroid. The exp and log functions for both Intel’s and ARM’s vector instructions had
almost the exact same structure, which would lead one to believe that they would have similar
effects when added to the exact same Extraction and Compression kernels. However, the serial
C/C++ execution of exp and log on the Ultrabook had significantly higher performance. On the
Odroid, the Extraction and Compression kernels took 75% of the execution time of the baseline
code. However, these same kernels only took 33% of the execution time on the Ultrabook,
allowing for much less room for overall optimization according to Amdahl’s Law.
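
The Amdahl’s Law argument can be made concrete with a short sketch. It assumes, as a
simplification, that the Extraction and Compression kernels are the only part of the application
being accelerated. The function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's Law: overall speedup when a fraction f of the run time is accelerated
// by a factor s and the remainder is left unchanged.
double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Upper bound on the overall speedup as s grows without limit.
double amdahl_limit(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With 75% of the baseline time in these kernels, as on the Odroid, the overall speedup is bounded
by 4x. With 33%, as on the Ultrabook, the bound is only about 1.49x, no matter how much faster
the two kernels become.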
 
Next, we will discuss the energy consumption based on the single-threaded
optimizations. Refer to Figures 19(a) and (b) for the detailed results. Again, the AVX code
provides the best results. The energy consumption largely follows the same pattern as
performance, which is to be expected since energy is the product of average power and time. An
increase in time, without a corresponding decrease in average power, will result in higher energy
consumption. Thus we can see that in Computation Kernel 1, the SSE version of the code takes
about 43% less time than the serial version, and it also consumes around 43% less energy,
leading us to estimate that the average power remained the same. A minor note is that with both
the Extraction and Compression kernels, the execution time of AVX was slightly longer than that
of the SSE version. However, the energy consumed in both of those kernels was less. This leads
us to believe that using AVX code, at least in the case of these two kernels, requires a lower
average power, low enough that even with an increase in time, the total energy consumed was
lowered.
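
The power estimate above follows from energy being average power multiplied by time. As a
minimal sketch, with a hypothetical function name, the ratio of average power between two
versions can be recovered from their energy and time ratios:

```cpp
#include <cassert>

// Ratio of average power (optimized / baseline), derived from E = P_avg * t.
// Both arguments are ratios of the optimized version to the baseline version.
double relative_average_power(double energy_ratio, double time_ratio) {
    assert(time_ratio > 0.0);
    return energy_ratio / time_ratio;
}
```

For Computation Kernel 1, 43% less time and 43% less energy give 0.57/0.57, or an unchanged
average power. For illustration only, a version that took 5% longer but used 3% less energy
(hypothetical values) would have a ratio of about 0.92, meaning lower average power.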
 
Figure 18: SRAD total single threaded performance on Ultrabook (a) SRAD single threaded
performance on Ultrabook per kernel (b)

The performance results from the threadwise optimizations of SRAD on the Ultrabook are
presented in Figure 20. The top configuration runs AVX code with four threads, with a
throughput of 55.5 fps. The worst performing configuration is TBB’s
parallel_for running serial code on one thread, at a throughput of 13.3 fps. TBB’s parallel_for is
also the worst threadwise optimization on average with respect to performance. Unlike the
performance results from the Odroid platform, the performance speedup curve for the Ultrabook
was not always linear. For both OpenMP and TBB’s parallel_for, the speedup was generally
linear, which leads us to believe, as was stated for the Odroid, that the application was largely
computation bound. However, TBB’s pipeline seems to start to level out when increasing from
three to four threads. This suggests that with the pipeline and multiple threads, SRAD
begins to become more memory bound. This is largely due to the fact that each thread is
processing an individual frame, and each of the kernels uses a large amount of memory
bandwidth. With OpenMP and parallel_for, the stress on the memory system is not as great and
thus it is not a bottleneck. TBB’s pipeline has the best average performance gains as well as the
single best configuration.
 
Figure 19: SRAD total single threaded energy consumption on Ultrabook (a) SRAD single
threaded energy consumption on Ultrabook per kernel (b)

The results of the energy consumption on the Ultrabook across the threadwise optimizations are
shown in Figure 21. The worst configuration with respect to energy consumption was
TBB’s parallel_for with no vectorization, running on only one thread. This configuration
consumed 0.64 Joules per frame. The best configuration was TBB’s pipeline running AVX code
on four threads, which consumed around a third of the energy of the worst configuration at 0.23
Joules per frame. TBB’s pipeline yet again displayed the lowest energy consumption. Overall,
TBB’s parallel_for was the worst performer in terms of energy consumption on average. The
parallel_for averaged just under 10% more energy consumed than OpenMP and almost 32%
more energy than TBB’s pipeline. While the parallel_for was the worst on average, in a few
cases it performed similarly to OpenMP, and with a few configurations it
performed even better. An interesting fact to note is that adding a second thread reduced SRAD’s
energy consumption by more than 25% on average, more than any other
savings produced by additional threads for any other application and platform.
 
Figure 20: SRAD multithreaded serial code performance on Ultrabook (a), SRAD multithreaded
SSE code performance on Ultrabook (b), SRAD multithreaded AVX code performance on
Ultrabook (c)

Figure 21: SRAD multithreaded serial code energy consumption on Ultrabook (a), SRAD
multithreaded SSE code energy consumption on Ultrabook (b), SRAD multithreaded AVX code
energy consumption on Ultrabook (c)

In terms of throughput per energy, TBB’s pipeline performed the best overall. The top
configuration was the pipeline running AVX code on four threads, while the worst
configuration was yet again parallel_for with serial code on one thread. TBB’s pipeline has eight
of the top ten configurations as well as the top six spots overall. TBB’s parallel_for, on the other
hand, had four of the six worst performers. However, on average, OpenMP performed almost as
poorly as parallel_for, averaging 69.9 fps per Joule where parallel_for averaged 56.8 fps per
Joule. TBB’s pipeline performed better by far than the other two, with an average of almost 121.5
fps per Joule. The top and bottom three configurations are shown in Table 5 and the entire table
of results is found in Table 11 in Appendix A.

Table 5: Throughput per energy results for SRAD on Ultrabook

Configuration                        Throughput per energy
TBB Pipeline, AVX, 4 threads         235.35 fps per Joule
TBB Pipeline, SSE, 4 threads         216.62 fps per Joule
TBB Pipeline, Serial, 4 threads      165.2 fps per Joule
…                                    …
TBB Parallel_for, SSE, 1 thread      27.23 fps per Joule
TBB Pipeline, Serial, 1 thread       25.83 fps per Joule
TBB Parallel_for, Serial, 1 thread   20.93 fps per Joule

Chapter 7. Conclusion 
 
When discussing serial optimizations, there are two main optimizations that need mentioning: 
memory tiling and vector extensions. In general, memory tiling in the ViVid application did not 
turn out to be as beneficial as predicted. In only a few instances did it provide a performance 
improvement. There were, however, several instances where it provided energy savings, but 
overall memory tiling did not seem like a valuable optimization technique.  
 
7.1 Serial Optimization Comparison 
With regard to vector extensions, it is difficult to accurately compare them, especially across
platforms. The serial running time and energy consumption were so different that making a
direct comparison would not be an accurate evaluation. However, results showed that regardless
of which extensions were chosen, any form of vectorization proved quite beneficial in all aspects
of performance of the code. Overall, AVX code yielded the fastest
running times, lowest energy consumed and highest throughput per energy of the three forms of
vectorization. This assessment is made when comparing against that platform’s serial version.
While AVX did perform the best with respect to the three metrics, this fact must be taken with a
grain of salt. More adjustments needed to be made to the AVX code in order to produce correct
results. Also, to deal with edge cases, much of the AVX code was serialized rather than strictly
using AVX vector instructions. This was done in order to improve performance, since performing
a reduction across an AVX register would be more costly than a reduction across an SSE
register. Another point to consider is that in many cases, such as the Histogram kernel in ViVid,
any form of vectorization would have proved to be an enormous bottleneck since memory
accesses were not sequential. Also, in the Computation kernels in SRAD, some parts were left
serialized since that particular section of the code did not lend itself well to being vectorized. In
general, vectorization is a useful tool, but it must be used with care.
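
The cost of reducing across a register can be illustrated with a scalar model, given here as a
sketch rather than the thesis implementation. The register is modelled as a std::vector, the
function name is hypothetical, and each pass of the outer loop stands in for one shuffle-and-add
step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree) sum of the lanes of one register, modelled as a vector whose size
// is a power of two. Writes the number of shuffle-and-add steps taken to 'steps'.
float horizontal_sum(std::vector<float> lanes, int& steps) {
    assert(!lanes.empty() && (lanes.size() & (lanes.size() - 1)) == 0);
    steps = 0;
    for (std::size_t width = lanes.size() / 2; width >= 1; width /= 2) {
        for (std::size_t i = 0; i < width; ++i) {
            lanes[i] += lanes[i + width];  // fold the upper half onto the lower half
        }
        ++steps;
    }
    return lanes[0];
}
```

A four-lane register needs two steps and an eight-lane register needs three. On AVX, one of
those steps must also combine the two 128-bit halves of the register, which typically requires
an extra extract or permute instruction.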
 
 
7.2 Threadwise Optimization Comparison 
For threadwise optimizations, there were two main factors to consider: the number of threads and
the parallelization technique. In terms of the number of threads, using the maximum of four
threads was almost invariably the optimum choice. Four threads averaged the fastest running
times and lowest average energy consumption. This can be attributed to the fact that with more
threads running, especially with hyperthreading, there was less time during which the hardware
was powered on but not performing any useful function. When running either application on
either platform, configurations with four threads took at least the top three spots in the
throughput per energy category.

Overall, the best parallelization technique was TBB’s pipeline. The pipeline was consistently the
best performer according to all three metrics in both applications on the Odroid as well as in
SRAD on the Ultrabook. OpenMP performed better when running ViVid on the
Ultrabook, but by less than 5% over the pipeline in all metrics. TBB’s parallel_for
was clearly the worst performer, averaging around 30% more energy
consumed and 40% longer execution time than the pipeline for both applications on both
platforms.
 
7.3 Platform Comparison 
When comparing two platforms, there are many things to consider, such as price,
programmability, and setup time and effort, among other factors. The comparison of the two
platforms used in this report is based only on the three metrics detailed in the Results
section, but all of the previously mentioned factors must be taken into consideration when
purchasing a platform. Tables 6 and 7 contain the results that we will use to
compare the two platforms. Each table has the results from both the best and worst
configurations for all three metrics in both applications on both platforms. Table 6 has these
results recorded when using the -O0 and /Od compiler flags, while Table 7 has the results from
runs using the -O2 and /O2 compiler flags, which optimize for speed. The results in the tables
represent configurations that included some form of threadwise optimization, whether it be
OpenMP or TBB. Code that did not use OpenMP or TBB was not included. In Table 7, a *
indicates a configuration that changed when the -O2 or /O2 flag was included. Along with the
numerical results are the actual configurations themselves. Beneath the numerical results in
Table 7 are the improvements over those found in Table 6.
 
7.3.1 SRAD Comparison 
To begin the comparison, we will first discuss the SRAD application. The Ultrabook was clearly
the better overall performing platform for this application. For each metric, the Ultrabook’s best
performance was always better than the Odroid’s with or without the compiler optimization flags
enabled; its worst performance was never as poor as the Odroid’s, and its average performance
was always higher. In fact, the worst throughput for the Ultrabook was almost as high as the best
performance achieved by the Odroid. In terms of improvement over the -O0 and /Od versions,
the Odroid saw almost no improvement from enabling compiler optimization, as opposed to the
Ultrabook, which experienced at least a 1.5x improvement in all but one metric.
 
7.3.2 ViVid Comparison 
The comparison for the ViVid application is much closer. Again the Ultrabook outperforms the 
Odroid in throughput, with a best result more than 2x that of the Odroid with no compiler flags 
and almost 3x with the flags enabled. The Odroid did make up some ground in worst-case 
throughput, going from almost 4x slower than the Ultrabook’s worst to only around 2x slower. 
Where the Odroid shines in this comparison is energy. The Ultrabook’s best energy per frame is 
about 20% lower than the Odroid’s, but in the worst case the Odroid uses only 40% of the energy 
the Ultrabook does. Although the best performing configuration on the Ultrabook was better than 
that on the Odroid, the average energy consumed across all of the evaluated configurations was 
almost 50% higher on the Ultrabook than on the Odroid. This pattern holds fairly steady with or 
without the compiler flags. Despite the Odroid’s impressive energy savings, the Ultrabook still 
seems to be the better platform for ViVid: the energy saved on the Odroid is not proportional to 
the performance lost. This is shown by throughput per energy, where the Ultrabook processes 
more than 2x the frames per second per Joule of the Odroid, with and without the compiler flags. 
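
The throughput-per-energy argument can be checked with a short calculation. The sketch below 
assumes that throughput per energy is throughput (frames/second) divided by energy per frame 
(Joules/frame); this reproduces several, though not all, of the entries in Table 6 to within 
rounding, so it should be read as an inference from the tabulated numbers. The helper names are 
hypothetical. 

```cpp
// Throughput per energy, assumed here to be throughput (frames/second)
// divided by energy per frame (Joules/frame).
double throughput_per_energy(double frames_per_second, double joules_per_frame) {
    return frames_per_second / joules_per_frame;
}

// How many times larger the Ultrabook's figure is than the Odroid's.
double platform_ratio(double ultrabook_value, double odroid_value) {
    return ultrabook_value / odroid_value;
}

// ViVid on the Ultrabook, best /O0 configuration in Table 6:
// 2.69 frames/second at 4.15 Joules/frame.
const double ultrabook_best = throughput_per_energy(2.69, 4.15);

// Best ViVid throughput-per-energy values from Tables 6 and 7.
const double ratio_unoptimized = platform_ratio(0.648, 0.23);  // -O0 and /O0
const double ratio_optimized   = platform_ratio(20.89, 4.7);   // -O2 and /O2
```

Both ratios exceed 2 (about 2.8 without the compiler flags and about 4.4 with them), which is the 
basis for preferring the Ultrabook despite the Odroid’s lower worst-case energy. 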
 
 
 
 
Table 6: Best and worst results using -O0 and /O0 
 
Application and platform | Metric | Best value | Best configuration | Worst value | Worst configuration 
ViVid on Odroid | Throughput (frames/second) | 1.26 | TBB Pipeline, NEON, no Mem Tile, 4 threads | 0.11 | TBB Parallel_for, Serial, no Mem Tile, 1 thread 
ViVid on Odroid | Energy (Joules/frame) | 5.05 | TBB Pipeline, NEON, Mem Tile, 2 threads | 15.8 | TBB Parallel_for, Serial, no Mem Tile, 1 thread 
ViVid on Odroid | Throughput per energy (fps/Joule) | 0.23 | OpenMP parallel for, NEON, no Mem Tile, 4 threads | 0.007 | TBB Parallel_for, Serial, no Mem Tile, 1 thread 
ViVid on Ultrabook | Throughput (frames/second) | 2.69 | TBB Parallel_for, AVX, Mem Tile, 4 threads | 0.43 | TBB Parallel_for, Serial, Mem Tile, 1 thread 
ViVid on Ultrabook | Energy (Joules/frame) | 4.15 | TBB Parallel_for, AVX, Mem Tile, 4 threads | 39.23 | TBB Parallel_for, Serial, Mem Tile, 1 thread 
ViVid on Ultrabook | Throughput per energy (fps/Joule) | 0.648 | TBB Parallel_for, AVX, Mem Tile, 4 threads | 0.006 | TBB Parallel_for, Serial, Mem Tile, 1 thread 
SRAD on Odroid | Throughput (frames/second) | 23.7 | TBB Pipeline, NEON, 4 threads | 1.98 | TBB Parallel_for, Serial, 1 thread 
SRAD on Odroid | Energy (Joules/frame) | 0.264 | TBB Pipeline, NEON, 4 threads | 0.8 | TBB Parallel_for, Serial, 1 thread 
SRAD on Odroid | Throughput per energy (fps/Joule) | 87.43 | TBB Pipeline, NEON, 4 threads | 2.47 | TBB Parallel_for, Serial, 1 thread 
SRAD on Ultrabook | Throughput (frames/second) | 55.5 | TBB Parallel_for, AVX, 4 threads | 13.3 | TBB Parallel_for, Serial, 1 thread 
SRAD on Ultrabook | Energy (Joules/frame) | 0.23 | TBB Parallel_for, AVX, 4 threads | 0.64 | TBB Parallel_for, Serial, 1 thread 
SRAD on Ultrabook | Throughput per energy (fps/Joule) | 235.4 | TBB Parallel_for, AVX, 4 threads | 20.93 | TBB Parallel_for, Serial, 1 thread 
 
 
 
 
 
	  
Table 7: Best and worst results using -O2 and /O2, with the improvement in best and worst performance over using -O0 and /O0 in parentheses 
 
Application and platform | Metric | Best value (improvement) | Best configuration | Worst value (improvement) | Worst configuration 
ViVid on Odroid | Throughput (frames/second) | 5.69 (4.52x) | TBB Pipeline, NEON, Mem Tile, 4 threads* | 0.31 (2.82x) | TBB Parallel_for, Serial, no Mem Tile, 1 thread 
ViVid on Odroid | Energy (Joules/frame) | 1.14 (4.43x) | OpenMP parallel for, NEON, Mem Tile, 3 threads | 4.98 (3.17x) | TBB Parallel_for, Serial, no Mem Tile, 1 thread 
ViVid on Odroid | Throughput per energy (fps/Joule) | 4.7 (20.43x) | OpenMP parallel for, NEON, no Mem Tile, 4 threads | 0.06 (8.96x) | TBB Parallel_for, Serial, no Mem Tile, 1 thread 
ViVid on Ultrabook | Throughput (frames/second) | 16.3 (6.06x) | TBB Pipeline, SSE, no Mem Tile, 4 threads* | 0.66 (1.53x) | TBB Parallel_for, Serial, no Mem Tile, 1 thread* 
ViVid on Ultrabook | Energy (Joules/frame) | 0.77 (5.39x) | TBB Pipeline, SSE, Mem Tile, 4 threads* | 14.3 (2.74x) | TBB Parallel_for, Serial, no Mem Tile, 1 thread* 
ViVid on Ultrabook | Throughput per energy (fps/Joule) | 20.89 (32.23x) | TBB Pipeline, SSE, Mem Tile, 4 threads* | 0.05 (8.47x) | TBB Parallel_for, Serial, no Mem Tile, 1 thread* 
SRAD on Odroid | Throughput (frames/second) | 23.88 (1.01x) | TBB Pipeline, NEON, 4 threads | 1.99 (1.01x) | TBB Parallel_for, Serial, 1 thread 
SRAD on Odroid | Energy (Joules/frame) | 0.27 (1.02x) | TBB Pipeline, NEON, 3 threads* | 0.8 (1.00x) | TBB Parallel_for, Serial, 1 thread 
SRAD on Odroid | Throughput per energy (fps/Joule) | 85.19 (1.03x) | TBB Pipeline, NEON, 4 threads | 2.48 (1.00x) | TBB Parallel_for, Serial, 1 thread 
SRAD on Ultrabook | Throughput (frames/second) | 86.02 (1.55x) | TBB Pipeline, AVX, 3 threads* | 20.0 (1.5x) | OpenMP parallel for, Serial, 1 thread* 
SRAD on Ultrabook | Energy (Joules/frame) | 0.13 (1.77x) | TBB Pipeline, AVX, 2 threads* | 0.47 (1.36x) | OpenMP parallel for, Serial, 1 thread* 
SRAD on Ultrabook | Throughput per energy (fps/Joule) | 640.65 (2.72x) | TBB Pipeline, AVX, 2 threads* | 48.09 (2.3x) | OpenMP parallel for, Serial, 1 thread* 
 
 
 
 
 
Chapter 8. Future work 
This work has many opportunities for enhancement. Since it targets the heterogeneous 
architecture of these platforms, one possible extension is to incorporate the GPU in the 
processing of these applications. In [6], the GPU was used to help speed up the processing of 
ViVid on the Ultrabook. ViVid on the Odroid could be extended in a similar way to explore 
potential benefits. The SRAD application could also make use of the GPU on both the Ultrabook 
and the Odroid, especially in the two Computation kernels. Another possible direction would be 
to evaluate the applications on other platforms. A popular recent trend in industry has been the 
addition of an on-chip GPU, as in the Qualcomm Snapdragon S4 [20], the AMD APU [21] and the 
NVIDIA Tegra [22]; evaluating such platforms would broaden the scope of this work. Finally, 
TBB’s pipeline was able to perform so well on these applications because of the frame-level 
independence in both ViVid and SRAD. An interesting extension would be to apply similar 
optimizations to applications that have varying levels of independence. The independence of 
elements within a frame, as well as of the frames themselves, allowed for essentially all of the 
optimizations made here, both serial and threadwise. Applications with more dependence between 
frames, between internal kernels, or even between the elements themselves could yield 
interesting results when parallelized.  
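
The difference between frame-independent and frame-dependent workloads can be illustrated with 
a small C++ sketch. It is not code from ViVid or SRAD: the per-frame kernel, the temporal 
smoothing filter, and all names are hypothetical stand-ins, and the second function assumes 
every frame has the same number of elements. The OpenMP pragma takes effect only when the code 
is compiled with OpenMP enabled. 

```cpp
#include <cstddef>
#include <vector>

using Frame = std::vector<float>;

// Hypothetical per-frame kernel: the output depends only on the input frame.
Frame process_frame(const Frame& in) {
    Frame out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = 2.0f * in[i];
    }
    return out;
}

// Frame-independent case: every frame can be handled by a different thread.
std::vector<Frame> process_independent(const std::vector<Frame>& frames) {
    std::vector<Frame> out(frames.size());
    const long n = static_cast<long>(frames.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long f = 0; f < n; ++f) {
        out[f] = process_frame(frames[f]);
    }
    return out;
}

// Frame-dependent case (hypothetical temporal smoothing): frame f needs the
// result of frame f-1, so the outer loop over frames must run in order.
// Only the inner loop over elements remains free of dependences.
std::vector<Frame> process_dependent(const std::vector<Frame>& frames) {
    std::vector<Frame> out(frames.size());
    for (std::size_t f = 0; f < frames.size(); ++f) {
        out[f].resize(frames[f].size());
        for (std::size_t i = 0; i < frames[f].size(); ++i) {
            const float prev = (f == 0) ? frames[f][i] : out[f - 1][i];
            out[f][i] = 0.5f * (frames[f][i] + prev);
        }
    }
    return out;
}
```

In the first function the frames could equally be fed through a pipeline, as was done here with 
TBB; in the second, only element-level parallelism within a frame is available, which is the 
kind of application suggested above for further study. 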
 
 
 
 
 
 
References 
[1]  GCC Team, “GCC, the GNU Compiler Collection,” last modified June 2, 2014. [Online]. 
Available: https://gcc.gnu.org/ 
 
[2] Intel Corporation, “Intel® C and C++ Compilers,” [Online]. Available: 
https://software.intel.com/en-us/intel-compilers 
 
[3] Intel Corporation, “Intel® Threading Building Blocks 4.2 Update 5,” [Online]. Available:  
https://www.threadingbuildingblocks.org/ 
 
[4] OpenMP Architecture Review Board, “OpenMP Application Program Interface Version 
4.0,” last modified July 2013. [Online]. Available: 
http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf 
 
[5] Khronos Group, “OpenCL. The open standard for parallel programming of heterogeneous 
systems,” [Online]. Available: https://www.khronos.org/opencl/ 
 
[6]  E. Totoni, M. Dikmen, and M. J. Garzaran, “Easy, fast, and energy-efficient object 
detection on heterogeneous on-chip architectures,” ACM Transactions on Architecture 
and Code Optimization, vol. 10, no. 4, article 45, 2013. 
 
[7] T. Willhalm, “Intel® Performance Counter Monitor - A better way to measure CPU 
utilization,” last modified August 16, 2012. [Online]. Available: 
https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization 
 
[8] The Open Group, “open(3) - Linux man page,” [Online]. Available: 
http://linux.die.net/man/3/open 
 
[9] The Open Group, “usleep(3) - Linux man page,” [Online]. Available: 
http://linux.die.net/man/3/usleep 
 
[10] L.M. Sanchez, J. Fernandez, R. Sotomayor, S. Escolar, and J.D. Garcia, “A comparative 
study and evaluation of parallel programming models for shared-memory parallel 
architectures,” New Generation Computing, vol. 31, no. 3, pp. 139-161, July 2013. 
 
[11] P.D. Michailidis and K.G. Margaritis, “Implementing Basic Computational Kernels of 
Linear Algebra on Multicore,” 2012 16th Panhellenic Conference on Informatics (PCI 
2012), pp. 217-222. 
 
[12] L. Lacassagne, D. Etiemble, A.H. Zahraee, A. Dominguez, and P. Vezolle, “High level 
transforms for SIMD and low-level computer vision algorithms,” 2014 1st ACM 
SIGPLAN Workshop on Programming Models for SIMD/Vector Processing, WPMVP 
2014, pp. 49-56. 
[13] G. Mitra, B. Johnson, A.P. Rendell, E. McCreath, and J. Zhou, “Use of SIMD vector 
operations to accelerate application code performance on low-powered ARM and Intel 
platforms,” in Proc. IEEE 27th International Parallel and Distributed Processing 
Symposium Workshops and PhD Forum, IPDPSW, 2013, pp. 1107-1116. 
 
[14] J. Wan, R.G. Wang, H. Lv, L. Zhang, W.M. Wang, C.C. Gu, Q.Z. Zheng, and W. Gao, 
“AVS video decoding acceleration on ARM Cortex-A with NEON,” in Signal 
Processing, Communication and Computing (ICSPCC), 2012 IEEE International 
Conference on, 2012, pp. 290-294. 
 
[15] A. Heinecke, M. Klemm, H. Pabst, and D. Pflüger, “Toward high-performance 
implementations of a custom HPC kernel using Intel® array building blocks,” Facing the 
Multicore-Challenge II, pp. 36-47, 2012. 
 
[16] N. Rajovic, A. Rico, J. Vipond, I. Gelado, N. Puzovic and A. Ramirez, “Experiences with 
mobile processors for energy efficient HPC,” in Design, Automation & Test in Europe 
Conference & Exhibition (DATE), pp. 464–468, 2013. 
 
[17] N. Rajovic, P.M. Carpenter, I. Gelado, N. Puzovic, A. Ramirez and M. Valero, 
“Supercomputing with commodity CPUs: Are mobile SoCs ready for HPC?” in Proc. 
International Conference for High Performance Computing, Networking, Storage and 
Analysis (SC ’13), 2013.   
 
[18] “Simple SSE and SSE2 (and now NEON) optimized sin, cos, log and exp,” last modified 
May 29, 2011. [Online]. Available: gruntthepeon.free.fr/ssemath/ 
 
[19] R. Gonzalez and M. Horowitz, “Energy dissipation in general purpose microprocessors,” 
IEEE Journal of Solid-State Circuits, vol. 31, no. 9, pp. 1277-1284, 1996. 
 
[20] J. Bausch, “HTC uses NASA technology to toughen up its phones: Unique concept 
makes the new One S phone the most durable device on the market,” Electronic Products, 
vol. 54, no. 5, May 2012. 
 
[21] D. Foley, P. Bansal, D. Cherepacha, R. Wasmuth, A. Gunasekar, S. Gutta, and A. Naini, 
“A low-power integrated x86-64 and graphics processor for mobile computing devices,” 
IEEE Journal of Solid-State Circuits, vol. 47, no. 1, pp. 220–231, 2012. 
 
[22] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A Unified 
Graphics and Computing Architecture,” Micro, IEEE vol. 28, no. 2, pp. 39–55, 2008. 
 
 
 
 
Appendix A. Throughput per Energy Tables 
Table 8: Throughput per Energy results for ViVid on Odroid 
Configuration Throughput per energy 
OpenMP parallel for, NEON, no memory tiling, 4 threads 0.23 fps per Joule 
TBB Pipeline, NEON, no memory tiling, 4 threads 0.2288 fps per Joule 
OpenMP parallel for, NEON, memory tiling, 4 threads 0.2103 fps per Joule 
TBB Pipeline, NEON, memory tiling, 3 threads 0.2084 fps per Joule 
TBB Parallel_for, NEON, no memory tiling, 4 threads 0.195 fps per Joule 
TBB Pipeline, NEON, memory tiling, 4 threads 0.191 fps per Joule 
OpenMP parallel for, NEON, no memory tiling, 3 threads 0.1887 fps per Joule 
TBB Pipeline, NEON, no memory tiling, 3 threads 0.1877 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 4 threads 0.1676 fps per Joule 
OpenMP parallel for, NEON, memory tiling, 3 threads 0.1539 fps per Joule 
TBB Parallel_for, NEON, no memory tiling, 3 threads 0.1499 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 3 threads 0.1446 fps per Joule 
TBB Pipeline, NEON, memory tiling, 2 threads 0.1425 fps per Joule 
TBB Pipeline, NEON, no memory tiling, 2 threads 0.1272 fps per Joule 
OpenMP parallel for, NEON, no memory tiling, 2 threads 0.1225 fps per Joule 
OpenMP parallel for, NEON, memory tiling, 2 threads 0.1104 fps per Joule 
TBB Parallel_for, NEON, no memory tiling, 2 threads 0.1052 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 2 threads 0.1013 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 4 threads 0.0627 fps per Joule 
TBB Pipeline, Serial, memory tiling, 4 threads 0.0587 fps per Joule 
TBB Pipeline, NEON, memory tiling, 1 thread 0.0585 fps per Joule 
OpenMP parallel for, NEON, memory tiling, 1 thread 0.0579 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 1 thread 0.0579 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 1 thread 0.0569 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 3 threads 0.0507 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 1 thread 0.049 fps per Joule 
TBB Pipeline, Serial, memory tiling, 3 threads 0.0487 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 1 thread 0.0473 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 4 threads 0.0373 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 4 threads 0.0355 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 2 threads 0.0346 fps per Joule 
TBB Pipeline, Serial, memory tiling, 2 threads 0.0332 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 3 threads 0.0328 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 3 threads 0.0285 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 4 threads 0.0281 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 4 threads 0.0277 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 2 threads 0.0229 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 2 threads 0.021 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 3 threads 0.0208 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 3 threads 0.0203 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 1 thread 0.0162 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 2 threads 0.0157 fps per Joule 
TBB Pipeline, Serial, memory tiling, 1 thread 0.0154 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 2 threads 0.0152 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 1 thread 0.0118 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 1 thread 0.0109 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 1 thread 0.0073 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 1 thread 0.0067 fps per Joule 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Table 9: Throughput per Energy results for ViVid on Ultrabook 
Configuration Throughput per energy 
TBB Parallel_for, AVX, memory tiling, 4 threads 0.6482 fps per Joule 
TBB Parallel_for, AVX, no memory tiling, 4 threads 0.624 fps per Joule 
OpenMP parallel for, AVX, no memory tiling, 4 threads 0.5954 fps per Joule 
TBB Pipeline, Serial, memory tiling, 4 threads 0.5493 fps per Joule 
TBB Pipeline, NEON, memory tiling, 4 threads 0.5288 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 4 threads 0.5036 fps per Joule 
TBB Parallel_for, AVX, no memory tiling, 3 threads 0.4833 fps per Joule 
OpenMP parallel for, AVX, no memory tiling, 3 threads 0.4758 fps per Joule 
TBB Parallel_for, AVX, memory tiling, 3 threads 0.4666 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 4 threads 0.4587 fps per Joule 
OpenMP parallel for, AVX, memory tiling, 4 threads 0.4561 fps per Joule 
TBB Pipeline, Serial, memory tiling, 3 threads 0.4447 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 3 threads 0.4216 fps per Joule 
TBB Pipeline, NEON, memory tiling, 3 threads 0.4203 fps per Joule 
TBB Parallel_for, AVX, no memory tiling, 2 threads 0.4154 fps per Joule 
TBB Pipeline, AVX, no memory tiling, 4 threads 0.4039 fps per Joule 
TBB Pipeline, AVX, memory tiling, 4 threads 0.4008 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 3 threads 0.3885 fps per Joule 
OpenMP parallel for, AVX, no memory tiling, 2 threads 0.3656 fps per Joule 
OpenMP parallel for, SSE, memory tiling, 4 threads 0.3547 fps per Joule 
TBB Parallel_for, AVX, memory tiling, 2 threads 0.3532 fps per Joule 
TBB Pipeline, AVX, no memory tiling, 3 threads 0.3502 fps per Joule 
TBB Pipeline, AVX, memory tiling, 3 threads 0.3369 fps per Joule 
TBB Pipeline, Serial, memory tiling, 2 threads 0.322 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 4 threads 0.3184 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 2 threads 0.313 fps per Joule 
OpenMP parallel for, AVX, memory tiling, 2 threads 0.308 fps per Joule 
OpenMP parallel for, AVX, memory tiling, 3 threads 0.3059 fps per Joule 
OpenMP parallel for, SSE, memory tiling, 3 threads 0.2807 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 2 threads 0.2575 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 3 threads 0.2435 fps per Joule 
TBB Pipeline, AVX, memory tiling, 2 threads 0.2359 fps per Joule 
TBB Pipeline, AVX, no memory tiling, 2 threads 0.2319 fps per Joule 
OpenMP parallel for, SSE, memory tiling, 2 threads 0.2194 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 2 threads 0.2066 fps per Joule 
TBB Pipeline, NEON, memory tiling, 2 threads 0.1972 fps per Joule 
OpenMP parallel for, AVX, no memory tiling, 1 thread 0.1849 fps per Joule 
 
 
OpenMP parallel for, AVX, memory tiling, 1 thread 0.1775 fps per Joule 
TBB Parallel_for, AVX, no memory tiling, 1 thread 0.1724 fps per Joule 
TBB Parallel_for, AVX, memory tiling, 1 thread 0.1648 fps per Joule 
TBB Pipeline, Serial, memory tiling, 1 thread 0.1464 fps per Joule 
OpenMP parallel for, Serial, memory tiling, 1 thread 0.1417 fps per Joule 
OpenMP parallel for, SSE, memory tiling, 1 thread 0.1225 fps per Joule 
TBB Parallel_for, Serial, memory tiling, 1 thread 0.1094 fps per Joule 
TBB Pipeline, AVX, no memory tiling, 1 thread 0.1026 fps per Joule 
TBB Pipeline, AVX, memory tiling, 1 thread 0.0986 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 4 threads 0.0903 fps per Joule 
TBB Pipeline, NEON, memory tiling, 1 thread 0.0899 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 4 threads 0.0894 fps per Joule 
TBB Pipeline, NEON, no memory tiling, 4 threads 0.0811 fps per Joule 
TBB Parallel_for, NEON, memory tiling, 1 thread 0.0806 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 3 threads 0.0726 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 3 threads 0.0726 fps per Joule 
TBB Pipeline, NEON, no memory tiling, 3 threads 0.0701 fps per Joule 
OpenMP parallel for, SSE, no memory tiling, 4 threads 0.0641 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 2 threads 0.055 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 2 threads 0.0536 fps per Joule 
OpenMP parallel for, SSE, no memory tiling, 3 threads 0.0519 fps per Joule 
TBB Pipeline, NEON, no memory tiling, 2 threads 0.0501 fps per Joule 
OpenMP parallel for, SSE, no memory tiling, 2 threads 0.0417 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 1 thread 0.0236 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 4 threads 0.0229 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 3 threads 0.0214 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 1 thread 0.0206 fps per Joule 
TBB Parallel_for, NEON, no memory tiling, 4 threads 0.0204 fps per Joule 
OpenMP parallel for, Serial, no memory tiling, 1 thread 0.0196 fps per Joule 
TBB Parallel_for, NEON, no memory tiling, 3 threads 0.0187 fps per Joule 
TBB Pipeline, Serial, no memory tiling, 1 thread 0.0178 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 2 threads 0.0158 fps per Joule 
TBB Parallel_for, NEON, no memory tiling, 2 threads 0.014 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 1 thread 0.0069 fps per Joule 
TBB Parallel_for, Serial, no memory tiling, 1 thread 0.0059 fps per Joule 
 
 
 
 
 
Table 10: Throughput per Energy results for SRAD on Odroid 
Configuration Throughput per energy 
TBB Pipeline, NEON, 4 threads 87.43 fps per Joule 
OpenMP parallel for, NEON, 4 threads 72.99 fps per Joule 
TBB Parallel_for, NEON, 4 threads 71.61 fps per Joule 
TBB Pipeline, NEON, 3 threads 68.99 fps per Joule 
OpenMP parallel for, NEON, 3 threads 58.38 fps per Joule 
TBB Parallel_for, NEON, 3 threads 56.74 fps per Joule 
TBB Pipeline, NEON, 2 threads 46.37 fps per Joule 
OpenMP parallel for, NEON, 2 threads 41.03 fps per Joule 
TBB Parallel_for, NEON, 2 threads 39.81 fps per Joule 
TBB Pipeline, NEON, 1 thread 21.27 fps per Joule 
TBB Parallel_for, NEON, 1 thread 19.59 fps per Joule 
OpenMP parallel for, NEON, 1 thread 19.45 fps per Joule 
TBB Pipeline, Serial, 4 threads 11.14 fps per Joule 
TBB Parallel_for, Serial, 4 threads 10.12 fps per Joule 
OpenMP parallel for, Serial, 4 threads 10.02 fps per Joule 
TBB Pipeline, Serial, 3 threads 8.67 fps per Joule 
OpenMP parallel for, Serial, 3 threads 8.18 fps per Joule 
TBB Parallel_for, Serial, 3 threads 7.89 fps per Joule 
OpenMP parallel for, Serial, 2 threads 5.51 fps per Joule 
TBB Pipeline, Serial, 2 threads 5.39 fps per Joule 
TBB Parallel_for, Serial, 2 threads 5.35 fps per Joule 
TBB Pipeline, Serial, 1 thread 2.64 fps per Joule 
OpenMP parallel for, Serial, 1 thread 2.58 fps per Joule 
TBB Parallel_for, Serial, 1 thread 2.47 fps per Joule 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Table 11: Throughput per Energy results for SRAD on Ultrabook 
Configuration Throughput per energy 
OpenMP parallel for, Serial, 1 thread 235.35 fps per Joule 
OpenMP parallel for, Serial, 2 threads 216.62 fps per Joule 
OpenMP parallel for, Serial, 3 threads 165.2 fps per Joule 
OpenMP parallel for, Serial, 4 threads 164.81 fps per Joule 
OpenMP parallel for, SSE, 1 thread 163.82 fps per Joule 
OpenMP parallel for, SSE, 2 threads 126.99 fps per Joule 
OpenMP parallel for, SSE, 3 threads 120.85 fps per Joule 
OpenMP parallel for, SSE, 4 threads 117.92 fps per Joule 
OpenMP parallel for, AVX, 1 thread 117.67 fps per Joule 
OpenMP parallel for, AVX, 2 threads 106.85 fps per Joule 
OpenMP parallel for, AVX, 3 threads 103.76 fps per Joule 
OpenMP parallel for, AVX, 4 threads 96.47 fps per Joule 
TBB Parallel_for, Serial, 1 thread 89.9 fps per Joule 
TBB Parallel_for, Serial, 2 threads 88.48 fps per Joule 
TBB Parallel_for, Serial, 3 threads 84.67 fps per Joule 
TBB Parallel_for, Serial, 4 threads 81.3 fps per Joule 
TBB Parallel_for, SSE, 1 thread 76.35 fps per Joule 
TBB Parallel_for, SSE, 2 threads 75.32 fps per Joule 
TBB Parallel_for, SSE, 3 threads 69.72 fps per Joule 
TBB Parallel_for, SSE, 4 threads 62.52 fps per Joule 
TBB Parallel_for, AVX, 1 thread 58.58 fps per Joule 
TBB Parallel_for, AVX, 2 threads 57.99 fps per Joule 
TBB Parallel_for, AVX, 3 threads 52.9 fps per Joule 
TBB Parallel_for, AVX, 4 threads 49.27 fps per Joule 
TBB Pipeline, Serial, 1 thread 48.99 fps per Joule 
TBB Pipeline, Serial, 2 threads 38.75 fps per Joule 
TBB Pipeline, Serial, 3 threads 36.81 fps per Joule 
TBB Pipeline, Serial, 4 threads 35.99 fps per Joule 
TBB Pipeline, SSE, 1 thread 35.58 fps per Joule 
TBB Pipeline, SSE, 2 threads 35.16 fps per Joule 
TBB Pipeline, SSE, 3 threads 32.61 fps per Joule 
TBB Pipeline, SSE, 4 threads 28.84 fps per Joule 
TBB Pipeline, AVX, 1 thread 27.69 fps per Joule 
TBB Pipeline, AVX, 2 threads 27.23 fps per Joule 
TBB Pipeline, AVX, 3 threads 25.83 fps per Joule 
TBB Pipeline, AVX, 4 threads 20.93 fps per Joule 
 
