Western Michigan University

ScholarWorks at WMU
Dissertations

Graduate College

4-2020

Efficient Hardware/Software Partitioning Techniques for a Cloud-scale CPU-FPGA Platform
Samah Ziyad Rahamneh
Western Michigan University, samahrahamneh@gmail.com


Recommended Citation
Rahamneh, Samah Ziyad, "Efficient Hardware/Software Partitioning Techniques for a Cloud-scale CPU-FPGA Platform" (2020). Dissertations. 3598.
https://scholarworks.wmich.edu/dissertations/3598

This Dissertation-Open Access is brought to you for free
and open access by the Graduate College at
ScholarWorks at WMU. It has been accepted for inclusion
in Dissertations by an authorized administrator of
ScholarWorks at WMU. For more information, please
contact wmu-scholarworks@wmich.edu.

Efficient Hardware/Software Partitioning Techniques for a Cloud-scale
CPU-FPGA Platform

by
Samah Ziyad Rahamneh

A dissertation submitted to the Graduate College
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
Electrical and Computer Engineering
Western Michigan University
April 2020

Doctoral Committee:
Lina Sawalha, Ph.D., Chair
Johnson Asumadu, Ph.D.
Janos Grantner, Ph.D.
Alvis Fong, Ph.D.

Copyright by
Samah Ziyad Rahamneh
2020

Efficient Hardware/Software Partitioning Techniques for a Cloud-scale
CPU-FPGA Platform

Samah Ziyad Rahamneh, Ph.D.
Western Michigan University, 2020

The diversity of workload characteristics has stimulated the deployment of heterogeneous
architectures to accommodate the disparity in workload requirements in cloud data centers. In
heterogeneous computing, co-processors support Central Processing Units (CPUs) in fulfilling
workload demands. Field Programmable Gate Arrays (FPGAs) have advantages over other
accelerators because of their power, performance, and re-configurability benefits. To get the
most benefit from a heterogeneous platform, efficient partitioning of the workload between the
CPU and the FPGA is crucial.
This dissertation first presents the design and implementation of cooperative CPU-FPGA
execution techniques, which include code and data partitioning, for an image processing
algorithm on Intel’s Hardware Acceleration Research Program (HARP). The data partitioning
outperforms both CPU-only and FPGA-only implementations by up to 4.8X and 2.1X,
respectively. It also results in a 55.3% reduction in energy consumption, on average, compared
to the CPU-only implementation. The code partitioning resulted in up to 2.3X speedup
compared to a CPU-only implementation and improved system utilization.
The dissertation also presents automatic hardware/software partitioning of cloud-scale
applications, such as the k-means algorithm, the Canny algorithm, and the Advanced
Encryption Standard (AES) algorithm, on HARP. Particle Swarm Optimization (PSO) and Genetic
Algorithm (GA) were used to partition these applications, leveraging a multi-objective utility
function. The accuracy and the execution time of PSO depend to a large extent on its
parameters. However, researchers and practitioners typically use generally accepted fixed
parameter values. In this study, a machine learning-based tuning technique for the PSO
parameters was proposed and implemented. The results show an improvement in PSO
accuracy by up to 62.9% and in its execution time by up to 29%. Moreover, to mitigate the
effect of the premature convergence problem that GA and PSO suffer from, the PSO algorithm
is extended with a distributed greedy search technique. This approach
improves the accuracy of PSO by up to 55.4%. GA was also extended with the distributed
local search technique, which improved the accuracy of GA by up to 82.6%.
Finally, we propose and implement a variation of the PSO algorithm that partitions the
code and the data of an application between the CPU and the FPGA by assigning some
nodes to both devices with different data sets. This partitioning approach improves the
accuracy of PSO by up to 33% for a data parallel application, Canny.

ACKNOWLEDGEMENTS
I would like to express my sincere thanks to Dr. Lina Sawalha, who engineered my
PhD studies and supported me with her knowledge; to Prof. Johnson Asumadu, whose input
empowered my PhD dissertation; to Prof. Janos Grantner, whose input empowered my PhD
dissertation; and to Dr. Alvis Fong, who empowered my critical thinking and problem formulation
skills. I would also like to thank the University of Jordan, Amman, Jordan, for their
financial support; the faculty and staff of the Department of Electrical and Computer Engineering,
WMU; the Graduate College, represented by Dr. Christine Byrd-Jacobs; the National Science
Foundation (NSF), as this material is based upon work supported by NSF under grant No.
1821691; and Intel Inc. for their equipment access.

Samah Ziyad Rahamneh

DEDICATION
To my father Ziyad Rahamneh and my mother Bahieh Albqour, for their encouragement
and prayers during my whole life. To my husband, Saleh El-Manasir for his emotional and
financial support. To my sons, Abd Al-Kareem, Al-Mahdi, and Ahmed for their love and
emotional support.

Samah Ziyad Rahamneh

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
DEDICATION
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.1. Problem Statement
1.2. Research Objectives
1.3. Research Questions
1.4. Dissertation Organization
2. BACKGROUND AND LITERATURE REVIEW
2.1. Background
2.1.1. Application Partitioning
2.1.2. Application Modeling
2.1.3. Software Profiling and Hardware Cost Estimation
2.1.4. Objective (Cost) Function
2.1.5. HW/SW Partitioning Techniques
2.2. Taxonomy of Application Partitioning Efforts in Hardware/Software Co-design
2.3. Review and Discussion of Partitioning Techniques Based on the Partitioning Approach
2.3.1. Partitioning Efforts Based on the Partitioning Approach and Partitioning Model
2.3.2. Detailed Discussion of Some Existing Partitioning Efforts
2.3.3. Multi-objective Partitioning Algorithms
3. METHODOLOGY
3.1. Hardware and Software Tools
3.1.1. Hardware Acceleration Research Program (HARP)
3.1.2. Open Computing Language (OpenCL)
3.2. Cloud Applications
3.3. Partitioning Applications in a Heterogeneous CPU-FPGA System
3.3.1. HW/SW Partitioning Problem Formulation
3.4. Automated Partitioning Algorithms
3.4.1. Particle Swarm Optimization for HW/SW partitioning
3.4.2. Genetic Algorithm (GA)
4. SYNERGIC EXECUTION TECHNIQUES FOR A CPU-FPGA PLATFORM: CANNY EDGE DETECTOR AS A CASE STUDY
4.1. Background and Related Work
4.1.1. Sliding Window Based Edge Detectors: Canny Algorithm
4.1.2. Related Work
4.2. Hybrid CPU-FPGA Acceleration for Canny Algorithm
4.2.1. Canny Code Partitioning between the CPU and the FPGA
4.2.2. Delay-based Weighted Round Robin Distribution of the Workload between the CPU and the FPGA
4.3. Experimental Setup and Evaluation Metrics
4.4. Experimental Results and Discussion
4.4.1. Code Partitioning
4.4.2. Data Partitioning
4.5. Summary
5. AUTOMATED HW/SW PARTITIONING OF CLOUD APPLICATIONS USING HEURISTIC OPTIMIZATION ALGORITHMS
5.1. Experimental Setup
5.1.1. Modeling Applications Components
5.2. Artificially Tuned PSO (APSO) Parameters
5.2.1. Generating Data-set for PSO Training
5.3. Local Search-based Technique to Mitigate Premature Convergence
5.4. Results and Discussion
6. CODE AND DATA PARTITIONING
6.1. Code-Data Partitioning PSO (CDPSO)
6.2. Results and Discussion
7. CONCLUSION AND FUTURE WORK
7.1. Summary and Conclusion
7.2. Future Works
BIBLIOGRAPHY
.1. Partitioning Cost Results for the Different Benchmarks

LIST OF TABLES
2.1 Hybrid HW/SW partitioning algorithms.
2.2 A Comparison of HW/SW partitioning algorithms.
3.1 HARP HW/SW configurations.
3.2 PSO parameters definitions.
4.1 FPGA resource usage and frequency for different kernels implementations.
4.2 Execution time comparison among GPGPU and FPGA Canny accelerators and our CPU-FPGA hybrid implementation.
5.1 Components cost of k-means algorithm.
5.2 Components cost of Canny edge detection algorithm.
5.3 Components cost of advanced encryption standard algorithm.
5.4 PSO and GA experimental parameters.
5.5 Comparison of the partitioning cost among heuristics algorithms and ES.
1 Partitioning cost of k-means benchmark.
2 Partitioning cost of Canny benchmark.
3 Partitioning cost of AES benchmark.

LIST OF FIGURES
2.1 HW/SW application co-design process flowchart.
2.2 DAG representation of a five tasks application graph.
2.3 Thematic taxonomy of application partitioning efforts.
3.1 HARP system architecture ([91]).
3.2 OpenCL program compilation flow.
3.3 Particle Swarm Optimization algorithm flowchart.
3.4 Particles values for partitioning six nodes graph.
3.5 Genetic Algorithm flowchart.
4.1 A 3*3 Gaussian Filter.
4.2 Sobel vertical filter.
4.3 Sobel horizontal filter.
4.4 Vertical and horizontal operators of sobel filter.
4.5 Canny algorithm flow chart.
4.6 CPU-FPGA code-partitioned processing for Canny algorithm [104].
4.7 Hybrid CPU-FPGA processing of images [104].
4.8 CPU-FPGA tile-based processing for Canny edge detection algorithm [104].
4.9 Tile generation and padding [104].
4.10 Execution time for CPU-only, FPGA-only, and CPU-FPGA code-partitioned implementations [104].
4.11 Execution time of different images using different tiles sizes [104].
4.12 Execution time for CPU-only, FPGA-only, and CPU-FPGA hybrid implementations [104].
4.13 Energy delay product for CPU-only, FPGA-only, and CPU-FPGA hybrid implementation [104].
4.14 OpenCL kernel with multiple compute units and SIMD lanes [126].
5.1 Application modeling leveraging the LLVM compiler.
5.2 The structure of NN used in training PSO.
5.3 Partitioning cost of GA, MA, PSO, APSO and LPSO using 10, 30, and 60 iterations and different sizes of population for k-means algorithm.
5.4 Partitioning cost of GA, MA, PSO, APSO, and LPSO using 10, 20, and 30 iterations and different sizes of population for Canny algorithm.
5.5 Partitioning cost of GA, MA, PSO, APSO, and LPSO using 10, 30, and 60 iterations and different sizes of population for AES algorithm.
5.6 Partitioning latency of GA, MA, PSO, APSO and LPSO using 10, 30, and 60 iterations and different sizes of population for k-means algorithm.
5.7 Partitioning latency of GA, MA, PSO, APSO, and LPSO using 10, 20, and 30 iterations and different sizes of population for Canny algorithm.
5.8 Partitioning latency of GA, MA, PSO, APSO, and LPSO using 10, 30, and 60 iterations and different sizes of population for AES algorithm.
5.9 Execution time and energy consumption of GA, MA, PSO, APSO, and LPSO using k-means algorithm.
5.10 Execution time and energy consumption of GA, MA, PSO, APSO, and LPSO using Canny algorithm.
5.11 Execution time and energy consumption of GA, MA, PSO, APSO, and LPSO using AES algorithm.
5.12 Partitioning cost of GA, MA, PSO, APSO, and LPSO the average of 10, 30, and 60 iterations and different sizes of population for k-means, Canny, and AES algorithms.
5.13 Partitioning cost of GA, MA, PSO, APSO, and LPSO using 10, 30, and 60 iterations and the average of ten different populations for k-means, Canny, and AES algorithms.
6.1 HW/SW partitioning using PSO (upper) and CDPSO (lower).
6.2 HW/SW partitioning cost of GA, MA, PSO, APSO, LPSO, and CDPSO.
6.3 Execution time and energy consumption of GA, MA, PSO, APSO, LPSO, and CDPSO.

LIST OF ABBREVIATIONS

CSPs      Cloud Service Providers
HPC       High Performance Computing
OpenCL    Open Computing Language
PSO       Particle Swarm Optimization
TS        Tabu Search
SA        Simulated Annealing
BB        Branch and Bound
ACO       Ant Colony Optimization
ABO       Artificial Bee Optimization
LLVM      Low Level Virtual Machine
IR        Intermediate Representation
GA        Genetic Algorithm
MA        Memetic Algorithm
ESs       Embedded Systems
APSO      Artificial Particle Swarm Optimization
LPSO      Local-search Particle Swarm Optimization
CDPSO     Code Data Particle Swarm Optimization
NN        Neural Network
MLR       Multi Linear Regression
SIMD      Single Instruction Multiple Data
CU        Compute Unit

CHAPTER 1
INTRODUCTION
Cloud computing is a dominant computing paradigm in the IT industry due to its agile
on-demand access to hardware and software resources as well as its pay-as-you-go
pricing model. Cloud-based data centers serve a vast range of workloads such as machine
learning, image/video processing, and high-performance financial algorithms [1]. Emerging
resource-consuming applications (compute-consuming, storage-consuming, or network-consuming)
and the huge amount of data exported to the cloud by Internet of Things
(IoT) devices have introduced new challenges to Cloud Service Providers (CSPs). The diversity
of workload characteristics has stimulated the deployment of heterogeneous architectures to
accommodate the disparity in application requirements. A heterogeneous architecture is a mix
of different processing units, for instance, Central Processing Units (CPUs), Graphics
Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Application Specific
Integrated Circuits (ASICs). FPGAs have advantages over other accelerators in cloud-based
environments because of their power, performance, and re-configurability benefits. For example,
reconfiguration produces custom circuitry that is highly optimized for a specific workload [2].
It has been predicted that a large fraction of data center nodes will include FPGA logic by
2021 [3]. Below is a summary of the motivations behind the integration of FPGAs in the
cloud:
• Elastic re-programmability through flexible and customized circuit design. This agility
allows programmers to generate a circuit that meets application requirements.
• Performance improvement: FPGAs are capable of exploiting both the data-level and the
task-level parallelism of applications. This gives FPGAs superior performance over CPUs
and GPUs, as the CPU’s fixed circuitry is able to exploit mainly task-level parallelism
through multi-threading, whereas the GPU is able to exploit only data-level parallelism.
• Energy efficiency: the low clock frequency of FPGAs makes them power-efficient devices.
FPGAs reduce the total power consumption mainly through two co-design approaches. The
first approach is to offload computation-hungry tasks to the FPGA instead
of executing them on the CPU or GPU. The second approach is to deactivate un-configured
(unused) logic until it is used by another task [4].
• Capital Expenditure (CapEx) cost reduction: according to Microsoft, a server with an
integrated FPGA outperforms two servers without FPGAs in terms of response time for Bing
search page ranking [5], while two servers cost much more than one
server with an integrated FPGA.
• Emerging High-Level Synthesis (HLS) tools and the Open Computing Language (OpenCL):
HLS tools have made FPGAs programmable using high-level languages such as C and
OpenCL. This has unlocked FPGA power for all cloud customers, including moderate-level
programmers, and not only Hardware Description Language (HDL) programmers.
These Software Development Kits (SDKs) are provided by major FPGA vendors such as
Intel and Xilinx.

1.1. Problem Statement
The conventional approach to utilizing heterogeneous CPU-FPGA platforms is to offload
the entire workload to the FPGA. This approach has many challenges. Chief among these are
the huge size of the code and data of cloud applications/services and the limited
logic resources of the FPGA. Hence, to get the most benefit from such infrastructure, the
workload has to be intelligently distributed between the CPU and the FPGA. In this
dissertation, we propose and implement efficient techniques to partition the workload between
the CPU and the FPGA in a cloud-scale CPU-FPGA platform (Intel’s HARP) to increase
performance and reduce energy consumption.

1.2. Research Objectives
In this dissertation, we propose and implement automated application partitioning techniques
for a cloud-scale heterogeneous CPU-FPGA platform. These techniques aim at improving system
utilization, performance, and energy consumption. We also investigate some
of the widely used partitioning algorithms and address their weaknesses. In addition, we propose
and implement machine learning-based techniques to improve the performance of the Particle
Swarm Optimization (PSO) algorithm. Our partitioning technique uses a Multi-Objective
Optimization (MOO) function. The objectives of the dissertation are:
• Providing a detailed taxonomy of the HW/SW partitioning literature in embedded
systems and High-Performance Computing (HPC) systems, and comparing the existing
partitioning algorithms.
• Designing and implementing cooperative execution techniques between the CPU and
the FPGA in a cloud-scale heterogeneous CPU-FPGA platform.
• Formulating and solving the HW/SW partitioning problem in the context of emerging
CPU-FPGA integrated architectures. In this step, we use a machine-learning-based
variation of PSO to partition OpenCL-based cloud applications/services. In addition,
we use heuristic local-search techniques to improve the quality of PSO and
Genetic Algorithm (GA) partitioning.
• Proposing and realizing a technique that automatically partitions the code and the
data of an application between the CPU and the FPGA.
• Partitioning OpenCL kernels manually and automatically. To the best of our knowledge,
this study is the first that considers partitioning OpenCL applications automatically
using heuristic algorithms.

1.3. Research Questions
This dissertation designs and implements efficient collaborative execution techniques between
the CPU and the FPGA in a cloud-scale hybrid CPU-FPGA system (Intel’s HARP).
It also proposes and implements different strategies to improve the accuracy and/or reduce the
execution time of nature-inspired optimization algorithms such as PSO and GA. The
dissertation answers the following questions:
• What techniques/approaches can be applied to partition a workload
between the CPU and the FPGA in a heterogeneous CPU-FPGA platform?
• How can a hybrid CPU-FPGA system be utilized efficiently while meeting workload
requirements, such as execution time and energy consumption?
• How can the weaknesses/vulnerabilities of PSO and GA be addressed, in the context of
HW/SW partitioning, using machine learning and informed distributed search techniques?
• Is it possible to automatically partition the code and data of a workload, simultaneously,
between the CPU and the FPGA in a heterogeneous CPU-FPGA platform?

1.4. Dissertation Organization
The rest of this dissertation is organized as follows: Chapter 2 presents background
knowledge and a detailed classification and thematic taxonomy of the HW/SW partitioning
literature. Chapter 3 describes the methodology, set of tools, underlying hardware platform,
programming framework, and techniques that were used in this dissertation. Chapter 4
presents the design and implementation of collaborative execution techniques that achieve
concurrent utilization of the CPU and the FPGA on Intel’s HARP. Chapter 5 proposes and
implements different optimization techniques that improve the accuracy and/or execution
time of PSO and GA. Chapter 6 devises and implements a variation of the PSO algorithm that
partitions a data-parallel workload between the CPU and the FPGA at the code and the data
level simultaneously. Finally, Chapter 7 presents conclusions and future work.

CHAPTER 2
BACKGROUND AND LITERATURE REVIEW
This chapter thoroughly investigates the efforts of HW/SW partitioning. It also provides
the necessary background information and terminology, in addition to a classification of
HW/SW partitioning algorithms.

2.1. Background
This section presents the fundamental concepts of the application partitioning process, such
as application modeling and cost assignment.

2.1.1. Application Partitioning
Application partitioning refers to the process of dividing an application into N different
components C, where some components are implemented in software (CPU cores) and
the others are implemented in hardware (FPGA, ASIC, or DSP). Partitioning algorithms aim
to achieve the objective(s) of the system: optimize performance, reduce system cost, minimize
power consumption, or meet area constraints. Below is a summary of the benefits
of HW/SW co-design and application partitioning:
• Optimize and boost the performance of complicated current and emerging algorithms
and applications, such as new machine learning and bio-informatics algorithms.
• Meet system design goals and constraints, for instance cost, power, security, and area.
• Exploit the huge advances in technology, especially new hardware accelerators.
• Reduce energy consumption during the execution of applications and move toward
green computing.
Application partitioning is a major phase in HW/SW co-design in embedded systems,
HPC, and mobile cloud computing environments. HW/SW co-design consists mainly of four
phases: application modeling, software profiling and hardware cost estimation, objective
function formulation, and the HW/SW partitioning techniques. All of these phases have a
direct impact on the quality of the partitioning decision. Fig. 2.1 illustrates the HW/SW
co-design process [6].

2.1.2. Application Modeling
In order to partition an application, the application is modeled in a standard way, for
instance as a graph or a Finite State Machine (FSM) [7]. Due to the heterogeneity and wide
variety of embedded and distributed applications, many models are used to
represent these applications. The Directed Acyclic Graph (DAG) is one of the most efficient ways
to model applications [8][9][10]. In a DAG, an application is divided into tasks. Each task
is modeled as a vertex, and the control flow of the application is modeled as directed edges
connecting these vertices. Fig. 2.2 shows a DAG representation of an application that
consists of five tasks.
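
As an illustration of the DAG model, the following minimal Python sketch stores a five-task application as an adjacency list and derives an execution order that respects the directed edges. The task names and edges are hypothetical and are not taken from Fig. 2.2; the ordering routine is Kahn's algorithm, used here only as a convenient way to check that the graph is acyclic.

```python
from collections import deque

# Hypothetical five-task application; these edges are illustrative only
# and do not reproduce the graph in Fig. 2.2.
edges = {
    "T1": ["T2", "T3"],
    "T2": ["T4"],
    "T3": ["T4"],
    "T4": ["T5"],
    "T5": [],
}


def topological_order(graph):
    """Return the tasks in an order that respects every edge (Kahn's algorithm)."""
    indegree = {task: 0 for task in graph}
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1

    # Tasks with no unfinished predecessors are ready to run.
    ready = deque(task for task, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for succ in graph[task]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(order) != len(graph):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


print(topological_order(edges))
```

A partitioning algorithm would attach software and hardware costs to each vertex of such a graph before deciding where each task runs.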
A DAG can be generated with a uniform or a random distribution. There are different types
of uniformly distributed DAGs in the application partitioning literature, such as fork-join [11],
out-tree [10], and Fast Fourier Transform (FFT) [12]. Compared to randomly distributed DAGs,
uniformly distributed DAGs have a positive effect on the partitioning process when used for
application modeling [9].
Figure 2.1: HW/SW application co-design process flowchart. (Stages shown: Application Modeling (graph or state machine); SW Profiling (offline or online); HW Cost (latency, area, and power) Estimation; Modeled Application with SW/HW Cost Assigned; System Specifications (latency, area, or power); Objective Function; HW/SW Partitioning (static, semi-static, or dynamic); SW Components; HW Components.)

Many other representative models have been used in modeling embedded applications,
such as Control Flow Graph (CFG) [13], Data Flow Graph (DFG) [14]–[16], Control Data
Flow Graph (CDFG) [17], and State Transition Graph (STG) [18]. Applications are also
modeled at different levels of granularity: coarse, intermediate, or fine. In coarse-granularity
modeling, an application is modeled as a set of objects, whereas it is modeled as a set of
functions or single instructions in intermediate- and fine-granularity modeling, respectively. In
coarse-grained modeling, the resulting graphs are simpler than in fine-grained modeling.
Hence, profiling and partitioning a coarse-grained model of an application are faster and
easier than for a fine-grained model. However, the accuracy of the partitioning
decision is higher with fine-grained modeling than with coarse-grained modeling.

Figure 2.2: DAG representation of a five tasks application graph.

2.1.3. Software Profiling and Hardware Cost Estimation
To partition an application in a way that achieves the design goals, statistical information
about the application’s components needs to be collected and analyzed using a profiling tool.
This makes it possible to set proper boundaries between the software and hardware parts of
the application [19]. The proper boundary between SW and HW depends on the partitioning
goals. Profiling is the process of analyzing code. Profilers work at two levels of granularity,
high and low. High-level profilers track the main procedures in the code and hence are fast.
Low-level profilers, on the other hand, track each line in the code, so they are slow [20]. The
granularity of the profiler has a direct impact on the partitioning efficiency (the optimality
of the partitioning decision). A high-level profiler speeds up the partitioning process, but it
gives only a brief analysis of the performance bottlenecks of an application. A low-level
profiler, on the other hand, generates a detailed analysis of an application, which helps the
partitioning algorithm divide the code efficiently (in terms of solution optimality). However,
this detailed study of the code comes at the expense of partitioning speed. One example of a
profiler is the Altera SDK for OpenCL, which provides efficient profiling techniques for OpenCL
applications [21]. Estimating hardware costs such as energy consumption and logic utilization
requires different estimation tools. One hardware performance estimation tool is the Xilinx ISE
analyzer and timer [22], which estimates hardware power and area.
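
To make the notion of high-level (procedure-level) profiling concrete, the sketch below uses Python's standard cProfile and pstats modules to collect per-function statistics for a toy pipeline. It is not the OpenCL profiling flow referred to above; the functions blur, threshold, and pipeline are hypothetical stand-ins for application components.

```python
import cProfile
import io
import pstats


def blur(pixels):
    """Stand-in for a compute-heavy stage (hypothetical workload)."""
    return [sum(pixels[max(0, i - 1):i + 2]) / 3 for i in range(len(pixels))]


def threshold(pixels, level=0.5):
    """Stand-in for a cheap stage (hypothetical workload)."""
    return [1 if p > level else 0 for p in pixels]


def pipeline(pixels):
    """Two-stage toy application."""
    return threshold(blur(pixels))


def profile_report(func, *args):
    """Run func under cProfile and return the per-function report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive procedures come first.
    stats.sort_stats("cumulative").print_stats()
    return buffer.getvalue()


report = profile_report(pipeline, [i / 1000 for i in range(1000)])
print(report)
```

The report lists call counts and times per function, which is the kind of coarse information a partitioner could use to pick candidate components for hardware; a line-level profiler would give finer detail at a higher measurement cost.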

2.1.4. Objective (Cost) Function
In order to find the optimal boundaries between hardware and software, the partitioning
problem can be mathematically formulated. The mathematical formulation is called the
objective function. The objective function aims at minimizing or maximizing one or multiple
design criteria, such as power consumption or performance, given system constraints, such
as available hardware resources or total system cost.
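
A minimal sketch of one possible objective function is given below, assuming a weighted sum of execution time and energy with a hardware area budget as the constraint. The field names, weights, and cost values are hypothetical, the costs are assumed to be normalized to comparable scales, and communication cost and CPU-FPGA concurrency are ignored; this is not the formulation used later in the dissertation.

```python
def partition_cost(assignment, tasks, area_budget, w_time=0.5, w_energy=0.5):
    """Weighted-sum cost of a binary assignment (0 = software, 1 = hardware).

    Returns float('inf') when the hardware tasks exceed the area budget,
    so infeasible partitions are never preferred by a minimizer.
    """
    area = sum(t["hw_area"] for t, x in zip(tasks, assignment) if x == 1)
    if area > area_budget:
        return float("inf")

    time = sum(t["hw_time"] if x else t["sw_time"]
               for t, x in zip(tasks, assignment))
    energy = sum(t["hw_energy"] if x else t["sw_energy"]
                 for t, x in zip(tasks, assignment))
    return w_time * time + w_energy * energy


# Hypothetical, pre-normalized component costs.
tasks = [
    {"sw_time": 8.0, "hw_time": 2.0, "sw_energy": 6.0, "hw_energy": 3.0, "hw_area": 40},
    {"sw_time": 3.0, "hw_time": 2.5, "sw_energy": 2.0, "hw_energy": 2.5, "hw_area": 50},
    {"sw_time": 5.0, "hw_time": 1.0, "sw_energy": 4.0, "hw_energy": 1.0, "hw_area": 30},
]

print(partition_cost([0, 0, 0], tasks, area_budget=80))  # all software
print(partition_cost([1, 0, 1], tasks, area_budget=80))  # feasible mixed partition
print(partition_cost([1, 1, 1], tasks, area_budget=80))  # exceeds the area budget
```

Changing the weights shifts the trade-off between the two criteria, which is how a single scalar function can express a multi-objective goal.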

2.1.5. HW/SW Partitioning Techniques
An application can be partitioned in a static, semi-static, or dynamic way. In static
partitioning, an application is divided into HW and SW components at compile time. This
leaves the partitioning algorithm oblivious to the run-time behavior of the application.
A semi-static partitioning algorithm also partitions an application at compile time; however, it
exploits run-time information that is collected from previous runs. The
dynamic partitioning algorithm, on the other hand, divides the application at run time. More
details about these techniques are discussed in Sections 2.2 and 2.3.
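
The difference between static and semi-static partitioning can be sketched as follows. This is a simplified illustration under assumed inputs: each task has a hypothetical offline estimate of its software and hardware time, and the semi-static variant replaces the software estimate with the mean of measurements from earlier runs when such measurements exist. The decision rule (pick the faster side) is deliberately naive.

```python
from statistics import mean


def static_partition(estimates):
    """Compile-time decision based on offline estimates only."""
    return {task: ("HW" if est["sw"] > est["hw"] else "SW")
            for task, est in estimates.items()}


def semi_static_partition(estimates, history):
    """Compile-time decision refined with software times from previous runs."""
    refined = {}
    for task, est in estimates.items():
        measured = history.get(task)
        sw_time = mean(measured) if measured else est["sw"]
        refined[task] = {"sw": sw_time, "hw": est["hw"]}
    return static_partition(refined)


# Hypothetical offline estimates and run-time measurements.
estimates = {
    "filter": {"sw": 4.0, "hw": 1.0},
    "parse": {"sw": 2.0, "hw": 3.0},
}
history = {"parse": [5.0, 6.0, 5.5]}

print(static_partition(estimates))
print(semi_static_partition(estimates, history))
```

Here the measured history shows that "parse" is slower in software than estimated, so only the semi-static decision moves it to hardware. A dynamic scheme would repeat such a decision while the application is running.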

2.2. Taxonomy of Application Partitioning Efforts in Hardware/Software
Co-design
Most HW/SW partitioning efforts have targeted embedded and high-performance applications,
although application requirements are very different for these systems. In this section, we
classify existing HW/SW partitioning efforts based on various partitioning model attributes;
we follow a taxonomy similar to the one presented by Liu et al. for mobile cloud computing [23].
We also discuss these attributes along with examples from the literature. Then, we
review some partitioning techniques in detail.
Figure 2.3 classifies application partitioning efforts into multiple categories based on six
attributes: the partitioning model, partitioning granularity, objective function, profiling
techniques, partitioning approach, and annotation approach. The partitioning model represents
the type of algorithm(s) used for dividing applications into hardware and software components.
In general, there are three main partitioning models: exact algorithms, heuristic algorithms,
and hybrid algorithms. In the exact partitioning model, an application is represented as a
graph or a linear objective function. This model yields an optimal solution [24], [25]. However,
it is time-consuming when exploring a complex design space and thus inefficient [15], [25]–[29].
Algorithms of this type include Dynamic Programming (DP), Branch and Bound (B&B), and Integer
Linear Programming (ILP). In heuristic algorithms, the design space is explored intelligently
to find a set of near-optimal solutions [30]–[35]. In hybrid algorithms, a combination of exact
and heuristic algorithms is used to achieve an efficient partitioning in a reasonable time [16],
[36]–[38]. Table 2.1 summarizes some of the hybrid partitioning algorithms that have been
used in the literature.

Figure 2.3: Thematic taxonomy of application partitioning efforts.

Table 2.1: Hybrid HW/SW partitioning algorithms

Hybrid partitioning approach | Publication | Contribution
GA & PSO | [39] | Generates a better-quality solution than PSO and runs faster than GA, but explores the design space more slowly than PSO.
FCM & PSO | [38] | Generates a better partitioning solution in a shorter time than PSO or FCM alone. Applicable to both binary and extended partitioning, but does not consider the communication cost between HW and SW.
TS & PSO | [39] | Combines the parallel nature of PSO with the memory feature of TS to reduce the PSO run-time for large graphs.
BB & PSO | [40] | Exploits the efficiency of PSO to speed up the partitioning process; BB alone generates a more accurate partitioning decision than the proposed algorithm.
FEO & PSO | [41] | The authors proposed a conformist PSO (CPSO) to avoid trapping in a local minimum and to enhance search diversity. To improve the quality of the CPSO output, they combined CPSO with fireworks explosion operations (FEO), which stimulate the swarm to traverse disparate regions looking for an optimal solution.
GA & TS | [42] | The authors used TS as a local search technique within GA. This combination of TS and GA is one variation of the Memetic Algorithm (MA). They demonstrate the robustness of the algorithm and its ability to generate a better-quality solution at the expense of execution time.

The granularity level indicates the level for partitioning compute-intensive applications.
In particular, there are various granularity levels for application partitioning, which include
the following [23]:

• Component-level partitioning: partitioning occurs at the level of a group of classes.
• Module-level partitioning: partitioning occurs at the level of the entire application.
• Sub-process level partitioning: partitioning occurs at the level of application methods.
• Task-level partitioning: partitioning is at the level of the tasks.
• Thread-level partitioning: partitioning occurs at the level of application threads.
The partitioning objective indicates the objective function for application partitioning.
Partitioning objectives include one or more of the following:
• Improving performance: There are various measurements for performance improvement. These measurements include throughput [43], execution time [29], [35], [41],
[44]–[48], and algorithm space and time complexity.
• Saving energy: Reducing energy consumption is a crucial demand for embedded systems
and is very important for all other systems as well. Moreover, it is attracting more
attention due to many factors, such as the big data deluge and the global energy problem
[49]–[52].
• Improving application scalability: Many emerging compute/data-intensive applications
are infeasible without code/data partitioning and distribution [53], [54].
• Improving the utilization of system resources.
The partitioning approach is the technique used for extracting and revealing any dependencies
among application components. Partitioning approaches can be static at compile time
(off-line) [45], [49], [55], [56], semi-static, exploiting both off-line and on-line analysis
of an application [57], [58], or dynamic at run-time (on-line) [59]–[61]. The annotation
attribute indicates meta-data that is added to the source code to aid partitioning [23].
Annotation can be done either manually by the programmer [43], [53], or automatically by the
compiler [45], [49], [55], [56], [62].

2.3. Review and Discussion of Partitioning Techniques Based on the
Partitioning Approach
In this section we review and discuss partitioning algorithms based on the partitioning
approach and partitioning model.

2.3.1. Partitioning Efforts Based on the Partitioning Approach and Partitioning
Model
Traditionally, applications have been manually partitioned into HW and SW components [63]–[67].
However, the growing complexity of embedded and HPC systems leads to an exponential increase
in the design space, which makes manual partitioning challenging. For this reason, automatic
approaches have attracted system designers [44], [45], [56], [68]–[70]. In general, there are
three main approaches to automated HW/SW partitioning: static, semi-static, and dynamic. In
static partitioning, an application is divided at compile time, which leaves this kind of
partitioning oblivious to the run-time behavior of the partitioned application. Most existing
partitioning approaches are classified as static because they are simpler to implement than
dynamic partitioning. Dynamic partitioning, on the other hand, divides a binary file on the
fly, at run-time. Hence, it is able to capture behaviors that are hidden at compile time, at
the expense of partitioning complexity [71]–[74]. Semi-static partitioning divides an
application off-line, exploiting on-line information and analysis from previous runs.

Static HW/SW Partitioning

Static partitioning uses the Worst Case Execution Time (WCET) scenario to meet design
goals and specification. Static HW/SW partitioning algorithms are classified as exact or
heuristic algorithms. In the exact approach, partitioning algorithms iterate until reaching
the optimal solution in the design space. However, these algorithms are cumbersome for
complex systems and efficient only for small graphs [38]. This is due to the fact that HW/SW
partitioning is an NP-hard problem [75], [76]. Moreover, finding the optimal solution is
time-consuming for a large design space. The exact (optimal solution) algorithms include integer
programming, dynamic programming, and branch and bound.
Integer and linear programming are efficient optimization tools for small to medium size
systems. Using these techniques, HW/SW partitioning is mathematically formulated as an
objective function with a set of design constraints that ensure feasibility of the solution [77].
Design goals and system specifications determine the objective function of a system.
Dynamic programming can produce an optimal partitioning of code components. Knudsen and
Madsen [78] proposed a dynamic programming algorithm that minimizes the execution time of a
system given hardware area as a constraint. The algorithm can also minimize hardware area
given execution time as a constraint, with a time complexity of O(A·N^2) and a space
complexity of O(A·N), where N is the number of code partitions and A is the hardware area.
However, the time complexity of this algorithm is its main drawback. Hence, Wu and
Srikanthan [79] proposed a lower-complexity dynamic programming algorithm with a time
complexity of O(A·N). Branch and bound is another global optimization algorithm. It gives a
provably optimal solution to the partitioning problem [80].
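To illustrate the idea behind dynamic-programming-based partitioning, the sketch below solves
a simplified version of the problem (minimize total execution time under an area constraint)
as a textbook 0/1 knapsack. It is not the algorithm of [78] or [79]; it assumes independent
code blocks, integer area units, and no communication cost, and all names and numbers are
hypothetical.

```python
def dp_partition(sw_time, hw_time, hw_area, area_budget):
    """Return (minimum total time, sorted list of blocks mapped to hardware).

    Simplified 0/1-knapsack formulation: each block is either in SW or in HW,
    and the summed area of the HW blocks must not exceed area_budget.
    """
    # best[a] = (largest time saving with at most `a` area units, blocks chosen for HW)
    best = [(0.0, frozenset())] * (area_budget + 1)
    for i in range(len(sw_time)):
        gain = sw_time[i] - hw_time[i]
        if gain <= 0:
            continue  # moving this block to hardware would not help
        # Iterate the area downwards so that each block is used at most once.
        for a in range(area_budget, hw_area[i] - 1, -1):
            candidate = best[a - hw_area[i]][0] + gain
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - hw_area[i]][1] | {i})
    saved, hw_blocks = best[area_budget]
    return sum(sw_time) - saved, sorted(hw_blocks)


# Illustrative data (made-up numbers): three code blocks and an area budget of 80 units.
result = dp_partition(
    sw_time=[10.0, 6.0, 4.0],
    hw_time=[2.0, 3.0, 3.5],
    hw_area=[40, 30, 50],
    area_budget=80,
)
```

The table has one entry per area unit and is updated once per block, so the number of table
updates grows with A·N in this simplified setting.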
Nonetheless, the hardness of the partitioning process in the context of modern embedded
and HPC systems gives heuristic approaches a clear advantage over exact approaches.
Heuristic algorithms are faster than exact ones as a result of their informed and intelligent
exploration of the design space. However, they produce only a near-optimal solution. Heuristic
algorithms are divided into iterative and constructive algorithms. Most iterative heuristic
algorithms start with a random point in the design space and iterate until reaching a
near-optimal solution. These algorithms include Simulated Annealing (SA) [24], [81], Particle
Swarm Optimization (PSO) [30]–[33], Genetic Algorithms (GA) [34], [35], Hill Climbing
(HC) [82], Ant Colony Optimization (ACO) [83], Tabu Search (TS) [14], [24], Fuzzy Logic
(FL) [84] and Artificial Bee Colony (ABC) [85]. Constructive heuristic algorithms, on the
other hand, start with a partial solution and keep adding components to it, and sometimes
removing components, until reaching a stop criterion; examples include greedy algorithms and
hierarchical clustering. Heuristic partitioning techniques nevertheless have limitations; for
example, the execution time of GA increases dramatically as the design space grows. As
such, combining multiple heuristic techniques to overcome these limitations is highly
recommended.
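As a simple illustration of a constructive heuristic, the sketch below builds a partition
greedily: it starts with everything in software and moves code blocks to hardware in
descending order of time saved per area unit until no further block fits. The ranking rule,
the function name, and the numbers are assumptions chosen for illustration and do not
reproduce any of the cited algorithms.

```python
def greedy_partition(sw_time, hw_time, hw_area, area_budget):
    """Constructive heuristic: add blocks to HW by descending time saved per area unit."""
    n = len(sw_time)
    # Only blocks that actually run faster in hardware are candidates.
    candidates = [i for i in range(n) if sw_time[i] > hw_time[i]]
    candidates.sort(
        key=lambda i: (sw_time[i] - hw_time[i]) / max(hw_area[i], 1),
        reverse=True,
    )
    hw_blocks, used_area = [], 0
    for i in candidates:
        if used_area + hw_area[i] <= area_budget:  # block still fits in the fabric
            hw_blocks.append(i)
            used_area += hw_area[i]
    total_time = sum(hw_time[i] if i in hw_blocks else sw_time[i] for i in range(n))
    return total_time, sorted(hw_blocks)


# Illustrative data (made-up numbers). Block 0 has the best saving per area unit, so the
# greedy rule picks it first; blocks 1 and 2 then no longer fit in the 80-unit budget.
greedy_result = greedy_partition(
    sw_time=[10, 9, 9],
    hw_time=[1, 4, 4],
    hw_area=[60, 40, 40],
    area_budget=80,
)
```

In this example the greedy partition takes 19 time units, while moving blocks 1 and 2 to
hardware instead (area 80) would take 18. This shows the near-optimal, rather than optimal,
nature of heuristic solutions.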
Hybrid partitioning algorithms combine exact and heuristic algorithms, or multiple heuristic
algorithms. They aim at improving the efficiency of the partitioning process and the optimality
of the solution within a reasonable time. Although the main objective of these combinations
differs from one study to another, many hybrid partitioning algorithms have been proposed [16],
[36], [37]. For instance, Jiang et al. combined GA and SA to improve the optimality of the
generated solution, taking into consideration cost and delay constraints [36]. Combining GA
and PSO produced a better solution in less time than either GA or PSO alone [37]. The algorithm
exploits the crossover operation of GA to control the position and speed of particles. Li et
al. [16] proposed a hybrid algorithm that combines GA and TS (GATS). They demonstrate the
ability of GATS to improve performance by generating an optimal (shortest-time) assignment of
application components to hardware and software. Fuzzy C-means (FCM) and PSO were combined
(FCMPSO) to improve the quality of the solution in a shorter time by applying FCM to swarm
particles [38]. Liu et al. [9] combined SA and TS. The authors update the TS tables using the
annealing process in order to speed up table updates. TS has two tables for storing local and
global best candidates in the design space. The combination of SA and TS achieves superior
performance compared with both SA and TS for HW/SW partitioning.
Table 2.2 compares different HW/SW partitioning algorithms in terms of their speed, accuracy,
space complexity, and ease of implementation. This comparison is based on combining
results from several research studies [24], [36]. It aims at providing an approximate
differentiation among the optimization algorithms based on the aforementioned
metrics. For instance, Eles et al. [24] showed through experimental results that TS outperforms
SA. This appears in the table as (S++) for TS and (S+) for SA under the speed metric. On the
other hand, Jiang et al. [36] demonstrated that combining GA and SA yields a tangible
improvement over using GA only. Hence, we assigned GA (S-) to indicate
the significant gap between GA and SA. The same procedure was followed to assign the
values of the different metrics for all the algorithms in the table. Below is a more detailed
discussion of different partitioning efforts.

Table 2.2: A comparison of HW/SW partitioning algorithms.

Partitioning Algorithm | Publications | Speed | Accuracy | Space Complexity | Ease of Implementation
Dynamic Programming | [78]–[80] | S- | A++ | C+ | E
Branch and Bound | [80] | S | A++ | C-- | E
Integer Linear Programming | [15], [25]–[29] | S- | A++ | C- | E
Simulated Annealing | [24], [81] | S+ | A+ | C+ | E
Particle Swarm Optimization | [30]–[33] | S++ | A+ | C | E+
Ant Colony Optimization | [83] | S+ | A | C++ | E
Genetic Algorithms | [34], [35] | S- | A | C++ | E
Hill Climbing | [82] | S+ | A- | C+ | E
Tabu Search | [14], [24] | S++ | A+ | C++ | E
Fuzzy Logic | [84] | S+ | A- | C++ | E+
Artificial Bee Colony | [85] | S++ | A- | C+ | E+

Note: [feature X]++ is better than [feature X]+, which is better than [feature X], which is
better than [feature X]-, which is better than [feature X]--, where S=Speed, A=Accuracy,
C=Space Complexity, E=Ease of Implementation.

Dynamic HW/SW Partitioning

Dynamic partitioning offloads critical code segments of a binary file to a configurable
logic fabric. It achieves a better partitioning solution than static partitioning in terms
of performance and energy. Stitt et al. [60], [86] demonstrated the feasibility of dynamic
HW/SW partitioning on a miniature configurable fabric through de-compilation, compiler
optimization, synthesis (behavioral and logic), and place and route of a binary file. The
results showed an average speedup of 2.6 for five benchmarks from suites such as PowerStone
and NetBench [60]. However, the place and route phase was a performance bottleneck in their
proposal, because this phase takes minutes or hours to complete, while dynamic partitioning
requires it to finish in seconds. To tackle this problem, the authors proposed a configurable
logic architecture with place and route algorithms dedicated to dynamic partitioning [61]. The
proposed architecture and algorithms aim at boosting the performance of embedded applications
by speeding up critical loops. The results showed an average speedup of 2.1 and a 33%
power saving. In addition, the place and route phase is 50X faster than commercial
tools, with 1000X less data and code memory.

2.3.2. Detailed Discussion of Some Existing Partitioning Efforts
Brogioli et al. [55] designed and implemented a DSP/FPGA architecture for optimizing
signal processing workloads. They target user detection and channel equalization in 3.5G
wireless mobile receivers that support High-Speed Downlink Packet Access (HSDPA) data
rates. In addition, the system aims at achieving superior performance and meeting real-time
requirements for mobile devices. They used a defined set of criteria to partition the workload
between the DSP and the FPGA, for instance, data- and task-level parallelism, the
computational complexity of applications, and the spatial locality of data. Moreover, the
authors used an extensive DSP/FPGA hardware simulation to identify whether a task should be
executed in software (DSP) or hardware (FPGA). The performance enhancement exceeds
90% when compared to traditional programmable DSP-based architectures.
A temporal partitioning algorithm for complex and compute-intensive image processing
applications was proposed in [87]. The algorithm increases the level of parallelism among the
various tasks of an application by increasing the effective area. The effective area increase
is a direct consequence of temporal partitioning. In this approach, each application is
represented as a set of sequential tasks. Moreover, each task is associated with a set of
parameters, for instance, task parallelism, task area, execution time, and computational
complexity. Sets of independent tasks were implemented on a single FPGA simultaneously. Then,
the next sequential set of tasks was implemented on the same FPGA after hardware
reconfiguration. Despite the reduction in the overall execution time, the hardware
reconfiguration time became a new bottleneck.
Jiang et al. [53] tackled the problem of multi-field packet classification. This task is
challenging for network routers, especially in the presence of vast Internet traffic and
diverse value-added services. As Internet traffic grows dramatically, a new trend of
implementing classification algorithms in hardware is emerging. However, the huge
memory demand of these algorithms limits efficient hardware implementation. In order to
implement a high-speed hardware classifier, the authors proposed a coarse-grained independent
sets classifier, combined with the cross-product scheme.
This partitioning reduces the memory requirement of the classifier. They implemented the
classifier on a single Xilinx Virtex-5 FPGA. The architecture was able to store 10K five-field
rules and achieve a throughput of 90 Gbps for a packet size of 40 bytes.
Busonera et al. [44] proposed a framework to optimize the code partitioning process between
the CPU and the FPGA. The framework takes C code as input and produces modified
C code to be executed on the CPU and a bitstream to configure the FPGA. It starts by
executing high-level transformations, for instance loop unrolling, through a compiler. Then,
it maximizes the size of the basic blocks and represents them using a DAG. A cluster
identification algorithm takes the DAG and extracts possible subgraphs that could be mapped to
the FPGA. Then, for each identified cluster, the software latency, which is the time needed to
execute the cluster on the CPU, is estimated. In addition, the hardware latency, which
represents the time required to execute the cluster on hardware, is calculated by obtaining a
Verilog description of the cluster using a commercial synthesis tool. Hardware latency is
estimated as the sum of the delays for each operation of the cluster [45]. After comparing the
estimated software latency and the calculated hardware latency, the code is partitioned, and
then the speedup is calculated. Using a synthesis tool to calculate the hardware latency,
instead of estimating it, makes the partitioning decision more accurate and also improves
execution time [44]. However, this selection accuracy comes at the expense of the speed of
selecting between hardware and software implementations. This framework achieves
up to 250% improvement in performance compared to an estimation-based approach.
Zhang et al. [43] proposed a new design to improve the throughput and speedup of
sorting large data sets on a CPU-FPGA heterogeneous platform. They used the Intel Quick Path
Interconnect (QPI) as a point-to-point interconnection between the CPU and the FPGA. They
developed a hybrid sorting algorithm optimized for the heterogeneous CPU-FPGA platform.
The CPU acts as system orchestrator and task dispatcher, while the FPGA is used to
accelerate the compute-intensive part. Simultaneous execution of the algorithm on both the
CPU and the FPGA is achieved through a divide-and-conquer strategy that exploits task-level
parallelism. In order to overlap execution, a large data set is divided into subsets that fit
into the FPGA’s on-chip memory. These data chunks are stored in a memory shared between the
CPU and the FPGA. Then, the FPGA reads the data chunks, sorts them using the quick-sort
algorithm, and writes them back to the shared memory. Once a newly sorted chunk is written back
to the memory, the CPU is triggered to merge it with the previously sorted chunks. The
throughput and performance of this design are compared with CPU-only and FPGA-only
baselines. The results show that this design outperforms the two baselines in terms of
throughput and resource utilization. In addition, it achieves 2.3X the throughput of a
multi-core implementation. The main disadvantage of such a design is the overhead of overlapped
execution in terms of consumed power, which is not measured in the paper. This overhead arises
because the data is divided into smaller data sets, and these small data sets are further
divided into smaller chunks to be sorted in parallel on the FPGA.
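The chunk-and-merge flow described above can be sketched in Python as follows. The function
fpga_sort_chunk is a hypothetical stand-in for the FPGA kernel, the chunk size is an arbitrary
assumption, and the sketch runs sequentially, whereas the design in [43] overlaps FPGA sorting
with CPU merging.

```python
import heapq


def fpga_sort_chunk(chunk):
    """Hypothetical stand-in for the FPGA kernel that sorts one chunk."""
    return sorted(chunk)


def hybrid_sort(data, chunk_size):
    """Split data into chunks, sort each chunk, and merge it into the running result."""
    merged = []
    for start in range(0, len(data), chunk_size):
        # "FPGA" side: sort a chunk small enough to fit the (assumed) on-chip memory.
        sorted_chunk = fpga_sort_chunk(data[start:start + chunk_size])
        # "CPU" side: merge the newly sorted chunk with the previously merged chunks.
        merged = list(heapq.merge(merged, sorted_chunk))
    return merged


sorted_data = hybrid_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)
```

In a real overlapped design, the merge of chunk i would run on the CPU while the FPGA sorts
chunk i+1; the sequential loop here only shows the data flow.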
Kelm et al. [56] proposed a methodology and a platform to efficiently exploit data-parallel
accelerators through partitioning. CUBA Infrastructure Guided Application Remapping
(CIGAR) allows software developers to identify data-parallel parts of their applications,
through a dynamic application profiling tool, and port them to the data-parallel accelerator.
Moreover, it determines which of the application’s data structures should be hosted on
the accelerator and also performs debugging to verify the correctness of the partitioning.
CIGAR runs a full version of the Linux operating system on a CPU/FPGA platform. For guidance
and correctness purposes, CIGAR uses integer benchmarks during the development
and verification stages. However, the main concern of the platform is the correctness of the
partitioning process, leaving performance and power consumption issues open. Further,
the development process is guided by only three integer benchmarks, which is not enough in
terms of application variety and coverage.
Heuristic automatic partitioning schemes for embedded systems, based on simulated
annealing and tabu search, were evaluated and compared in [24]. The authors defined the cost
function as the communication cost during partitioning. In addition, they adopted a
coarse-grained partitioning granularity at the level of subprograms and processes, so that a
minimum communication cost is guaranteed. The simulated annealing optimization algorithm was
preferred over other well-known automatic partitioning algorithms used at the time, due
to its ability to avoid being trapped in a local optimum: the algorithm accepts a worse
solution, especially in the initial iterations, in the hope of finding the global optimum.
However, SA needs a large number of experiments and a long execution time to reach the optimum.
Tabu search, on the other hand, uses two data structures to keep track of the visited
solutions while exploring the solution space. The first is a short-term memory, which
stores information about recently visited solutions. The second is a long-term memory, which
stores information about non-neighbor solutions. Through extensive simulation based
on random/geometric graphs, the authors found that automatic partitioning based on tabu
search outperforms automatic partitioning based on simulated annealing.
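The acceptance rule that lets simulated annealing escape local optima can be sketched as
follows. This is a generic textbook version, not the implementation evaluated in [24]: the
toy cost function (execution time with an area penalty instead of communication cost), the
cooling schedule, and all numeric values are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical task data: software time, hardware time, and hardware area per task.
SW_TIME = [10.0, 6.0, 4.0]
HW_TIME = [2.0, 3.0, 3.5]
HW_AREA = [40, 30, 50]
AREA_BUDGET = 80


def partition_cost(mapping):
    """Total execution time; partitions over the area budget are heavily penalized."""
    time = sum(HW_TIME[i] if m else SW_TIME[i] for i, m in enumerate(mapping))
    area = sum(HW_AREA[i] for i, m in enumerate(mapping) if m)
    return time + (1e9 if area > AREA_BUDGET else 0.0)


def simulated_annealing(cost, n, t_start=10.0, t_end=0.01, alpha=0.95, moves=50, seed=0):
    """Minimize cost over binary mappings of length n (1 = hardware, 0 = software)."""
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n)]  # random starting point
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves):
            neighbor = current[:]
            neighbor[rng.randrange(n)] ^= 1          # move one task to the other side
            delta = cost(neighbor) - current_cost
            # Always accept improvements; accept a worse solution with probability
            # exp(-delta / t), which is high at first and shrinks as t cools down.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = neighbor, current_cost + delta
                current_cost = cost(current)         # recompute to avoid float drift
                if current_cost < best_cost:
                    best, best_cost = current[:], current_cost
        t *= alpha                                   # geometric cooling schedule
    return best, best_cost


best, best_cost = simulated_annealing(partition_cost, len(SW_TIME))
```

A tabu search variant would replace the probabilistic acceptance with a memory of recently
visited solutions that may not be revisited for a number of iterations.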
Few existing partitioning algorithms take into consideration the communication cost between
hardware and software during the partitioning process. One example is the Phase Greedy
Meta-heuristic Algorithm (PGMA) [88], which models the partitioning process as a
multi-objective optimization problem. The algorithm aims at optimizing various system
attributes, such as area utilization, execution time, power consumption, and memory usage.
Superior efficiency over existing algorithms was demonstrated by partitioning a
Joint Photographic Experts Group (JPEG) encoder with 1000 blocks.

2.3.3. Multi-objective Partitioning Algorithms
As HW/SW partitioning is a system design optimization problem, it can target different
design goals, such as reducing power consumption, minimizing execution time, minimizing
system area, and reducing the overall system cost. HW/SW partitioning algorithms can
be classified as single-objective or multi-objective (Pareto) optimization processes. Focusing
on a single objective function simplifies the co-design process (partitioning and scheduling).
Moreover, it avoids being trapped in optimizing conflicting objectives and devising complicated
partitioning algorithms [29], [35], [41], [44]–[48]. On the other hand, most emerging
computing models require optimizing multiple metrics simultaneously. For instance,
high-performance embedded systems, cluster computing, and cloud computing demand high
performance from their applications while keeping operational expenditures (such as power)
within a budget to guarantee acceptable revenues [89].
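To make the notion of multi-objective (Pareto) optimization concrete, the short sketch below
filters a set of candidate partitions down to its Pareto front, that is, the candidates that
no other candidate beats on every objective. The two objectives (execution time and power)
and the candidate values are made-up assumptions.

```python
def pareto_front(points):
    """Return the points that are not dominated by any other point (all objectives minimized)."""

    def dominates(p, q):
        # p dominates q if p is no worse in every objective and strictly better in at least one.
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidate partitions as (execution time, power) pairs.
candidates = [(9.0, 14.0), (12.0, 9.0), (20.0, 5.0), (13.0, 15.0), (12.0, 10.0)]
front = pareto_front(candidates)
```

Here (13.0, 15.0) is dominated by (9.0, 14.0) and (12.0, 10.0) is dominated by (12.0, 9.0), so
three candidates remain. Choosing among them requires an additional preference, such as a
power budget.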
The contributions of this dissertation that distinguish it from previous studies are
summarized as follows:
• Target a cloud-scale heterogeneous CPU-FPGA platform, and design and implement
cooperative partitioning techniques, namely code partitioning and data partitioning,
to improve applications’ performance, energy consumption, and system utilization.
• Partition OpenCL-based cloud applications using heuristic algorithms such as PSO
and GA, in contrast to previous studies, which have used random graphs instead of real
applications.
• Optimize PSO’s parameters in the context of the HW/SW partitioning problem using a
machine learning-based approach. Although the accuracy and the execution time of
PSO depend to a large extent on its parameters, previous studies have used
rule-of-thumb values.
• Tackle the premature convergence problem of some heuristic optimization algorithms,
such as GA and PSO, by extending these algorithms with a local search technique.
• Partition an OpenCL-based application at the code and the data levels. This is realized
by duplicating some parts of the application on the CPU and the FPGA. To the best
of our knowledge, this is the first study that proposes and implements partitioning at
both the code and the data levels.

CHAPTER 3
METHODOLOGY
This chapter discusses the techniques and tools that were utilized in this research. It
also discusses methods and approaches followed to tackle the problem and achieve the study
objectives.

3.1. Hardware and Software Tools
This section describes the hardware we use, and the required Software Development Kit
(SDK).

3.1.1. Hardware Acceleration Research Program (HARP)
In 2015, Intel introduced the Hardware Acceleration Research Program and infrastructure
(HARP v1) to stimulate research in different fields, such as programming tools and
accelerator-based computing systems [90]. The HARP platform, consisting of several nodes, is a
data center product that is designed to boost the performance of emerging applications in an
energy-efficient manner. Each node in HARP consists of an Intel Xeon CPU integrated in
a multi-chip package with an Altera Stratix V FPGA. The closely coupled CPU and FPGA
communicate through a single Quick Path Interconnect (QPI) bus and share a dynamic random
access memory (DRAM) [91]. The FPGA accesses memory through the CPU. QPI
was developed earlier by Intel to connect processors in a Non-Uniform Memory Access
(NUMA) architecture [92].
In 2017, Intel announced the second generation of the HARP program (HARP v2), which
consists of an Intel Broadwell Xeon CPU (14 cores) integrated with an Intel Arria 10
GX 1150 FPGA into a Multi-Chip Package (MCP).
In HARP v2, the CPU and the FPGA are connected through two PCIe Gen3 x8 links and
one QPI link. Both of these interfaces have separate read and write
channels. PCIe Gen3 x8 has a transfer speed of 8 GT/s, while QPI is theoretically capable
of achieving 6.4 GT/s [93]. The traffic between the CPU and the FPGA is distributed
over these channels based on link utilization, using a Virtual Channel (VC) steering unit.
Figure 3.1 illustrates the HARP system architecture.

[Figure 3.1 diagram labels: Multi-Chip Package (MCP); integrated FPGA (Arria 10 GX 1150) with
AFU, FIU, VC steering logic, and FPGA cache; QPI and PCIe Gen3 x8 links; Intel Broadwell Xeon
with CPU cores and LLC; controller; DDR RAM.]

Figure 3.1: HARP system architecture ([91]).

A bitstream file is composed of an FPGA Interface Unit (FIU) and an Accelerator Functional
Unit (AFU) [94]. The FIU contains Intel-provided Intellectual Property (IP) and
implements I/O control, power, and temperature monitoring. The FIU has at least one
interface to the AFU. The AFU is programmer-defined FPGA code that determines the
functionality of the FPGA [95]. Developers can implement more than one AFU on the FPGA
to meet their performance and power goals.
In this heterogeneous platform, the CPU and the FPGA share an address space in the
global memory. The FPGA contains a cache to mitigate the latency of accessing the shared
memory. The cache is connected only to the QPI interconnect; it is 64 KB in size with a
cache line of 64 bytes. On an FPGA read that misses in this cache, the FPGA reads
from the CPU’s Last Level Cache (LLC), which is 35 MB in size [91], [96]. With shared
memory, no redundant copies of input images are created, as opposed to other designs,
where the device (or FPGA) has its own global memory. In addition, in a shared memory
implementation, the FPGA can handle larger data sets, since the shared memory is larger
than the global memory of an FPGA device in a non-shared memory environment. However,
memory contention between the CPU and the FPGA might occur.
Table 3.1: HARP HW/SW configurations.

CPU configurations
Host CPU Model | Intel Broadwell Xeon
CPU Frequency | 2.1 GHz
LLC | 35 MB
FPGA Fabric | Arria 10 GX 1150
DRAM | 95 GB

FPGA configurations
Adaptive logic modules | 427,200
Logic elements | 1,150,000
Registers | 1,708,800
Memory | 65.5 kib
DSP | 1,518

3.1.2. Open Computing Language (OpenCL)
OpenCL is an open, royalty-free, unified programming model for accelerating algorithms
on heterogeneous systems [97]. OpenCL allows programmers to write their code for
various platforms such as CPUs, GPUs, DSPs, and FPGAs. The Intel FPGA SDK for OpenCL
provides vendor-extension Application Programming Interfaces (APIs). An OpenCL program
consists of two main files: the host code file, which is a C/C++ file, and the kernel code
file, which is an OpenCL file (*.cl). The host code is executed on the CPU and is responsible
for dispatching the kernels to the guest devices. The kernel code is executed on the OpenCL
devices; it is usually compute-intensive code. The host code is compiled using traditional
C/C++ compilers to generate the executable file. The kernel, on the other hand, is compiled
using the Altera Offline Compiler (AOC) to generate a bitstream file. Figure 3.2 shows the
process of compiling an OpenCL application.

Figure 3.2: OpenCL program compilation flow.

3.2. Cloud Applications
In cloud environments, there is a vast range of application domains that can benefit from
FPGA logic. These domains include machine learning, data compression/encryption, and
image/video processing [98]. In this study, we focus on three widely used applications from
the domains of machine learning, image processing, and data security. Below is a description
of the three benchmarks that were developed and used in this dissertation.
• k-means Clustering:
k-means is an unsupervised clustering algorithm that takes a data set in which each data
point consists of N-dimensional observations. The algorithm divides the data set into
k clusters according to a similarity criterion such as Euclidean distance. k-means has
been widely used as a clustering algorithm in many fields, for instance,
machine learning and image processing. The algorithm is compute-intensive and
can benefit from HW/SW acceleration. The algorithm consists mainly of the
following steps:
– Initialize cluster centers randomly.
– Assign each data point in the data set to the closest cluster.
– Update the centers of the clusters based on the new data points assigned to each
cluster.
• Advanced Encryption Standard (AES ):
Sharing and virtualization of cloud resources have enabled a tremendous number of
customers to access a vast range of cloud services. However, this sharing has posed
a security challenge for Cloud Service Providers (CSPs) [99]. AES is a symmetric

30

block cipher encryption algorithm, that is extensively used in the cloud. The algorithm consists of many rounds of transformations that are performed on the stored
data [100] [101]. AES uses different key sizes, which are 128, 192, and 256 bits [102].
A different number of rounds is used with each key length. The algorithm consists
mainly of two stages, the encryption stage, and the decryption stage. The encryption
stage converts the input message into ciphertext, while the decryption process converts
the ciphertext back into the input message. The encryption stage is summarized as
follows:
– Byte substitution: In this step, each byte of the 16-byte data block is substituted using a
Look-Up Table (LUT).
– Shift rows: A circular shift is performed on the rows of the encryption matrix.
The second row is shifted one byte to the left. The third row is shifted two bytes
to the left. The last row is shifted three bytes to the left.
– Mix columns: A mathematical function is used to transform all the bytes in each
column. The new matrix is completely different from the input matrix.
Column mixing is not performed in the last encryption round.
– Add round key: The 128-bit round key is XORed with the input matrix. The
result of the last round is the ciphertext.
The decryption process of AES applies the encryption steps in reverse order. Each
round in the decryption process consists of add round key, mix columns, shift rows,
and byte substitution, with each step replaced by its inverse transformation.
• Canny Edge Detection Algorithm:
Visual social media is exporting billions of images to the cloud on a daily basis [103].
Image processing algorithms that are used to process these images are data- and
compute-intensive. One of these algorithms is the Canny edge detection algorithm. The
algorithm consists of four main stages. We proposed and implemented OpenCL-based
hardware accelerators of the Canny algorithm on HARP [104].
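
As a minimal sketch of the three k-means steps listed above (initialize, assign, update), the following pure-Python function uses squared Euclidean distance as the similarity criterion. The function name `kmeans`, the fixed iteration count, the seeded initialization from existing data points, and the empty-cluster handling are illustrative assumptions, not the benchmark implementation used in this dissertation.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Cluster N-dimensional points into k clusters (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: initialize cluster centers randomly (assumed: k distinct data points).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 2: assign each point to the closest center (squared Euclidean distance).
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[idx] = dists.index(min(dists))
        # Step 3: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumed policy: keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Hypothetical data: two well-separated groups of 2-dimensional points.
centers, labels = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

The assignment loop is the data-parallel part of the algorithm, since each point's distance computation is independent of the others.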

3.3. Partitioning Applications in a Heterogeneous CPU-FPGA System
Workload partitioning between the CPU and the FPGA happens at two main levels:
data level and code level. The choice between these levels depends on the nature of the
workload. For instance, image processing applications could be partitioned at the data or code
level. Other workloads, such as the k-means algorithm, could not be partitioned at the data level due
to data dependency.

3.3.1. HW/SW Partitioning Problem Formulation
Automatic HW/SW partitioning is an NP-hard optimization problem [15]. There is
no standard formalization approach to represent the problem. Mathematically, HW/SW
partitioning could be formulated as a single objective function or a multi-objective function.
Either way, restrictions on other objectives can be applied.
Control Flow Graph (CFG)
The partitioning process starts with modeling an application as a graph G = (V, E), where
V is a set of vertices and E is a set of edges that connect these vertices. The vertex set is
V = {v1, v2, ..., vN}, where N is the number of nodes in the application graph. For each
vertex/node, a SW cost and a HW cost are assigned and used to evaluate the cost of the
node using an objective function.

Objective function
The objective function is a mathematical formula that describes the output of the system.
It could be either a minimization or a maximization problem. The objective function could
also be a Single Objective Optimization (SOO) or a Multi-Objective Optimization (MOO).
In SOO, as the name suggests, there is one goal that we try to optimize. However, in MOO,
the objective function has more than a single objective to optimize. Adding more objectives
increases the complexity of the optimization problem. In addition, these objectives are conflicting and a trade-off has to be made. We illustrate different objective functions starting
from a constrained single objective function to a multi-objective formulation. For instance,
formulation 3.1 minimizes the execution time f_1(x) constrained to other single objective
functions such as power consumption and HW area. x_i in the formulation indicates whether vertex
v_i is assigned to SW (x_i = 0) or HW (x_i = 1). sw_i and hw_i represent the cost of a vertex/node (i) on software and hardware respectively, and M represents the number of constraints.

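\[
\begin{aligned}
\min_{x} \quad & f_1(x) = \sum_{i=1}^{N} \bigl( sw_i (1 - x_i) + x_i \, hw_i \bigr) \\
\text{s.t.} \quad & \forall f_i(x) \in (lb_i, ub_i), \quad \forall i \in [1, M] \\
& x_j \in [0, 1], \quad \forall j \in N
\end{aligned}
\tag{3.1}
\]
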
Another constrained single objective function, f_2(x), which aims to minimize HW area, is
illustrated in formulation 3.2. It is constrained to other objective functions such as execution
time and power consumption.

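\[
\begin{aligned}
\min_{x} \quad & f_2(x) = \sum_{i=1}^{N} x_i \, hw_i \\
\text{s.t.} \quad & \forall f_i(x) \in (lb_i, ub_i), \quad \forall i \in [1, M] \\
& x_j \in [0, 1], \quad \forall j \in N
\end{aligned}
\tag{3.2}
\]
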
We extended our partitioning using a MOO function as shown in equation 3.3. This MOO
formulation aims at optimizing execution time, energy and area simultaneously.

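\[
\begin{aligned}
\min_{x} \quad & F(x) = w_1 f_1(x) + w_2 f_2(x) + \dots + w_M f_M(x) \\
\text{s.t.} \quad & f_i(x) \le b_i, \quad \forall i \in M \\
& x_j \in [0, 1], \quad \forall j \in N
\end{aligned}
\tag{3.3}
\]
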
Each of these single objective functions (f_1(x) to f_M(x)) represents one cost criterion such
as execution time, energy consumption, or resource utilization. We set w_1 = 1, w_2 = 0.6,
and w_3 = 0.3. These values could be varied depending on the designer’s
preferences. We gave the execution time a higher weight because it is more important than
the energy and the HW area in the cloud environment.
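
To make formulations 3.1 to 3.3 concrete, the sketch below evaluates them for one candidate partition vector. The function names, the per-node cost values, and the bounds are hypothetical and chosen only for illustration. The constraint check follows the `f_i(x) <= b_i` form of equation 3.3.

```python
def exec_time(x, sw, hw):
    """f_1 in formulation 3.1: node i costs sw[i] if x[i] == 0, hw[i] if x[i] == 1."""
    return sum(s * (1 - xi) + xi * h for xi, s, h in zip(x, sw, hw))

def hw_area(x, area):
    """f_2 in formulation 3.2: total area of the nodes mapped to HW."""
    return sum(xi * a for xi, a in zip(x, area))

def weighted_cost(values, weights, bounds):
    """F in equation 3.3; returns None when a constraint f_i(x) <= b_i is violated."""
    if any(v > b for v, b in zip(values, bounds)):
        return None
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical three-node graph: SW times, HW times, and HW areas per node.
sw, hw, area = [4, 6, 3], [1, 2, 5], [10, 20, 30]
x = [1, 1, 0]                      # nodes 1 and 2 in HW, node 3 in SW
f1 = exec_time(x, sw, hw)          # 1 + 2 + 3 = 6
f2 = hw_area(x, area)              # 10 + 20 = 30
F = weighted_cost([f1, f2], weights=[1.0, 0.6], bounds=[10, 40])
```

Because the objectives have different units, in practice they would typically be normalized before weighting; that step is omitted here.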

3.4. Automated Partitioning Algorithms
This section discusses the automated HW/SW partitioning algorithms that were used in
this study.

3.4.1. Particle Swarm Optimization for HW/SW partitioning
Particle Swarm Optimization (PSO) is a stochastic optimization technique developed in
1995 by Kennedy and Eberhart [105]. The algorithm mimics the social behavior of animal
herds and bird flocks in searching for food. In PSO, a swarm of P particles searches for the
optimal value in a D-dimensional search space.
Each bird, or particle, communicates with other particles in the swarm to find the food. The
food location is the optimum position that could be found by the swarm. In order to search
the space effectively, each particle has two attributes: particle position Xi and velocity Vi .
During the space exploration, each particle updates its position and velocity influenced by
the best location that was discovered by that particle and also by the best location that
was discovered by the swarm. The best position that is found by the particle is called pbest
(personal best), while the best position that is found by the swarm is called gbest (global
best).
A particle uses equation 3.4 to update its velocity and equation 3.5 to update its position.
Table 3.2 defines the parameters used in equations 3.4 and 3.5, where k indicates the number
of the iteration. After initializing the positions and velocities of the swarm’s particles, the PSO
iterates until reaching the maximum number of iterations or achieving an acceptable error.
Figure 3.3 illustrates the flowchart of the PSO algorithm.

\[
V_{k+1}^{i} = W_k V_k^{i} + c_1 r_1 \left( P_k^{i} - X_k^{i} \right) + c_2 r_2 \left( P_k^{g} - X_k^{i} \right)
\tag{3.4}
\]

\[
X_{k+1}^{i} = X_k^{i} + V_{k+1}^{i}
\tag{3.5}
\]
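
The following sketch applies equations 3.4 and 3.5 once to a single particle. The function name `pso_step` is hypothetical, and the perturbation factors `r1` and `r2` are passed in explicitly (rather than drawn at random inside the function) so that the arithmetic can be reproduced by hand.

```python
def pso_step(x, v, pbest, gbest, w, c1, c2, r1, r2):
    """Apply equations 3.4 and 3.5 to one particle; all vectors have length D."""
    # Equation 3.4: inertia term + pull toward pbest + pull toward gbest.
    v_next = [w * vi + c1 * r1 * (pb - xi) + c2 * r2 * (gb - xi)
              for xi, vi, pb, gb in zip(x, v, pbest, gbest)]
    # Equation 3.5: move the particle by its new velocity.
    x_next = [xi + vn for xi, vn in zip(x, v_next)]
    return x_next, v_next

# Hypothetical two-dimensional example.
x_next, v_next = pso_step(x=[0.0, 1.0], v=[0.5, -0.5],
                          pbest=[1.0, 1.0], gbest=[1.0, 0.0],
                          w=0.5, c1=2.0, c2=2.0, r1=0.5, r2=0.5)
# v_next == [2.25, -1.25], x_next == [2.25, -0.25]
```

In the first dimension, the inertia term contributes 0.25 and each of the two attraction terms contributes 1.0, giving a velocity of 2.25.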

Figure 3.3: Particle Swarm Optimization algorithm flowchart. Flow: Start → initialize the algorithm parameters (swarm size, stopping criteria, w, c1, c2) → initialize particle positions randomly → initialize particle velocities randomly → evaluate the objective function f(x) for all particles → find the particles' best positions (pbest) and the swarm's best position (gbest) → update particle velocities and positions → stopping criteria met? (No: iterate again; Yes: output the swarm's best position) → End.

Table 3.2: PSO parameters definitions.
Parameter    Definition
c_1          cognitive parameter (self confidence constant)
c_2          social parameter (swarm confidence constant)
r_1, r_2     perturbation factors
W_k          inertia weight
X_k^i        particle position
V_k^i        particle velocity
P_k^i        particle best known position
P_k^g        swarm best known position

PSO in context of HW/SW partitioning:
When using PSO to partition a graph, each particle consists of a vector (X) and each element
in the vector represents a node of the graph. The particle’s vector is the partitioning solution.
We use a value of 0 to indicate that a node is run by SW, and 1 to indicate that a node
is offloaded to HW. For instance, particle1 in Figure 3.4 represents a feasible partitioning
solution. It indicates that all nodes are implemented in SW while particle2 indicates that
all nodes are implemented in HW. However, in particle3 the first two nodes are executed by
SW and the rest are executed by HW. In order to move a node to HW or leave it in SW, we
need to evaluate the cost for the node in SW and HW and choose the partitioning solution
that minimizes the objective function.

Figure 3.4: Particle values for partitioning a six-node graph.

3.4.2. Genetic Algorithm (GA)
GA is one of the most robust evolutionary algorithms. It mimics the survival of the fittest in
nature. Generally, GA consists of the following steps. Figure 3.5 illustrates the flowchart
of the GA.

• Chromosome representation: GA population consists of a set of chromosomes, and each
chromosome consists of a set of genes. Representation of the chromosomes and their
interpretation are dependent on the optimization problem to be solved. In HW/SW
partitioning, each chromosome is a string of ones and zeros, which represents a possible
partitioning solution. Each gene represents a node in a graph, and it has a value of 0
if the node is implemented in SW and 1 if the node is implemented in HW.
• Population initialization: After creating a set of fixed-length chromosomes, the
chromosomes are initialized randomly with values of 0 and 1.
• Fitness evaluation: In order to choose parents for the next generation, we have to
evaluate the fitness of each solution in the population. We use a cost function that
could be a maximization or minimization optimization function to evaluate the fitness.
• Selection: To guarantee the survival of good genes, GA selects chromosomes with the
best fitness values for a mating pool. In the mating pool, parents mate through a
crossover operation.
• Crossover: To mimic the mating process, GA performs a crossover operation among
parents in the mating pool. The crossover is carried out by swapping random parts of
the two parents. The resultant new chromosomes (solutions) are called offspring.
• Mutation: To avoid being trapped in a local optimum, GA changes random genes
of the chromosomes, aiming at producing higher quality chromosomes [106]. This also
improves the diversity of the population.
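
The GA steps above can be sketched for the unconstrained form of formulation 3.1 as follows. The specific operator choices (truncation selection of the better half, single-point crossover, per-gene bit-flip mutation, and carrying the best chromosome over unchanged) as well as all parameter values and node costs are assumptions made for illustration; the text does not specify which operators were used.

```python
import random

def genetic_partition(sw, hw, pop_size=20, generations=50, mutation_rate=0.05, seed=0):
    """Search for a 0/1 partition vector with low total cost (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(sw)

    def cost(x):
        # Fitness: gene 0 means the node runs in SW, gene 1 means HW.
        return sum(s * (1 - g) + h * g for g, s, h in zip(x, sw, hw))

    # Population initialization: random chromosomes of zeros and ones.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness evaluation and selection: the better half forms the mating pool.
        population.sort(key=cost)
        pool = population[: pop_size // 2]
        offspring = [pool[0][:]]  # assumed elitism: keep the best chromosome as is
        while len(offspring) < pop_size:
            p1, p2 = rng.sample(pool, 2)
            # Crossover: one random cut point, head of p1 joined to tail of p2.
            cut = rng.randint(1, n - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        population = offspring
    best = min(population, key=cost)
    return best, cost(best)

# Hypothetical six-node graph with per-node SW and HW costs.
best, best_cost = genetic_partition(sw=[5, 1, 7, 2, 9, 3], hw=[2, 4, 1, 6, 3, 8])
```

For this separable cost the optimum is simply the cheaper option per node; the GA becomes useful once constraints or communication costs couple the nodes.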


Figure 3.5: Genetic Algorithm flowchart. Flow: Start → initialize population size, number of generations, and chromosome dimensions → initialize chromosome values randomly → evaluate the objective function f(x) for all chromosomes → select parents for the mating pool → perform the crossover operation among parents → mutate the offspring → stopping criteria met? (No: iterate again; Yes: output the population's best chromosome) → End.

CHAPTER 4
SYNERGIC EXECUTION TECHNIQUES FOR A CPU-FPGA PLATFORM:
CANNY EDGE DETECTOR AS A CASE STUDY
The processing demands of current and emerging applications, such as image/video processing, are increasing due to the deluge of data, generated by mobile and edge devices.
This raises challenges for a vast range of computing systems, starting from smart-phones
and reaching cloud and data centers. In this chapter, we used Intel’s HARP to accelerate
a sliding window based image processing algorithm, Canny edge detector. We accelerated
Canny using two different implementations: code partitioned and data partitioned. In the
data partitioned implementation, we proposed a weighted round robin based algorithm that
partitions input images and distributes the load between the CPU and the FPGA based on
latency. This chapter also compares the performance of the proposed accelerators with separate CPU and FPGA implementations. We evaluated the benefits of application partitioning
using the Canny algorithm in terms of execution time and energy consumption.
Regarding the nature of workloads, 80% of Internet traffic is video and 3.2 billion images
are shared every day on social media websites [103]. Moreover, tweets that contain images
receive 150% more retweets than text-only tweets, and LinkedIn image posts receive 200% more interaction
than text-only posts [107]. Sliding window based image processing algorithms are among
the widely used algorithms in image processing applications such as computer vision and
image segmentation [108]. An edge detector, a sliding window-based algorithm, identifies
edges in digital images through a compute and data-intensive convolution process [109]. The
majority of the literature focused on offloading compute-intensive components to FPGAs,
leaving CPU cycles unused. In addition, most of the emerging high-performance algorithms
such as big data and machine learning algorithms do not fit completely on one FPGA board.
In this work, we use a hybrid CPU-FPGA system and overlap execution on both the CPU
and the FPGA to increase system performance and reduce energy consumption. We designed
and implemented an efficient hybrid algorithm for Canny edge detector, as a case study, on a
heterogeneous CPU-FPGA architecture using OpenCL. We utilized a delay-based Weighted
Round Robin (WRR) algorithm to partition and distribute images between the CPU and
the FPGA. In addition, we partitioned the computation between the CPU and the FPGA
to increase the overall system throughput, reduce latency, and improve resource utilization.
Although an FPGA-only implementation outperforms a CPU-only implementation of
Canny by more than 2X in terms of execution time, our proposed hybrid data-partitioned
implementation boosts the throughput of the entire system and reduces the total execution
time even further. The hybrid implementation outperforms the CPU-only implementation by
up to 4.8X, and the FPGA-only implementation by up to 2.1X. We also estimated the total
energy consumed by the different implementations. Our hybrid CPU-FPGA implementation
results in 55% reduction in energy consumption, on average, compared to the CPU-only
implementation.

4.1. Background and Related Work
This section presents the fundamental background about Canny edge detector and discusses some of the related work.

4.1.1. Sliding Window Based Edge Detectors: Canny Algorithm
Edge detection is used in many image processing applications such as image segmentation
and computer vision. It is considered an efficient preprocessing stage to reduce the amount
of data to be processed by filtering noise and outliers in digital images. Canny filter is
a powerful edge detection algorithm [110] [111], but it is a compute-intensive one. The
algorithm receives a Red-Green-Blue (RGB) image as an input and produces an enhanced
gray-scale edge detected image through a multi-stage image processing framework. The
algorithm consists of the following steps:
• Step 1: Image Blurring
The Canny algorithm filters out the noise in an image using a Gaussian filter as a
preprocessing stage. Blurring (also known as smoothing) the input image mitigates
generating false edges in the output image. Figure 4.1 shows a 3×3 blurring mask
(Gaussian filter).

1/16   1/8   1/16
1/8    1/4   1/8
1/16   1/8   1/16

Figure 4.1: A 3×3 Gaussian Filter.

• Step 2: Derivatives Magnitude and Orientation
In this stage, Canny calculates derivatives of the blurred image in the horizontal and
the vertical directions using horizontal and vertical masks of Sobel filter, as shown
in Figure 4.4. The blurred image is convolved with these masks to generate vertical
and horizontal derivatives. Then the vertical and horizontal derivatives are used to
calculate the magnitude and direction of the total derivative for each pixel as shown
in Figure 4.5.


+1    0   -1
+2    0   -2
+1    0   -1

Figure 4.2: Sobel vertical filter.

+1   +2   +1
 0    0    0
-1   -2   -1

Figure 4.3: Sobel horizontal filter.
Figure 4.4: Vertical and horizontal operators of Sobel filter.

• Step 3: Non-maximal Suppression
In this step, most of the weak edges are eliminated, set to zero, using non-maximal
suppression. The strength of each pixel is compared to a set of pixels in a certain
neighborhood window for each direction. If the pixel’s gradient is the largest in that
window, its value will be preserved. The value of the other pixels in the window will
be suppressed to zero.
• Step 4: Hysteresis
Remaining weak edges could be either real edges or noise. In order to differentiate
between them, a connected component labeling technique is used to connect weak
edges to strong edges in a predefined neighborhood. In case there is no strong edge in
the neighborhood, weak edges are eliminated by setting them to zero [112].
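
Steps 1 and 2 above can be sketched in a few lines of pure Python. The helper names `apply_mask` and `gradient`, the choice to leave border pixels at zero, and the use of `atan2` in place of `atan` (to avoid dividing by a zero horizontal derivative) are assumptions of this sketch; the masks are applied without flipping. Mapping the vertical mask to Gx and the horizontal mask to Gy is likewise assumed.

```python
import math

GAUSS = [[1/16, 1/8, 1/16], [1/8, 1/4, 1/8], [1/16, 1/8, 1/16]]   # Figure 4.1
SOBEL_V = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]                    # Figure 4.2
SOBEL_H = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]                    # Figure 4.3

def apply_mask(img, mask):
    """Slide a 3x3 mask over a 2D list; border pixels are left at 0 (assumed)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            out[r][c] = sum(mask[i][j] * img[r - 1 + i][c - 1 + j]
                            for i in range(3) for j in range(3))
    return out

def gradient(gray):
    """Step 1 (blur) followed by Step 2 (gradient magnitude and orientation)."""
    blurred = apply_mask(gray, GAUSS)
    gx = apply_mask(blurred, SOBEL_V)   # derivative along the horizontal axis
    gy = apply_mask(blurred, SOBEL_H)   # derivative along the vertical axis
    mag = [[math.hypot(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
    theta = [[math.atan2(b, a) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
    return mag, theta

# A constant image has no edges, so the gradient at the center pixel is zero.
flat = [[8] * 5 for _ in range(5)]
mag, theta = gradient(flat)
```

Each output pixel depends only on a 3×3 neighborhood of the input, which is why sliding window stages of this kind map well to pipelined hardware.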

Figure 4.5: Canny algorithm flow chart. Flow: colored (RGB) image → gray-scale image → Step 1: image smoothing → Step 2: Sobel X derivative (Gx) and Sobel Y derivative (Gy), followed by edge magnitude and orientation calculations, G = sqrt(Gx^2 + Gy^2) and Θ = atan(Gy / Gx) → Step 3: non-maximal suppression → Step 4: hysteresis → edge-detected image.

4.1.2. Related Work
The Canny edge detection algorithm is an efficient yet computation-intensive algorithm.
Therefore, there are many efforts to accelerate the algorithm using GPGPUs [113][114] and
FPGAs [115][116][117]. To the best of our knowledge, in these implementations, the entire
computation is offloaded to the GPGPU or the FPGA leaving the CPU idle waiting for the
final result from the co-processor; none of the previous works overlapped the CPU and the
accelerator execution. However, in our implementation, we exploit all the processing units
in the system (CPU and FPGA) to accelerate the algorithm by overlapping the execution
between the CPU and the FPGA. In this way, both the CPU and the FPGA take a portion
of the processing/workload in order to minimize the overall execution time.
Lourenco et al. [113] implemented the Canny algorithm using Compute Unified Device
Architecture (CUDA), and they tested their implementation on three generations of Nvidia
GPGPUs. The performance of their implementation was compared to an OpenCV-based
CPU implementation. The results showed that their implementation outperforms the CPU-only
implementation. Xu et al. [111] proposed a hardware accelerator for the Canny edge detection algorithm by offloading the entire image frame to the FPGA. They accelerated the non-maximal suppression stage of the algorithm by predicting maximum and minimum threshold
values instead of using frame level statistics. The algorithm was implemented on a Xilinx
FPGA board. The results show 64% resource utilization and 87% BRAM memory utilization.
It took 0.751 ms to process a 250 kilo pixels frame. The proposed algorithm targets real-time
embedded systems. Another pure hardware accelerator for Canny algorithm is presented by
Neoh and Asher [118]. They used two FPGA boards, Stratix II EP2S60 and Stratix EP1S20
to accelerate the algorithm. The authors provided an analysis for resource utilization of the
different stages of the algorithm on the FPGA. Gupta [119] proposed and implemented a
model that computes traffic load for real-time traffic signal control utilizing edge detection.
He used a Xilinx FPGA in conjunction with MATLAB Simulink to implement the proposed
model. However, in these implementations, the algorithm was offloaded completely to a
separate FPGA board and CPU cycles were left unused. Quinne et al. [120] proposed a
HW/SW co-design framework to accelerate image processing pipelines. It partitions the
image pipeline components between hardware and software using exact and heuristic algorithms, where the number of components is limited to 20. This method boosts performance
by offloading certain components to the FPGA. Gentsos et al. [121] designed and implemented a parallel architecture to process Canny algorithm in real time. They processed
four pixels at a time instead of one pixel to maximize throughput. Their implementation
demonstrated the ability to improve the total throughput with a minor increase in resource
utilization using Xilinx FPGAs. He and Yuan [117] optimized the thresholding stage of the
Canny algorithm by automatically setting the threshold value by the algorithm itself. A
pipelined implementation of the new algorithm was realized on an FPGA board. The new
algorithm improved the quality of the resultant edges compared to the original algorithm.
We use OpenCL to accelerate Canny on a heterogeneous CPU-FPGA platform. Some of the
studies used an approximate implementation of the algorithm. In this implementation, they
used predefined statistics values related to the image frames instead of calculating them by
iterating over the image pixels [122]. This approximation reduces the execution time of the
algorithm at the expense of the accuracy of the resultant edge. The aforementioned studies
focused on speeding up the implementation of the algorithm. There are other studies, on
the other hand, that focused on improving the quality of image edges [117].
Unlike previous works, we aim to improve the utilization of both the CPU and
the FPGA to further increase performance. While previous works offloaded the algorithm or
a portion of it to a separate FPGA chip, our implementation relies on close integration between
the FPGA and the CPU, where they both share the same memory to increase performance.

4.2. Hybrid CPU-FPGA Acceleration for Canny Algorithm
We designed two different hybrid CPU-FPGA implementations that simultaneously use
both a CPU and an FPGA to accelerate Canny. In the first implementation, we partitioned the
algorithm to process certain parts on the CPU and other parts on the FPGA. In the second
implementation, both the CPU and the FPGA execute the entire algorithm on different
parts of the image (different image tiles) to maximize performance.

4.2.1. Canny Code Partitioning between the CPU and the FPGA
We partitioned the Canny code to overlap the execution between the CPU and the FPGA,
where each processes only part of the algorithm in a cooperative way to produce the output.
Code partitioning is an efficient way to increase the performance of algorithms [123]. Figure 4.6 shows the Canny algorithm partitioned across the CPU and the FPGA, running on a
HARP node.

Figure 4.6: CPU-FPGA code-partitioned processing for Canny algorithm [104]. Diagram labels: the CPU core (with cache controller and LLC) performs grayscale conversion and image smoothing on the input image held in shared main memory; the QPI and PCIe interconnections link the CPU to the FPGA (VC steering logic, FPGA Interface Unit (FIU), cache); the AFU on the FPGA performs the image derivative, non-maximum suppression, and hysteresis stages.

In this method, the CPU is responsible for processing Step 1 of the algorithm, gray-scale
conversion and blurring, for the entire input image. Once the CPU processes a portion of
the image consisting of several windows, it sends the blurred gray-scale windows to the FPGA
to continue processing the rest of the algorithm. The FPGA then starts processing those
windows while the CPU continues processing Step 1 for the rest of the image. The kernel on
the FPGA processes the image in pipelined hardware for Steps 2 to 4 of Canny. Each work item in OpenCL
is processed through one pipeline. This process repeats until the CPU finishes Step 1 for
the entire image and sends it to the FPGA, and the FPGA finishes Steps 2 to 4 for the
entire image and stores the final image into the shared DRAM. The CPU’s and the FPGA’s
calculations overlap; however, the FPGA is at first idle until the CPU finishes processing Step
1 for some windows. Although we partitioned the code between the two processing units to
achieve higher performance and increase simultaneous processing, the CPU and the FPGA
each stay idle for some portion of the time. This opens another opportunity to optimize the
hybrid algorithm by increasing the overlap of the execution between both units, as discussed
below.
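
The overlap in the code-partitioned scheme just described can be illustrated with a software analogy: a producer thread plays the role of the CPU (Step 1) and a consumer thread plays the role of the FPGA (Steps 2 to 4), connected by a queue. This is only a sketch of the hand-off pattern, not the OpenCL host or kernel code used on HARP; `run_pipeline` and the stage callables are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the window stream

def run_pipeline(windows, stage1, stages2to4):
    """Overlap stage1 (CPU role) with stages2to4 (FPGA role) through a queue."""
    handoff = queue.Queue()
    results = []

    def producer():
        for win in windows:
            handoff.put(stage1(win))      # Step 1 finished for this window
        handoff.put(_DONE)

    def consumer():
        while True:
            item = handoff.get()          # blocks (idle) until a window arrives
            if item is _DONE:
                break
            results.append(stages2to4(item))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Hypothetical stand-ins for the real stages.
out = run_pipeline([1, 2, 3], stage1=lambda w: w * 2, stages2to4=lambda w: w + 1)
# out == [3, 5, 7]
```

The blocking `get` corresponds to the FPGA idling until the first windows arrive, and the producer finishing early corresponds to the CPU idling once Step 1 is complete for the whole image.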

4.2.2. Delay-based Weighted Round Robin Distribution of the Workload between the CPU and the FPGA
In the second hybrid CPU-FPGA implementation of Canny, we aimed to maximize the
CPU-FPGA processing overlap by allowing both to execute the same kernel on different parts
of the image simultaneously. We first split the image into tiles. The tiles were distributed
to the CPU and the FPGA using a Weighted Round Robin (WRR) algorithm. Tiling is
efficient in hyperscale data centers, where billions of images need to be processed in a real-time fashion. It allows pipelined image processing algorithms to be more efficient and meet
timing constraints.
After tiling the source image, tiles are distributed to the available system processing units
(CPU and FPGA) in such a way that minimizes execution time and maximizes system
throughput. WRR is a simple yet efficient scheduling technique that has demonstrated its effectiveness in different fields. Weights are assigned to the CPU and the FPGA using the
CPU-to-FPGA ratio of the execution time of a single tile. The tiles are then distributed between the
CPU and the FPGA based on their per-tile execution times. This delay-aware
split of tiles between the CPU and the FPGA boosts the parallelism of the system through
simultaneous handling of different data sets by the different devices. Figure 4.7 abstracts
the tiling and distribution process of an image between the CPU and the FPGA. We used
the equation below to assign weights to the CPU and the FPGA.

\[
W_{CPU} = \frac{\text{tile exec. time}_{CPU}}{\text{tile exec. time}_{FPGA}}
\tag{4.1}
\]

Since the FPGA was found to be approximately twice as fast as the CPU for the Canny algorithm, the CPU is assigned double the weight of the FPGA. This method can
be used for many other heterogeneous architectures and algorithms to balance the load of
the different architectures and maximize performance. Furthermore, fine-grained tiles can
be used to achieve higher processing parallelism and thus performance. Figure 4.8 shows a
block diagram of the CPU-FPGA system with an illustration of Canny’s data-partitioned
implementation.
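
A delay-based weighted round robin split can be sketched as below. The sketch computes the ratio of Equation 4.1 and then assumes that, per scheduling round, the FPGA receives that many tiles for each tile given to the CPU, so that both devices finish at about the same time. This mapping from weight to tiles per round, the function name `wrr_schedule`, and the assumption that the FPGA is at least as fast as the CPU are illustrative choices, not details confirmed by the text.

```python
def wrr_schedule(num_tiles, tile_time_cpu, tile_time_fpga):
    """Return one 'CPU' or 'FPGA' label per tile, in dispatch order."""
    ratio = tile_time_cpu / tile_time_fpga       # Equation 4.1
    fpga_per_round = max(1, round(ratio))        # assumed: FPGA tiles per CPU tile
    pattern = ["FPGA"] * fpga_per_round + ["CPU"]
    return [pattern[i % len(pattern)] for i in range(num_tiles)]

# Hypothetical per-tile times in which the FPGA is twice as fast as the CPU.
plan = wrr_schedule(num_tiles=6, tile_time_cpu=2.0, tile_time_fpga=1.0)
# plan == ['FPGA', 'FPGA', 'CPU', 'FPGA', 'FPGA', 'CPU']
```

With these numbers each round costs the CPU one tile time (2.0) and the FPGA two tile times (2 × 1.0), so neither device waits on the other.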
4.3. Experimental Setup and Evaluation Metrics
We used one HARP v2 node to accelerate Canny. As mentioned earlier, HARP’s nodes
each consist of a closely coupled CPU and FPGA. We used Intel FPGA SDK tool version
16.01 for OpenCL. This tool provides a compiler to build and run OpenCL targeting Intel
FPGAs.

Figure 4.7: Hybrid CPU-FPGA processing of images [104].

Figure 4.8: CPU-FPGA tile-based processing for Canny edge detection algorithm [104]. Diagram labels: the CPU core (with cache controller and LLC) performs image tiling and splitting on the input image held in shared main memory and runs a compute kernel instance; the QPI and PCIe interconnections link the CPU to the FPGA (VC steering logic, FPGA Interface Unit, cache); the AFU on the FPGA processes tiles in workgroups (WG1, WG2).

Figure 4.9: Tile generation and padding [104].

We used different image sizes as inputs to the Canny algorithm (from 0.5 megapixels to 8
megapixels). We implemented two different hybrid CPU-FPGA implementations, a CPU-FPGA code-partitioned implementation and a CPU-FPGA data-partitioned implementation.
In the data-partitioned implementation, we partitioned the input images into tiles to be
processed by the CPU and FPGA simultaneously. Figure 4.9 shows the process of tiling the
source image and padding the tiles with neighboring pixels from the source image for correct
functionality.
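
The tiling and padding step can be sketched as follows. A one-pixel halo is assumed because the masks in Section 4.1.1 are 3×3; the function name `make_tiles` and the clamping of the halo at the image borders are illustrative assumptions.

```python
def make_tiles(img, tile_h, tile_w, halo=1):
    """Split a 2D list into tiles, each extended by `halo` neighboring pixels."""
    h, w = len(img), len(img[0])
    tiles = []
    for r0 in range(0, h, tile_h):
        for c0 in range(0, w, tile_w):
            # Extend the tile by the halo, clamped to the image borders.
            top, left = max(r0 - halo, 0), max(c0 - halo, 0)
            bottom = min(r0 + tile_h + halo, h)
            right = min(c0 + tile_w + halo, w)
            padded = [row[left:right] for row in img[top:bottom]]
            tiles.append(((r0, c0), padded))   # keep the tile origin for reassembly
    return tiles

# Hypothetical 4x4 image split into four 2x2 tiles, each padded to 3x3.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = make_tiles(img, tile_h=2, tile_w=2)
# tiles[0] == ((0, 0), [[0, 1, 2], [4, 5, 6], [8, 9, 10]])
```

Without the halo, pixels on a tile boundary would be convolved against missing neighbors and the stitched output would show seams along the tile edges.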
We measured the speedup of the accelerated algorithm by measuring its execution time.
We compared the execution time of the hybrid CPU-FPGA implementation of Canny to
both CPU-only and FPGA-only implementations. In addition, we estimated the energy consumption of the CPU-only, FPGA-only, and hybrid implementations. The CPU’s energy
consumption was estimated based on Intel’s power documentation for Xeon processors [124],
and the FPGA’s energy consumption was estimated using Power-Play [125], which is a power
estimator for Intel Arria 10 devices. Using this tool, the FPGA energy consumption can be
estimated through both its resource utilization and operating frequency. We evaluated the
energy delay product (EDP) of our hybrid implementation.

4.4. Experimental Results and Discussion
In this section, we present and discuss the experimental results of the two hybrid CPU-FPGA implementations discussed previously. We also compare the results with those for the
CPU-only and FPGA-only runs.

4.4.1. Code Partitioning
As mentioned in Section 4.2.1, we partitioned the Canny algorithm between the CPU and the
FPGA. This approach reduces the total execution time of the algorithm compared to a CPU-only implementation. This implementation is also more energy efficient
than a CPU-only implementation. This is because the compute-intensive portion of the code
is offloaded to a power-efficient processing unit, the FPGA. Code partitioning accelerated the
algorithm about twofold compared to a CPU-only implementation, as shown in Figure 4.10.
Compared to the FPGA-only implementation, code partitioning accelerates the algorithm by up to 1.5X for images smaller
than four megapixels. However, for larger images, four megapixels and bigger, the FPGA-only implementation slightly outperforms the code-partitioned implementation. The reason
behind this is the communication overhead of passing the partially processed image from
the CPU to the FPGA to be completely processed and then writing the final output image
back to the shared memory.

4.4.2. Data Partitioning
In the code-partitioned approach, although the CPU and the FPGA run parts of the
algorithm at the same time, they can be idle for certain times. The CPU will be idle after
it finishes processing Step 1 of the algorithm for the entire image. Additionally, the FPGA
will be waiting for the CPU to finish Step 1 for several windows to start processing Steps 2
to 4 for several windows in parallel, due to the parallel nature of the FPGA. As such, an
implementation that totally overlaps CPU and FPGA processing is desired to further increase
performance. In the second approach, we partitioned input images into tiles as discussed
previously in Section 4.2.2. Tiles are mapped to the CPU or the FPGA dynamically using the
WRR algorithm. Each tile is assigned to either the CPU or the FPGA, then processed, and
the output of the edge-detected image is written back to the shared memory.

Figure 4.10: Execution time for CPU-only, FPGA-only, and CPU-FPGA code-partitioned implementations [104]. (Chart; x-axis: image size, 0.5 to 8 Mpixels; y-axis: execution time in ms.)

Figure 4.11 depicts the effect of different tile sizes on the total execution time. Different
tile sizes were tested for all the images. As shown in the figure, a tile size of 250 kilo-pixels
produces the minimum execution time for all the tested images. Smaller tile sizes increase
the total execution time for all image sizes. This is due to the overhead of the padding
pixels that wrap the image tiles after tiling. For instance, tiling a four-megapixel image
using a tile size of 50 kilo-pixels results in a fivefold increase in padding pixels compared to a tile
size of 250 kilo-pixels. These additional pixels mean extra reading, processing, and writing
time. On the other hand, increasing the tile size also has a negative impact on the total
execution time. This is because of the performance difference between the CPU and the
FPGA, so coarse-grained tiles would result in a reduced overlap of the CPU and
FPGA processing times, and thus reduced performance. As such, we chose a tile size of
250 kilo-pixels for our experiments. We calculated the overhead of the padding pixels and
arrived at equation 4.2 to calculate the additional padding of the tiles, where padding is
the number of additional padding pixels.

Figure 4.11: Execution time of different images using different tiles sizes [104]. (Plot; x-axis: tile size in kilo-pixels, 0 to 1000; y-axis: execution time in ms; one series per image size: 1, 2, 4, and 8 Mpixels.)

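\[
\begin{aligned}
\text{padding} ={} & 2 \cdot (\text{No.\_of\_tiles}) \cdot (\text{tile\_width} + 2) \\
& + 2 \cdot (\text{No.\_of\_tiles}) \cdot (\text{tile\_height} + 2)
\end{aligned}
\tag{4.2}
\]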
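
Equation 4.2 translates directly into a small helper. The function name and the example tile dimensions are hypothetical; a square 500×500 tile is assumed here only because it yields the 250 kilo-pixel tile size mentioned above.

```python
def padding_pixels(num_tiles, tile_width, tile_height):
    """Equation 4.2: total number of padding pixels added across all tiles."""
    return (2 * num_tiles * (tile_width + 2)
            + 2 * num_tiles * (tile_height + 2))

# Hypothetical case: a 4-megapixel image cut into 16 tiles of 500 x 500 pixels.
extra = padding_pixels(num_tiles=16, tile_width=500, tile_height=500)
# extra == 32128
```

The two terms count the top and bottom halo rows and the left and right halo columns of each tile, with the `+ 2` covering the corner pixels.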
Figure 4.12 shows the performance achieved using our tile-based hybrid implementation
compared with the CPU-only and FPGA-only implementations for different image sizes. For
example, for a two-megapixel image, the hybrid implementation achieves a speedup of 4.8X
over the CPU-only implementation and 2.1X over the FPGA-only implementation. For a
one-megapixel image, the hybrid implementation results in a 2.2X speedup over the CPU-only
implementation and no noticeable speedup over the FPGA-only implementation. This is because,
for small images, the CPU becomes a bottleneck and its execution time can dominate the
FPGA’s execution time: the FPGA finishes consuming its data while the CPU is still
processing its part. The CPU bottleneck can be relieved by assigning the FPGA a higher
weight, leaving the CPU with only a small portion of the frame.
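One way to picture such a weighting is as a split of the tile count between the two devices. The following sketch is hypothetical: the function name, the meaning of the weight (fraction of tiles sent to the FPGA), and the rounding rule are assumptions, not the scheduling policy used in the experiments.

```python
from typing import Tuple


def split_tiles(num_tiles: int, fpga_weight: float) -> Tuple[int, int]:
    """Return (cpu_tiles, fpga_tiles) for a frame cut into num_tiles tiles.

    fpga_weight is the (assumed) fraction of tiles assigned to the FPGA.
    """
    if not 0.0 <= fpga_weight <= 1.0:
        raise ValueError("fpga_weight must be in [0, 1]")
    fpga_tiles = round(num_tiles * fpga_weight)
    return num_tiles - fpga_tiles, fpga_tiles


# A higher FPGA weight leaves the CPU with only a small share of the frame.
cpu_tiles, fpga_tiles = split_tiles(num_tiles=16, fpga_weight=0.75)
print(cpu_tiles, fpga_tiles)
```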
We also estimated the energy consumption of the Canny algorithm on the different
architectures. The hybrid CPU-FPGA implementation reduces energy consumption by up to
approximately 73%, and by 55% on average, across the different image sizes compared to the
CPU-only implementation. This reduces the total energy consumption of such heterogeneous
CPU-FPGA servers and also reduces cooling requirements. We also calculated the Energy Delay
Product (EDP) for the different implementations because, in high-performance and cloud
computing, execution time remains significant even when lower energy consumption is desired.
Figure 4.13 shows the EDP for the CPU-only, FPGA-only, and hybrid CPU-FPGA implementations.
The figure shows that the EDP of the hybrid implementation is close to that of the FPGA-only
implementation and much lower than that of the CPU-only implementation. It also shows that
for image sizes below 4 megapixels, the EDP values of the hybrid and FPGA-only
implementations are similar.
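The EDP itself is simply energy multiplied by execution time. The sketch below shows the arithmetic under the assumption that energy is estimated as average power times execution time; the function names and the example numbers are illustrative and are not measurements from this work.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """EDP in joule-seconds: energy multiplied by execution time."""
    return energy_joules * delay_seconds


def edp_from_power(power_watts: float, delay_seconds: float) -> float:
    """EDP when energy is estimated as power x time (an assumption here)."""
    energy_joules = power_watts * delay_seconds
    return energy_delay_product(energy_joules, delay_seconds)


# Hypothetical numbers: a 10 W device that finishes in 0.5 s.
print(edp_from_power(power_watts=10.0, delay_seconds=0.5))
```

Since the delay enters twice under this estimate, a faster implementation lowers the EDP quadratically, which is why EDP rewards execution time more strongly than energy alone.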
In order to reduce the hybrid implementation’s execution time, we implemented further
optimizations for the hardware accelerator. By default, each OpenCL kernel is implemented
in hardware as a single compute unit. Adding multiple compute units, each with its own
separate memory interfaces, can increase performance. All compute units are
capable of processing multiple workgroups simultaneously. We varied the number of compute
units from one to ten. The execution time decreases until the number of compute units
reaches a certain limit (3 compute units). This is due to contention among the compute
units while accessing the FPGA’s cache. Moreover, increasing the number of compute units increases kernel

Figure 4.12: Execution time for CPU-only, FPGA-only, and CPU-FPGA hybrid implementations [104].
[Plot: execution time (ms) versus image size (0.5, 1, 2, 4, and 8 Mpixels); series: CPU-only, FPGA-only, Hybrid CPU-FPGA.]

resource utilization of the FPGA, in addition to increasing power consumption. Hence, the
number of compute units is subject to power and area constraints of the design. Figure 4.14
illustrates the effect of increasing the number of CUs and the SIMD lanes of an OpenCL
kernel on the memory interface.
Another optimization technique that we studied is bundling work items within the same
workgroup into Single Instruction Multiple Data (SIMD) lanes. A workgroup with multiple
SIMD lanes results in a slightly lower execution time compared to a workgroup without
SIMD. This is because using multiple SIMD lanes increases the amount of parallel computation
within the same workgroup. Moreover, SIMD behavior allows the hardware compiler to
coalesce memory accesses. Although both multiple compute units and multiple SIMD lanes
increase a kernel’s performance, using multiple SIMD lanes does not consume additional

Figure 4.13: Energy delay product for CPU-only, FPGA-only, and CPU-FPGA hybrid implementations [104].
[Plot: EDP (joule-seconds, 0 to 2500) versus image size (8, 4, 2, 1, and 0.5 megapixels); series: CPU-only, FPGA-only, Hybrid CPU-FPGA DP.]
Table 4.1: FPGA resource usage and frequency for different kernel implementations.

Kernel          Logic        ALUTs  Dedicated logic  Memory  DSP     Fmax
implementation  utilization         registers        blocks  blocks  (MHz)
CUs = 1         52%          24%    30%              28%     13%     271.29
CUs = 2         56%          25%    32%              33%     13%     256.08
CUs = 3         60%          27%    34%              37%     14%     238.3
CUs = 4         63%          28%    36%              42%     14%     223.01
CUs = 5         65%          28%    38%              47%     15%     220
CUs = 6         68%          30%    40%              15%     219     219
CUs = 7         72%          31%    42%              56%     16%     218
CUs = 8         75%          32%    44%              60%     16%     217.2
CUs = 9         78%          33%    46%              65%     17%     216.05
CUs = 10        81%          34%    48%              73%     17%     215.96
SIMD = 2        52%          24%    30%              28%     13%     245
SIMD = 4        52%          25%    30%              29%     13%     245
SIMD = 8        54%          25%    32%              31%     13%     273

FPGA resources compared to increasing the number of compute units. Table 4.1 illustrates
resource utilization and the frequency for different implementations of the OpenCL kernel


Figure 4.14: OpenCL kernel with multiple compute units and SIMD lanes [126].

using multiple compute units and multiple SIMD lanes on the FPGA.
Different Canny accelerators are compared with our approach in Table 4.2. To compare the
results of the different implementations fairly, we normalized the execution time
with respect to an image of size 256x256 and an FPGA operating at 100 MHz. As shown
in the rightmost column of the table, our approach achieves the best result. In addition,
the proposed technique is highly scalable compared to all the other implementations, as we
distribute the load between the CPU and the FPGA. We expect that our approach will
give much better results for large image sizes.
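The normalization can be sketched as two linear scalings, one by pixel count and one by clock frequency. This form is an assumption inferred from the values in Table 4.2 rather than a formula stated in the text, and the function name is hypothetical. The table rounds the intermediate reference-image time, so its last column can differ slightly from the unrounded result.

```python
REF_PIXELS = 256 * 256      # reference image size used for normalization
REF_FREQ_MHZ = 100.0        # reference clock frequency


def normalized_time_ms(exec_ms: float, width: int, height: int,
                       freq_mhz: float) -> float:
    """Scale a measured time to the reference image and reference clock.

    Assumes execution time is proportional to the pixel count and inversely
    proportional to the clock frequency.
    """
    ref_image_ms = exec_ms * REF_PIXELS / (width * height)
    return ref_image_ms * freq_mhz / REF_FREQ_MHZ


# Row "Our approach" of Table 4.2: 1024x2048 image, 250 MHz, 6.9 ms.
print(normalized_time_ms(6.9, 1024, 2048, 250.0))
```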

4.5. Summary
In this chapter, we designed and implemented two different approaches for accelerating
a sliding window-based image processing algorithm, Canny, using OpenCL. In the first approach, the code was partitioned among the CPU and the FPGA, and resulted in up to

Table 4.2: Execution time comparison among GPGPU and FPGA Canny
accelerators and our CPU-FPGA hybrid implementation.

Publication              Image size  Device                   Frequency  Execution  Execution time for the  Normalized
                                                              (MHz)      time (ms)  reference image (ms)    time (ms)
Lourenco et al. [113]    321x481     GPGPU (Fermi)            1150       2.3        0.98                    11.27
Luo & Duraiswami [127]   512x512     GPGPU (GTX 200)          768        3.4        0.85                    6.5
Lee et al. [115]         1280x720    Xilinx Virtex V          500        2.225      0.16                    0.8
Rao & Venkatesan [128]   256x256     Xilinx Virtex E          16         4.2        4.2                     0.67
Li et al. [116]          512x512     Xilinx Virtex 5          208        5.24       1.3                     2.7
Our approach             1024x2048   HARP v2 (Altera GX1150)  250        6.9        0.22                    0.55

2.3X speedup compared to a CPU-only implementation. In the second approach, both the
CPU and the FPGA executed the entire algorithm on different portions of the image. The
high-bandwidth and low-latency interconnections between the CPU and the FPGA, in addition
to the shared system memory, were utilized to improve the overall performance of this
hybrid implementation. Our implementation outperforms the CPU-only and FPGA-only
implementations by up to 4.8X and 2.1X, respectively. It also results in a 55.3% reduction in
energy consumption, on average, compared to the CPU-only implementation. In addition,
its energy-delay product is comparable to that of the FPGA-only implementation. We also
studied the effect of using different kernel attributes, such as multiple compute units and
multiple SIMD lanes, for the hardware accelerator implemented on the FPGA. Our results showed
that efficient utilization of hardware accelerators in a heterogeneous computing environment
has a significant impact on execution time. Finally, we aim to extend our hybrid design to
a cluster of CPU-FPGA nodes in the future.


CHAPTER 5
AUTOMATED HW/SW PARTITIONING OF CLOUD APPLICATIONS
USING HEURISTIC OPTIMIZATION ALGORITHMS
In this chapter, we partition cloud applications using heuristic optimization algorithms.
These applications include the k-means clustering algorithm, the Canny edge detection
algorithm, and the Advanced Encryption Standard (AES) algorithm. We used the Genetic
Algorithm (GA) and Particle Swarm Optimization (PSO), which are efficient population-based
stochastic optimization algorithms. However, these algorithms might become trapped in a local
optimum and suffer from premature convergence [129]. In addition, the speed and solution
cost of PSO depend mainly on its parameters, such as the inertia value, the cognitive
parameter, and the social parameter. Research studies that have utilized PSO [39], [40] used
generally accepted values for these parameters. Optimal PSO parameters might differ from one
application to another, which has not been considered in previous studies. In this chapter,
a Neural Network (NN) was trained to tune the PSO parameters by choosing the best combination
of its acceleration parameters for different applications. The features used to train the NN
represent workload and architecture characteristics. Also, GA and PSO were extended using a
distributed greedy local search mechanism to mitigate their premature convergence and
improve the accuracy of the resultant solution.
This dissertation evaluates PSO for code partitioning between the CPU and the FPGA.
As illustrated in Chapter 3, the acceleration parameters of PSO include c1, c2, and w, and
play a significant role in exploring the design space [130]. These parameters significantly
impact the convergence speed and the accuracy of PSO [131], [132]. Moreover, they might
vary from one application to another to better suit a particular application. The values of
these parameters have usually been treated as fixed. For instance,
previous studies used values in the range [0.4 - 0.9] for w [132], and in the range [0 - 2] for
c1 and c2 [32], [41], [131], [133]. In this work, we found that values other than the
aforementioned values improve the computational efficiency of PSO.
Many nature-inspired algorithms, such as GA and PSO, might suffer from premature
convergence. In this situation, GA and PSO converge to a local optimum. Moreover, in some
severe cases the fitness of the solution generated by these algorithms might decrease. The
reason behind premature convergence is the loss of diversity, which means that most of the
individuals in the population become similar in their fitness value. Different optimization
algorithms experience premature convergence with different severity. For instance, GA suffers
from premature convergence more than PSO. This is due to its evolutionary operators, which
might produce similar generations. On the other hand, PSO is less subject to premature
convergence, as it is not an evolutionary algorithm and does not use evolutionary operators
such as crossover. In order to prevent premature convergence, a technique that enhances the
population diversity was implemented in this study. This technique is a distributed local
search mechanism that aims at finding neighboring solutions with a better fitness value than
the current solutions. This technique demonstrated its ability to mitigate the effect of
premature convergence at the expense of increasing the algorithms’ run time.

5.1. Experimental Setup
This section describes how k-means, Canny, and AES applications were modeled. It also
discusses the optimization of the PSO algorithm to improve the performance of the HW/SW
partitioning process.


5.1.1. Modeling Applications Components
In the first stage of HW/SW partitioning, an application is modeled as a graph. To generate
a graph for each benchmark application, we utilized the Low Level Virtual Machine (LLVM)
compiler and tool-chain. LLVM is a set of modular compiler and tool-chain technologies [134].
Its compilation mainly consists of two stages, the frontend and the backend, with
optimization in between. The frontend compiles the code into Intermediate Representation
(IR) code. After a number of optimization passes, the backend compiles the IR code into
assembly code for a specific target architecture. We used the LLVM Clang frontend to generate
the IR code for the OpenCL benchmarks. Then we converted the IR code into a CFG using the
LLVM optimizer and analyzer, as depicted in Figure 5.1. We also utilized the GraphViz tool
[135] to visualize the CFG at the Basic Block (BB) level. A BB is a sequence of contiguous
instructions with one entrance and one exit.

Figure 5.1: Application modeling leveraging the LLVM compiler.

In order to evaluate the cost of each node of a graph, we need to assign cost parameters.
In this study, we assigned five parameters that determine the total cost of each node:
measured SW latency LSW, measured HW latency LHW, estimated SW energy ESW, estimated HW
energy EHW, and estimated HW area AHW. For each application graph, we divided the graph into
components (nodes), where each component is a set of BBs. The reason behind this grouping of
BBs is to give each node a more accurate measurement or estimate of its cost parameters. At
the BB level, most of the cost parameters cannot be measured or estimated accurately; for
instance, we cannot accurately measure or estimate the execution time, energy consumption,
or HW area because of the fine granularity of many BBs. Eventually, we have a graph for each
application that consists of a set of components, each with five cost parameters. The cost
parameters and the measurement or estimation details are as follows:
• SW latency: we measured the execution time of each component of the algorithms on the
CPU of one HARP node. These experiments were repeated ten times and the average
was taken to improve the accuracy.
• HW latency: we measured the execution time of each component of the algorithms on the
FPGA of a HARP node.
• SW energy: we estimated the energy consumption of each component using the Thermal
Design Power (TDP) of the CPU.
• HW energy: we estimated the energy consumption of the FPGA by breaking it into different
resources and finding the energy of each individual resource. These resources include
Look Up Tables (LUTs), Flip-Flops (FFs), registers, Digital Signal
Processing (DSP) blocks, and Block RAM (BRAM). We reused the measured energy
of these individual resources from the literature [136]–[138]. However, those measurements
assumed different technologies (90 nm and 65 nm) than HARP v2 (20 nm).
Thus, we scaled the energy values to the new technology. To accurately scale energy, we built

Table 5.1: Component costs of the k-means algorithm.

Component  LSW (µs)  LHW (µs)  AHW     ESW (µJ)   EHW (µJ)
C1         2354      100       213600  235400     310
C2         10        3         213600  1000       310
C3         1313939   14154     209328  131393900  250
C4         985455    10615     209328  98545500   270
C5         1642424   17692     222144  164242400  260
C6         656969    7077      170880  65696900   249
C7         1970908   7077      222144  197090800  260

Table 5.2: Component costs of the Canny edge detection algorithm.

Component  LSW (µs)  LHW (µs)  AHW     ESW (µJ)  EHW (µJ)
C1         50        42        170880  5000      350
C2         50        42        170880  5000      350
C3         258       48        209328  25800     360
C4         250       48        209328  25000     350
C5         311       132       213600  31100     280
C6         311       132       213600  31100     280
C7         187       79        209328  18700     280
C8         437       185       222144  43700     280
C9         400       194       222144  40000     280
C10        714       339       226416  71400     420
C11        135       100       209328  13500     320
C12        135       100       209328  13500     320

a Multi-variate Linear Regression (MLR) model to predict a proper scaling factor of
energy among the different nanometer technologies. We trained and tested the MLR
model on a data set generated using the CACTI 6.0 simulator [139], based on cache
simulation for the different technologies supported by CACTI.
• HW area: estimated from the Altera Offline Compiler (AOC) resource utilization report.
Tables 5.1, 5.2, and 5.3 show the components of k-means, Canny, and AES, respectively, and
the cost parameters of each component.
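The TDP-based SW energy estimate amounts to multiplying a power figure by the measured latency. The sketch below assumes a constant 100 W, a value inferred here from the ratio between the ESW and LSW columns of Table 5.1 and not stated in the text; the function and constant names are hypothetical.

```python
# Assumed power figure: Table 5.1 is consistent with ESW = 100 x LSW,
# which corresponds to 100 W. This value is an inference, not a quoted TDP.
ASSUMED_TDP_WATTS = 100.0


def sw_energy_uj(latency_us: float, tdp_watts: float = ASSUMED_TDP_WATTS) -> float:
    """Estimate SW energy in microjoules as power (W) x latency (microseconds)."""
    return tdp_watts * latency_us


# Component C1 of k-means: 2354 µs measured SW latency.
print(sw_energy_uj(2354))
```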

Table 5.3: Component costs of the Advanced Encryption Standard algorithm.

Component  LSW (µs)  LHW (µs)  AHW     ESW (µJ)  EHW (µJ)
C1         6         7         209328  630       300
C2         50        21        128160  525       180
C3         34        14        85440   3400      180
C4         30        11        205056  3000      170
C5         17        6         170880  1700      160
C6         34        20        128160  3400      160
C7         29        13        85440   2900      170
C8         7         9         209328  700       300
C9         6         7         209328  600       300
C10        50        21        128160  5000      180
C11        34        14        85440   3400      180
C12        58        22        128160  5800      170
C13        38        14        85440   3800      160
C14        17        6         205056  1700      160
C15        30        11        209328  3000      170
C16        7         9         209328  700       300

5.2. Artificially Tuned PSO (APSO) Parameters
This section discusses the approach we followed to tune the PSO parameters using machine
learning. It describes the process of generating a data set to train PSO and the
training process itself. The influence of each of these parameters on the PSO algorithm is
as follows:
• Inertia value (w):
Swarm velocity plays a pivotal role in an efficient design space search [130]. Large
velocity values allow the swarm to explore the space globally, whereas small velocity
values achieve a local and tight search. The appropriate velocity values, which ensure a
balance between global and local search, are highly dependent on the complexity of the
search space. Moreover, a gradual change to the velocity is required to ensure swarm
convergence [130]. The inertia weight (w) controls a swarm’s velocity by weighting the
particle’s previous velocity, which contributes to the new velocity in addition to the
terms for the particle’s best-known position pbest and the swarm’s best-known position
gbest. The initial value of w significantly impacts the behavior of the PSO algorithm by
balancing the exploration and exploitation of the search space and averting swarm divergence.
• Cognitive acceleration parameter (c1):
c1 represents the influence of a particle’s best-known position on its velocity. Larger c1
values attract a particle toward its pbest and away from gbest. To attain a certain level of
stochasticity that mimics real swarm movements, c1 is multiplied by a random vector
r1.
• Social acceleration parameter (c2):
Swarm-intelligence-based optimization algorithms rely on information dissemination
among swarm members. In PSO, particles communicate the swarm’s best-known position
and update their velocity to approach this position. c2 is the weight of the attraction
toward the swarm’s best traversed position. c1 and c2 are commonly set to 2, which is
accepted in practical PSO for a wide range of applications [32], [41], [131], [133]. However,
in this work, we found that values other than the aforementioned values improve the
computational efficiency of PSO.
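The roles of w, c1, and c2 can be seen in a single velocity update. The sketch below uses the standard textbook PSO update and, for binary partitioning decisions, a sigmoid transfer function; both are assumptions about the exact form used in Chapter 3, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def update_velocity(v: Sequence[float], x: Sequence[int],
                    pbest: Sequence[int], gbest: Sequence[int],
                    w: float, c1: float, c2: float,
                    rng: random.Random) -> List[float]:
    """Standard PSO update: inertia term + cognitive term + social term."""
    new_v = []
    for vi, xi, pi, gi in zip(v, x, pbest, gbest):
        r1, r2 = rng.random(), rng.random()   # random factors r1 and r2
        new_v.append(w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi))
    return new_v


def update_binary_position(v: Sequence[float], rng: random.Random) -> List[int]:
    """Map each velocity to a 0/1 decision through a sigmoid (assumed here)."""
    position = []
    for vi in v:
        clamped = max(-60.0, min(60.0, vi))   # avoid overflow in exp()
        prob_one = 1.0 / (1.0 + math.exp(-clamped))
        position.append(1 if rng.random() < prob_one else 0)
    return position


rng = random.Random(0)
v = update_velocity([1.0, -2.0], [0, 1], [1, 1], [1, 0], w=0.5, c1=2.0, c2=2.0, rng=rng)
x = update_binary_position(v, rng)
```

With a larger w the previous velocity dominates and the particle keeps exploring; with larger c1 or c2 the pull toward pbest or gbest dominates.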

5.2.1. Generating Data-set for PSO Training
To train the PSO algorithm to tune its acceleration parameters, we generated a data set
for k-means, Canny, and AES. The data set contains 3200 records for each application, for a
total of 9600 records for the three applications. Each record in the data set consists of the
number of for loops and if statements, the number of BBs, the number of branches, the number
of branch mispredictions, and the Cycles Per Instruction (CPI), which were found using the
iperf tool. In addition, it includes the number of Compute Units (CUs), which represents the
replication of the OpenCL kernel in hardware. These features are included because they affect
the cost function. The cost of the resulting partitioning decision and the PSO execution time
were calculated using a range of possible acceleration parameters. For instance, we varied
the inertia weight (w) in the range [0 - 4] with a step size of 0.5 and the acceleration
coefficients (c1 and c2) in the range [0 - 10] with a step size of 0.5. Different values of
the acceleration parameters and the inertia value affect the cost of the PSO solution (gbest)
and its execution time.
In order to tune the PSO parameters, a NN was used to predict these parameters. The NN
consists of two hidden layers with 32 nodes each, as shown in Figure 5.2. We used 1000 epochs
of forward and backward propagation to adjust the weights. The Adaptive Moment Estimation
(Adam) optimizer was used because it combines good features of
other optimizers such as Adadelta and RMSprop [140]. The NN was evaluated using the
Root Mean Squared Error (RMSE) as a loss function.
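The shape of such a network can be sketched in a few lines of plain Python. This is only an illustration of the described architecture (seven inputs, two hidden layers of 32 nodes, one output) and of the RMSE loss; the ReLU activation, the weight initialization, and all names are assumptions, and the Adam training loop is omitted.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]   # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero biases (initialization is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(rng: random.Random, n_inputs: int = 7,
                  hidden: int = 32, n_outputs: int = 1) -> List[Layer]:
    """Two hidden layers of 32 nodes each, as described in the text."""
    return [init_layer(n_inputs, hidden, rng),
            init_layer(hidden, hidden, rng),
            init_layer(hidden, n_outputs, rng)]


def forward(network: Sequence[Layer], features: Sequence[float]) -> List[float]:
    """Forward pass with ReLU on hidden layers (activation is an assumption)."""
    h = list(features)
    for index, (weights, biases) in enumerate(network):
        h = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]
        if index < len(network) - 1:
            h = [max(0.0, value) for value in h]
    return h


def rmse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Root Mean Squared Error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))


net = build_network(random.Random(0))
prediction = forward(net, [1.0] * 7)
```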

5.3. Local Search-based Technique to Mitigate Premature Convergence
A local search technique could be realized in different ways, such as Hill Climbing (HC),
mutation, and randomized walk. The main goal of a local search is to improve the diversity
of the population. This section discusses the local search technique that was used with
GA and PSO. We used a mutation-based local search technique, in which every
candidate solution mutates some of its position values. In order to guarantee that fitter
candidates are generated, we compare the fitness of the newly generated candidate with that
of the original one and keep the fitter of the two. Hence, there might be different
variations of GA with local search (MA) depending on the local search technique. The pseudo
code of MA is illustrated in Algorithm 1. The PSO
algorithm with the local search technique (LPSO) is depicted in Algorithm 2.
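The mutate-compare-keep step described above can be sketched as follows. This is a minimal illustration under stated assumptions: solutions are bit lists, a lower cost means a fitter solution, and a fixed number of randomly chosen bits is flipped per step. The function name and the n_flips parameter are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def local_search(solution: Sequence[int],
                 cost: Callable[[Sequence[int]], float],
                 rng: random.Random,
                 n_flips: int = 1) -> List[int]:
    """Flip n_flips random bits and keep the mutant only if its cost is lower."""
    candidate = list(solution)
    for index in rng.sample(range(len(candidate)), n_flips):
        candidate[index] ^= 1          # mutate one position value
    if cost(candidate) < cost(solution):
        return candidate               # the mutant is fitter: keep it
    return list(solution)              # otherwise keep the original


rng = random.Random(42)
improved = local_search([1, 1, 1], cost=sum, rng=rng)
```

Because the original solution is returned whenever the mutant is not strictly better, applying this step can never increase the cost of a solution.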


Data: population size p, cost function f, problem dimension d
Result: best solution
set Max_iter, crossover rate, mutation type
for i ← 0 to p do
    for j ← 0 to d do
        initialize gene_ij with 0 or 1;
    end
end
iter ← 1;
while iter ≤ Max_iter do
    evaluate candidate solutions fitness using f;
    select parents to mating pool;
    perform crossover operation on parents in mating pool and produce off-springs;
    update the population (parents & new off-springs);
    for j ← 0 to p do
        randomly mutate genes in solution_j;
    end
    evaluate fitness of pop using f;
    MA_pop ← pop;
    for j ← 0 to p do
        do local search for solution_j in MA_pop;
        evaluate fitness of solution_j in MA_pop;
        if cost of solution_j in MA_pop < cost of solution_j in pop then
            solution_j in pop ← solution_j in MA_pop;
        end
    end
    iter ← iter + 1;
end
evaluate the fitness of pop using f;
return the fittest candidate solution;
Algorithm 1: GA with local search mechanism extension (MA) [141].


Data: population size p, cost function f, problem dimension d
Result: best solution
set Max_iter, w, c1, c2
for i ← 0 to p do
    for j ← 0 to d do
        initialize position_ij with 0 or 1;
        initialize velocity_ij;
    end
end
iter ← 1;
while iter ≤ Max_iter do
    evaluate candidate solutions fitness using f;
    for j ← 0 to p do
        update particle_j best position;
        if particle_j_best_position < swarm_best_position then
            swarm_best_position ← particle_j_best_position;
        end
    end
    evaluate fitness of pop using f;
    PSO_best_solutions ← min(pop, p/2);
    LPSO_pop ← PSO_best_solutions;
    for j ← 0 to p do
        do local search for solution_j in LPSO_pop;
        evaluate fitness of solution_j in LPSO_pop;
        if cost of solution_j in LPSO_pop < cost of solution_j in pop then
            solution_j in PSO_best_solutions ← solution_j in LPSO_pop;
        end
    end
    iter ← iter + 1;
end
Algorithm 2: PSO with local search mechanism extension (LPSO).


Figure 5.2: The structure of NN used in training PSO.
[Diagram: input layer with seven inputs (x1 to x7), two hidden layers of 32 nodes each, and an output layer with a single output y.]

5.4. Results and Discussion
This section presents and discusses the results of the different partitioning algorithms,
namely GA, MA, PSO, APSO, and LPSO, in terms of the partitioning cost and the partitioning
latency. The experiments were conducted using different numbers of iterations (10, 30, 60)
and ten different population sizes (10 to 100 with a step of 10). Every single point in the
figures is an average of ten runs. The average was taken to mitigate the effect of the initial
randomized population, the randomized nature of these algorithms, and the variation in the
level of randomness used in each algorithm. The randomization makes these algorithms vary
in their convergence speed and accuracy. The experimental parameters of GA and PSO are
shown in Table 5.4.
Figures 5.3, 5.4, and 5.5 show the partitioning cost of the k-means, Canny, and AES
Table 5.4: PSO and GA experimental parameters.

Parameter                      Value
PSO
Population size                10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Particle size                  7, 12, 16
Maximum number of iterations   10, 30, 60
Cognitive value                2
Social value                   2
Inertia weight                 1 - 0.3
GA
Population size                10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Chromosome dimensions          7, 12, 16
Maximum number of iterations   10, 30, 60
Mating pool size               3, 6, 8
Crossover rate                 0.5
Mutation                       single point

benchmarks using different population sizes and 10, 30, and 60 iterations. These results
are shown in.1. For the k-means benchmark, APSO outperforms PSO by up to 34.5% and
decreases the cost by 6.1% on average. This is because the APSO parameters are optimized
for each benchmark in a way that guides the swarm to a more accurate solution; the
tuned values of w, c1, and c2 for k-means are 0.5, 1, and 0.5, respectively. LPSO also
outperforms PSO by up to 34.5% and decreases the cost by 6.7% on average, since LPSO conducts
a more extensive greedy search than PSO in the neighborhood of the leading particles in the
swarm. On the other hand, MA improves the fitness of the solution compared to GA by up to
82.6% and decreases the cost by 64.3% on average. This is due to the ability of MA to improve
the diversity of the population and replace some solutions in the population with lower-cost
solutions. Moreover, GA gives lower-cost partitioning when the population size is ten.
For the same number of iterations, increasing the size of the GA population affects the
quality of the solution negatively. The reason is that a larger population needs more
iterations to reach a better solution. Hence, GA with a large population gives better results
at a larger number of iterations, as shown in the figure. As also
shown, for population sizes less than 70, increasing the number of iterations does not improve
the cost.
For the Canny benchmark, APSO outperforms PSO by up to 62.9% and decreases the cost by
30.4% on average; it gives more accurate results for all population sizes. LPSO
also outperforms PSO by up to 55.4% and decreases the cost by 25.5% on average. On the
other hand, MA improves the cost of the solution by up to 26.3%, and by 18.7% on average,
compared to GA. For the AES benchmark, APSO reduces the cost of PSO by up to 17.6%, and
by 4.4% on average. LPSO also reduces the cost of PSO by up to 23.5%, and by 5.1% on
average. However, these algorithms give the same level of performance when the population
size is larger than 60. On the other hand, MA improves the solution cost compared to GA
by up to 40%, and by 33% on average, and gives better results for all population sizes.
Generally speaking, the performance of PSO, APSO, and LPSO improves as the population size
increases. However, since there is a high level of stochasticity
in these algorithms, no guarantees can be given for continuous improvement in performance
when increasing the population size beyond a certain limit that varies among the different
benchmarks. As these algorithms are randomized, we cannot draw a systematic conclusion
about the effect of increasing the population size and the number of iterations on their
performance. The random behavior of these algorithms comes from the initial random
population and the use of random vectors in their calculations, such as r1 and r2 in PSO and
its variations. In addition, the degree of randomness in GA is even higher than in PSO, since
the genetic operators used by GA, such as the crossover operation, are highly randomized.

Figure 5.3: Partitioning cost of GA, MA, PSO, APSO, and LPSO using 10,
30, and 60 iterations and different sizes of population for the k-means algorithm.
[Plot: three panels (10, 30, and 60 iterations); cost versus population size (10 to 100); series: PSO, APSO, LPSO, GA, MA.]

Figure 5.4: Partitioning cost of GA, MA, PSO, APSO, and LPSO using 10,
30, and 60 iterations and different sizes of population for the Canny algorithm.
[Plot: three panels (10, 30, and 60 iterations); cost versus population size (10 to 100); series: PSO, APSO, LPSO, GA, MA.]

Figure 5.5: Partitioning cost of GA, MA, PSO, APSO, and LPSO using 10,
30, and 60 iterations and different sizes of population for the AES algorithm.
[Plot: three panels (10, 30, and 60 iterations); cost versus population size (10 to 100); series: PSO, APSO, LPSO, GA, MA.]

Figures 5.6, 5.7, and 5.8 show the latency of the partitioning algorithms for the k-means,
Canny, and AES benchmarks. The APSO algorithm has the lowest latency among all the algorithms
for all numbers of iterations and population sizes. This is because APSO uses
intelligently tuned acceleration parameters. APSO is faster than PSO by up to 29%.
On the other hand, PSO has lower latency than GA, MA, and LPSO. This is due to the
slow genetic operators that are used in GA and MA and the local search extensions used
in MA and LPSO. As noticed, increasing the population size increases the latency of the
partitioning algorithms, as this requires more computation and communication among the
population’s individuals during each iteration. Also, increasing the number of iterations
increases the latency of the algorithm significantly.
Table 5.5 compares the partitioning cost of GA, MA, PSO, LPSO, APSO, and the Exhaustive
Search (ES) algorithm. The ES algorithm explores the entire design space and finds the
optimal partitioning decision.
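An exhaustive search over binary partitioning decisions can be sketched as below. The encoding (one bit per component) follows the 0/1 initialization in the pseudo code, but the function name and the toy cost function are assumptions for illustration. With one bit per component, the dimensions 7, 12, and 16 listed in Table 5.4 correspond to 128, 4096, and 65536 candidate partitions.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def exhaustive_search(cost: Callable[[Sequence[int]], float],
                      dimension: int) -> Tuple[Tuple[int, ...], float]:
    """Evaluate all 2**dimension bit vectors and return the cheapest one."""
    best_bits: Tuple[int, ...] = ()
    best_cost = float("inf")
    for bits in product((0, 1), repeat=dimension):
        current = cost(bits)
        if current < best_cost:
            best_bits, best_cost = bits, current
    return best_bits, best_cost


# Toy cost function (hypothetical): distance from a known target assignment.
target = (1, 0, 1)
best, value = exhaustive_search(
    lambda bits: sum((b - t) ** 2 for b, t in zip(bits, target)), dimension=3)
```

The search is exact but its run time doubles with every added component, which is why it serves here only as a reference for the heuristic algorithms.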
Table 5.5: Comparison of the partitioning cost among heuristic algorithms
and ES.

Partitioning  Partitioning cost  Partitioning cost  Partitioning cost
algorithm     of k-means         of Canny           of AES
GA            214707113          481758             425305
MA            72575756           412143             288333
PSO           451267             43819              196898
APSO          438438             26942              192118
LPSO          432023             29310              197110
ES            432023             24600              192118

Figure 5.6: Partitioning latency of GA, MA, PSO, APSO, and LPSO using 10,
30, and 60 iterations and different sizes of population for the k-means algorithm.
[Plot: three panels (10, 30, and 60 iterations); latency (ms) versus population size (10 to 100); series: GA, PSO, APSO, MA, LPSO.]

Figure 5.7: Partitioning latency of GA, MA, PSO, APSO, and LPSO using
10, 30, and 60 iterations and different sizes of population for the Canny algorithm.
[Plot: three panels, one per iteration count; latency (ms) versus population size (10 to 100); series: GA, PSO, APSO, MA, LPSO.]

Figure 5.8: Partitioning latency of GA, MA, PSO, APSO, and LPSO using
10, 30, and 60 iterations and different sizes of population for the AES algorithm.
[Plot: three panels (10, 30, and 60 iterations); latency (ms) versus population size (10 to 100); series: GA, PSO, APSO, MA, LPSO.]

Figures 5.9, 5.10, and 5.11 illustrate the execution time and the energy consumption of
the partitioning solutions for the k-means, Canny, and AES benchmarks, respectively. These
solutions capture the behavior of the different algorithms before they converge to the same point.
We used the MOO utility function described in Chapter 3, which is a weighted
sum of execution time, energy consumption, and HW area. Hence, the
solution with the lowest total cost does not necessarily have the lowest execution time or the
lowest energy consumption. For the k-means benchmark, the SW execution time of the components
is very large compared to their HW area values, so the partitioning algorithm moves more
components to HW. As a result, solutions with a lower total cost also have lower execution
time and energy consumption. In the Canny and AES benchmarks, on the other hand, the SW
execution time is small compared to the HW area, so the partitioning algorithm
assigns more components to SW to reduce the total cost. In this scenario, a solution with
a lower total cost might have a larger execution time and energy consumption.
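The effect described above can be reproduced with a small numeric sketch. The energy and area weights (0.6 and 0.3) are the values quoted in Chapter 6; the time weight of 1.0, the function name `total_cost`, and the two candidate solutions are assumptions made only for illustration, and the real utility function may normalize its terms differently.

```python
def total_cost(time_ms, energy_mj, hw_area, w_time=1.0, w_energy=0.6, w_area=0.3):
    """Weighted-sum utility: lower is better (w_time is an assumed value)."""
    return w_time * time_ms + w_energy * energy_mj + w_area * hw_area


# Hypothetical candidate solutions: (execution time, energy, HW area).
hw_heavy = (10.0, 20.0, 200.0)  # more components on the FPGA
sw_heavy = (30.0, 40.0, 50.0)   # more components on the CPU

cost_hw = total_cost(*hw_heavy)
cost_sw = total_cost(*sw_heavy)

# The SW-heavy candidate is slower and uses more energy, yet its total cost is
# lower because it needs far less HW area.
print(cost_hw, cost_sw)
```

With these made-up numbers the area term dominates, which mirrors the Canny and AES behavior described in the text.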
Figure 5.12 compares the performance of the partitioning algorithms across the three benchmarks.
For each population size, it shows the average over the three iteration counts (10, 30, and
60) on a log scale. It shows that the relative performance of the different algorithms is
the same for the three benchmarks. It also shows that some algorithms perform better on
particular benchmarks than on others. For instance, APSO delivers the best
performance for the Canny benchmark. In combination with Figure 5.13, we can conclude
that increasing the number of iterations is not always beneficial, at least for benchmarks of
these sizes. However, when increasing the population size, we need to increase the number of
iterations to improve performance. To compare a solution's estimated cost and
execution time against actual measurements on hardware, we implemented two solutions
on HARP and measured their actual execution time. The measured execution time shows an
overhead of 3% and 2.9% for the two experiments, which is due to communication.

[Plot: k-means; execution time (ms) and energy (mJ) per partitioning algorithm (GA, MA, PSO, APSO, LPSO); log-scale axes.]

Figure 5.9: Execution time and energy consumption of GA, MA, PSO, APSO,
and LPSO using k-means algorithm.

[Plot: Canny; execution time (ms) and energy (mJ) per partitioning algorithm (GA, MA, PSO, APSO, LPSO); log-scale axes.]

Figure 5.10: Execution time and energy consumption of GA, MA, PSO, APSO,
and LPSO using Canny algorithm.

[Plot: AES; execution time (ms) and energy (mJ) per partitioning algorithm (GA, MA, PSO, APSO, LPSO); log-scale axes.]

Figure 5.11: Execution time and energy consumption of GA, MA, PSO, APSO,
and LPSO using AES algorithm.


Figure 5.12: Partitioning cost of GA, MA, PSO, APSO, and LPSO using the average
of 10, 30, and 60 iterations and different sizes of population for k-means,
Canny, and AES algorithms.

[Plot: cost (log scale) versus population size (10-100) for GA, MA, PSO, APSO, and LPSO; panels for K-means, Canny, and AES.]

Figure 5.13: Partitioning cost of GA, MA, PSO, APSO, and LPSO using 10,
30, and 60 iterations and the average of ten different populations for k-means,
Canny, and AES algorithms.

[Plot: cost (log scale) versus number of iterations (10, 30, 60) for GA, MA, PSO, APSO, and LPSO; panels for k-means, Canny, and AES.]

CHAPTER 6
CODE AND DATA PARTITIONING
In HW/SW partitioning as considered so far, each node in the application graph is assigned
entirely to either the CPU or the FPGA. In addition, the CPU and the FPGA in emerging
CPU-FPGA architectures are more tightly integrated than before, which allows faster communication
between them. Moreover, the data-parallel nature of many cloud applications makes it possible
for both the CPU and the FPGA to execute the same task on
different data sets simultaneously. In this chapter, we propose and implement a variation of
the PSO algorithm that partitions the code and the data of an application graph between
the CPU and the FPGA by assigning some nodes to both devices with different data sets.
Nodes that are assigned to both devices must be compute-intensive with a high level
of data parallelism.

6.1. Code-Data Partitioning PSO (CDPSO)
A vast range of cloud applications, such as image and video processing applications, are
data-parallel in nature. In data-parallel programming, the user can distribute the application
data set among many homogeneous or heterogeneous processors, all of which
perform the same task on different portions of the data concurrently. This
approach improves the execution time and makes more efficient use of system
resources [142]. Our variation of the PSO algorithm
partitions the nodes of an application graph into three categories:
nodes that are assigned to the CPU, nodes that are assigned to the FPGA, and nodes that are
assigned to both the CPU and the FPGA at the same time. For nodes that are assigned to both the
CPU and the FPGA, the data is distributed between the two devices using the approach
followed in Chapter 4.

Figure 6.1: HW/SW partitioning using PSO (upper) and CDPSO (lower).

In the partitioning algorithms, 0 indicates that a node is assigned to the CPU,
while 1 indicates that it is assigned to the FPGA. In code and data partitioning, we use
2 to indicate that a node is assigned to both the CPU and the FPGA. We used the Canny
graph to demonstrate the superiority of CDPSO over PSO because this algorithm has
many data-parallel functions that can be assigned to the CPU and the FPGA at the
same time with different portions of the data. For the other benchmarks, however, nodes
cannot be duplicated on the CPU and the FPGA at the same time due to fine-grained data
dependencies. Figure 6.1 shows the difference between a candidate partitioning solution of
PSO and one of CDPSO. The total cost of the PSO solution is 476,152, while the cost of the
CDPSO solution is 345,773, an improvement in total cost of 27%. The reason behind this
reduction is that CDPSO duplicates the data-parallel nodes on the CPU
and the FPGA. Duplicating a task on both devices significantly reduces the execution time,
which has the highest weight in our cost function. The pseudocode of CDPSO
is given in Algorithm 3. The ratio at which the data is partitioned between the CPU
and the FPGA is platform dependent: it varies with the performance characteristics of the CPU
and the FPGA, since the ratio of their execution times
might differ among platforms. The weights $w_{energy}$ and $w_{area}$ are 0.6 and 0.3,
respectively, as discussed in Chapter 3.
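As a quick check of the improvement reported for the Figure 6.1 example, the relative reduction in total cost follows directly from the two costs quoted in the text:

$$\frac{476{,}152 - 345{,}773}{476{,}152} = \frac{130{,}379}{476{,}152} \approx 0.274$$

which corresponds to the reported improvement of about 27%.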


Data: swarm size s, application graph g
Result: the best solution of partitioning a workload between SW and HW at both code
and data levels

set Max_iter, w, c1, c2
formulate a cost function:
    $f = \max(w_{CPU} \cdot SW_{L},\ w_{FPGA} \cdot HW_{L}) + w_{energy}\,(w_{CPU} \cdot SW_{energy} + w_{FPGA} \cdot HW_{energy}) + w_{area} \cdot HW_{area}$
for i ← 1 to s do
    initialize particle_i positions with 0, 1, and 2
    initialize particle_i velocities
end
iter ← 1
while iter ≤ Max_iter do
    evaluate candidate solutions' fitness using f
    for j ← 1 to s do
        update particle_j best position
        if f(particle_j_best_position) < f(swarm_best_position) then
            swarm_best_position ← particle_j_best_position
        end
        update particle_j positions
        update particle_j velocities
    end
    iter ← iter + 1
end

Algorithm 3: Code and Data partitioning PSO (CDPSO).
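The following is a minimal, runnable sketch of the idea behind Algorithm 3, not the dissertation's implementation. Several parts are assumptions made for illustration: the node estimates in `NODES` are invented, the total cost is taken as a plain sum of per-node costs (ignoring communication and dependencies), continuous particle coordinates are rounded to the labels 0, 1, and 2, and the common velocity-then-position update is used. Only the 1/3 and 2/3 data split and the weights 0.6 and 0.3 come from the text.

```python
import random

# Hypothetical per-node estimates, not taken from the dissertation:
# (sw_time, hw_time, sw_energy, hw_energy, hw_area, is_data_parallel)
NODES = [
    (60.0, 30.0, 30.0, 12.0, 20.0, True),
    (5.0, 4.0, 3.0, 2.0, 80.0, False),
    (90.0, 45.0, 40.0, 20.0, 30.0, True),
    (8.0, 6.0, 5.0, 4.0, 90.0, False),
]
W_CPU, W_FPGA = 1 / 3, 2 / 3   # data split quoted in the text
W_ENERGY, W_AREA = 0.6, 0.3    # weights quoted in the text


def node_cost(node, label):
    sw_t, hw_t, sw_e, hw_e, area, _ = node
    if label == 0:  # CPU only (assumed form: time plus weighted energy)
        return sw_t + W_ENERGY * sw_e
    if label == 1:  # FPGA only (assumed form: time, energy, and area)
        return hw_t + W_ENERGY * hw_e + W_AREA * area
    # label == 2: both devices, each on its data share (form of f in Algorithm 3)
    return (max(W_CPU * sw_t, W_FPGA * hw_t)
            + W_ENERGY * (W_CPU * sw_e + W_FPGA * hw_e)
            + W_AREA * area)


def cost(labels):
    # Simplification: total cost is the plain sum of node costs.
    return sum(node_cost(n, l) for n, l in zip(NODES, labels))


def decode(position):
    # Assumption: coordinates in [0, 2] are rounded to a label; nodes that are
    # not data-parallel fall back from 2 to FPGA-only (1).
    labels = []
    for node, x in zip(NODES, position):
        label = int(round(x))
        labels.append(1 if label == 2 and not node[5] else label)
    return labels


def cdpso(swarm_size=10, max_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dims = len(NODES)
    pos = [[rng.uniform(0, 2) for _ in range(dims)] for _ in range(swarm_size)]
    vel = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(decode(p)) for p in pos]
    best = min(range(swarm_size), key=lambda j: pbest_cost[j])
    gbest, gbest_cost = pbest[best][:], pbest_cost[best]
    for _ in range(max_iter):
        for j in range(swarm_size):
            for d in range(dims):
                r1, r2 = rng.random(), rng.random()
                vel[j][d] = (w * vel[j][d]
                             + c1 * r1 * (pbest[j][d] - pos[j][d])
                             + c2 * r2 * (gbest[d] - pos[j][d]))
                # Clamp so that decoding always yields 0, 1, or 2.
                pos[j][d] = min(2.0, max(0.0, pos[j][d] + vel[j][d]))
            c = cost(decode(pos[j]))
            if c < pbest_cost[j]:
                pbest[j], pbest_cost[j] = pos[j][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[j][:], c
    return decode(gbest), gbest_cost


labels, total = cdpso()
print(labels, round(total, 1))
```

The returned list uses the same encoding as the text: 0 for CPU, 1 for FPGA, and 2 for a node duplicated on both devices. Restricting label 2 to user-marked data-parallel nodes is handled in `decode` here; a real implementation could equally enforce it in the position update.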
To realize this partitioning technique, we made the following assumptions, which we
derived from our previous work [104]. First, if a node or a task can be partitioned
between the CPU and the FPGA, the CPU is assigned one third ($w_{CPU}$) of the data set
and the FPGA is assigned two thirds ($w_{FPGA}$). We calculated this ratio using the CPU and
FPGA weights equation in Chapter 4, page 48. Second, the execution time and energy
consumption are estimated for the assigned data portion only, as shown in the cost
function in Algorithm 3. Third, the user indicates the data-parallel nodes
of the application. Fourth, for the data set of a node to be distributed between the CPU and the
FPGA, the total cost of the duplicated node must be lower than both
its total cost on the CPU only and its total cost on the FPGA only. The
partitioning algorithm therefore compares these three values before duplicating a data-parallel node,
as shown in relation 6.1.

$$\text{total cost of the duplicated node} < \min\{\text{total cost of the node on CPU},\ \text{total cost of the node on FPGA}\} \tag{6.1}$$
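Relation 6.1 can be checked per node with a few lines of code. The sketch below is illustrative only: the helper names and the two example nodes are hypothetical, and the single-device cost forms (time plus weighted energy, plus weighted area on the FPGA) are an assumption modeled on the cost function in Algorithm 3. The 1/3 and 2/3 split and the weights 0.6 and 0.3 are the values given in the text.

```python
# Data split and weights quoted in the text; the cost form mirrors Algorithm 3.
W_CPU, W_FPGA = 1 / 3, 2 / 3
W_ENERGY, W_AREA = 0.6, 0.3


def cost_cpu_only(sw_time, sw_energy):
    # Assumption: a CPU-only node pays time plus weighted energy, no HW area.
    return sw_time + W_ENERGY * sw_energy


def cost_fpga_only(hw_time, hw_energy, hw_area):
    # Assumption: an FPGA-only node pays time, weighted energy, and weighted area.
    return hw_time + W_ENERGY * hw_energy + W_AREA * hw_area


def cost_duplicated(sw_time, sw_energy, hw_time, hw_energy, hw_area):
    # Both devices run concurrently on their share of the data, so the slower
    # share sets the latency; energy is the sum of both shares.
    latency = max(W_CPU * sw_time, W_FPGA * hw_time)
    energy = W_CPU * sw_energy + W_FPGA * hw_energy
    return latency + W_ENERGY * energy + W_AREA * hw_area


def should_duplicate(sw_time, sw_energy, hw_time, hw_energy, hw_area):
    """Relation 6.1: duplicate only if it beats both single-device costs."""
    dup = cost_duplicated(sw_time, sw_energy, hw_time, hw_energy, hw_area)
    return dup < min(cost_cpu_only(sw_time, sw_energy),
                     cost_fpga_only(hw_time, hw_energy, hw_area))


# Hypothetical node estimates (not taken from the dissertation).
balanced = dict(sw_time=60.0, sw_energy=30.0, hw_time=30.0, hw_energy=12.0, hw_area=20.0)
fpga_friendly = dict(sw_time=40.0, sw_energy=30.0, hw_time=10.0, hw_energy=8.0, hw_area=50.0)

print(should_duplicate(**balanced))       # duplication wins for this node
print(should_duplicate(**fpga_friendly))  # FPGA-only is cheaper for this node
```

The second node shows why the comparison matters: with a fixed 1/3 and 2/3 split, a node that runs much faster on the FPGA can end up waiting on the CPU share, so duplication does not pay off.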

6.2. Results and Discussion
This section discusses the results of CDPSO and compares them to those of GA, MA, PSO, APSO,
and LPSO. Figure 6.2 compares the average performance of CDPSO to that of the other partitioning
algorithms for population sizes from 10 to 100, in steps of 10, with 60
iterations. The results demonstrate the superiority of CDPSO over GA, MA, and PSO.
This is due to the CDPSO task duplication mechanism: duplicating data-parallel nodes on the
CPU and the FPGA reduces the total cost of the partitioning solution. The total cost
improvement of CDPSO over PSO reaches up to 33%. The partitioning algorithms aim at minimizing
the utility function, which reflects the total cost. A reduction in the total cost does not
always mean a reduction in execution time because of the trade-off between the
execution time and the HW area. PSO and CDPSO give the same cost when the
population size is larger than 70. This is because increasing the number of individuals in a
population makes the algorithm converge to a sub-optimal solution.
Figure 6.3 compares the execution time and energy consumption of a partitioning solution
among the partitioning algorithms. It shows that CDPSO outperforms PSO and its
variations in terms of execution time. This is because duplicating a node has its largest
effect on the execution time. CDPSO also has a lower energy consumption than the PSO
algorithm and its variations.

[Plot: Canny; partitioning cost versus population size (10-100); series PSO, APSO, LPSO, CDPSO, GA, and MA; two cost axes.]

Figure 6.2: HW/SW partitioning cost of GA, MA, PSO, APSO, LPSO, and
CDPSO.

[Plot: execution time (ms) and energy (mJ) per partitioning algorithm (GA, MA, PSO, APSO, LPSO, CDPSO); log-scale axes.]

Figure 6.3: Execution time and energy consumption of GA, MA, PSO, APSO,
LPSO, and CDPSO.


CHAPTER 7
CONCLUSION AND FUTURE WORK
This chapter concludes and summarizes the dissertation. It also provides insights into
future work.

7.1. Summary and Conclusion
Heterogeneous CPU-FPGA platforms have recently been deployed in data centers and
cloud environments to improve the performance and reduce the energy consumption of applications.
Many of these platforms have tighter integration between the CPU and the FPGA than
the traditional off-chip PCIe-based FPGA accelerator [143]. The conventional approach of offloading
the entire kernel to the FPGA is not always efficient since it leaves the CPU underutilized.
Moreover, for cloud-scale workloads, the consumption of logic resources exceeds the capacity
of an FPGA chip. Hence, an efficient technique for partitioning the workload between the CPU
and the FPGA is crucial to improve system utilization and performance and to reduce
energy consumption. In this study, we proposed and implemented synergistic partitioning
techniques for a compute-intensive image processing algorithm. We partitioned an OpenCL-based
implementation of the Canny algorithm between the CPU and the FPGA using two mechanisms:
code partitioning and data partitioning. In the code partitioning technique, the code was
partitioned between the CPU and the FPGA, which resulted in up to 2.3X speedup compared
to a CPU-only implementation. In the data partitioning technique, both the CPU and the
FPGA executed all the stages of the algorithm on different portions of the data. The data
partitioning technique outperforms the CPU-only and FPGA-only implementations by up
to 4.8X and 2.1X, respectively. It also results in a 55.3% reduction in energy consumption, on
average, compared to the CPU-only implementation. In addition, its energy-delay product
is comparable to that of the FPGA-only implementation.
Nature-inspired optimization algorithms such as the Genetic Algorithm (GA) and Particle
Swarm Optimization (PSO) have been used for HW/SW partitioning. In this study, we tackled
two limitations of these algorithms: the fixed values of the PSO parameters and the
premature convergence of GA and PSO. The accuracy and execution time of the PSO algorithm
depend on the parameters it uses in its search equations. The appropriate values of these
parameters might vary from one application to another. However, researchers and practitioners
have used generally accepted parameter values. In this study, we tuned these parameters using
a Neural Network (NN) in order to improve the performance of PSO. The dataset
used to train the NN was generated by executing PSO on our benchmark graphs. The
features reflect source-level and IR-level characteristics of the benchmarks, in addition to
platform characteristics. This technique improves the accuracy of PSO by 62.8% and its
execution time by up to 29%.
We also proposed and implemented a distributed greedy local search technique to mitigate
the premature convergence problem. This technique mutates individuals of poor quality
and replaces them with superior individuals. We applied a mutation-based local search
technique to GA and PSO. Although GA is a robust optimization algorithm, it suffers extensively
from premature convergence. GA with a mutation-based search, also known as the Memetic
Algorithm (MA), reduces the premature convergence problem. Our variation of MA outperforms
GA by 82.6% of the total cost. The extended PSO algorithm with the mutation-based search
technique outperforms PSO by up to 55.4% of the total cost.
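The mutate-and-replace idea summarized above can be sketched in a few lines. This is an illustration under stated assumptions rather than the dissertation's implementation: the function name `greedy_mutation_search`, the binary encoding, the bit-flip mutation, the fraction of individuals treated as poor, and the toy cost function are all hypothetical.

```python
import random


def greedy_mutation_search(population, cost, worst_fraction=0.3, flip_prob=0.2, rng=None):
    """Mutate the worst individuals and keep a mutant only if it lowers the cost."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=cost)  # best (lowest cost) first
    n_worst = max(1, int(len(ranked) * worst_fraction))
    keep, worst = ranked[:-n_worst], ranked[-n_worst:]
    refined = []
    for individual in worst:
        # Bit-flip mutation on a 0/1 (CPU/FPGA) assignment vector.
        mutant = [1 - bit if rng.random() < flip_prob else bit for bit in individual]
        # Greedy acceptance: the mutant replaces the individual only if superior.
        refined.append(mutant if cost(mutant) < cost(individual) else individual)
    return keep + refined


def toy_cost(individual):
    # Toy cost for demonstration only: number of nodes placed on the FPGA.
    return sum(individual)


demo_rng = random.Random(1)
population = [[demo_rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
refined_population = greedy_mutation_search(population, toy_cost, rng=random.Random(2))
print(sum(map(toy_cost, population)), sum(map(toy_cost, refined_population)))
```

Because a mutant is accepted only when it improves on the individual it replaces, a pass of this kind can never make the population worse, which is what makes it usable as a local search step inside GA or PSO.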
Finally, to boost the system performance, we proposed a prototype for a concurrent
code and data partitioning technique with a simple implementation. Using this technique, we
duplicated data-parallel nodes of the application graph on the CPU and the FPGA at the
same time. The Code and Data partitioning PSO (CDPSO) outperforms the PSO algorithm
by up to 33% of the total cost. We used a Multi-Objective Optimization (MOO) utility
function that consists of a weighted sum of execution time, energy consumption, and HW
resource utilization.

7.2. Future Works
This work can be extended as follows:
• Include more benchmarks to cover disparate cloud application domains.
• Compare our results with other heuristic partitioning algorithms such as Grey Wolf
Optimizer (GWO).
• Study different heterogeneous CPU-FPGA and CPU-GPU platforms.
• Build a model to predict the best number of iterations and the best population size
for different benchmarks.
• Utilize different heuristic techniques to mitigate the premature convergence problem.


BIBLIOGRAPHY
[1] F. Chen, Y. Shan, Y. Zhang, Y. Wang, H. Franke, X. Chang, and K. Wang, “Enabling fpgas in the cloud”, in Proceedings of the 11th ACM Conference on Computing
Frontiers, 2014, p. 3.
[2] Y. Pu, J. Peng, L. Huang, and J. Chen, “An efficient knn algorithm implemented on
fpga based heterogeneous computing system using opencl”, IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 167–
170, 2015.
[3] The 27th IEEE International Symposium on Field-Programmable Custom Computing Machines, en-US. [Online]. Available: https://www.fccm.org/ (visited on
02/09/2019).
[4] F. Milliot, When Are FPGAs the Right Choice to Improve HPC Performance, en-US.
[Online]. Available: http://tinyurl.com/m6q4yvl (visited on 02/06/2018).
[5] Microsoft Uses Intel FPGAs for Smarter Bing Searches. [Online]. Available:
https://www.eweek.com/cloud/microsoft-uses-intel-fpgas-for-smarter-bing-searches (visited on 02/24/2019).
[6] A. Shaout, A. H. El-Mousa, and K. Mattar, “Specification and modeling of hw/sw
co-design for heterogeneous embedded systems”, in Proceedings of the World Congress
on Engineering, vol. 1, 2009.
[7] I. Mhadhbi, S. B. Othman, and S. B. Saoud, “A comprehensive survey on hardware/software partitioning process in co-design”, International Journal of Computer
Science and Information Security, vol. 14, no. 3, p. 263, 2016.
[8] A. Bhattacharya, A. Konar, S. Das, C. Grosan, and A. Abraham, “Hardware software
partitioning problem in embedded system design using particle swarm optimization
algorithm”, in Complex, Intelligent and Software Intensive Systems, CISIS International Conference on, 2008, pp. 171–176.
[9] P. Liu, J. Wu, and Y. Wang, “Hybrid algorithms for hardware/software partitioning
and scheduling on reconfigurable devices”, Mathematical and Computer Modelling,
vol. 58, no. 1-2, pp. 409–420, 2013.
[10] I. Ahmad, M. K. Dhodhi, and I. Ahmad, “Multiprocessor scheduling by simulated
evolution”, Journal of Software, vol. 5, no. 10, pp. 1128–1136, 2010.
[11] Y.-K. Kwok and I. Ahmad, “Benchmarking and comparison of the task graph scheduling algorithms”, Journal of Parallel and Distributed Computing, vol. 59, no. 3, pp. 381–
422, 1999.


[12] H. Topcuoglu, S. Hariri, and M.-y. Wu, “Performance-effective and low-complexity
task scheduling for heterogeneous computing”, IEEE transactions on parallel and
distributed systems, vol. 13, no. 3, pp. 260–274, 2002.
[13] E. Sha, L. Wang, Q. Zhuge, J. Zhang, and J. Liu, “Power efficiency for hardware/software partitioning with time and area constraints on mpsoc”, International Journal of
Parallel Programming, vol. 43, no. 3, pp. 381–402, 2015.
[14] J. Wu, P. Wang, S.-K. Lam, and T. Srikanthan, “Efficient heuristic and tabu search
for hardware/software partitioning”, The Journal of Supercomputing, vol. 66, no. 1,
pp. 118–134, 2013.
[15] P. Arató, Z. Á. Mann, and A. Orbán, “Algorithmic aspects of hardware/software
partitioning”, ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 10, no. 1, pp. 136–156, 2005.
[16] L. Li, J. Sun, W. Li, Z. Lv, and F. Guan, “Hardware/software partitioning based on
hybrid genetic and tabu search in the dynamically reconfigurable system”, International Journal of Control and Automation, vol. 8, no. 1, pp. 29–36, 2015.
[17] P. K. Nath and D. Datta, “Multi-objective hardware–software partitioning of embedded systems: A case study of jpeg encoder”, Applied Soft Computing, vol. 15, pp. 30–
41, 2014.
[18] K. S. Khouri, G. Lakshminarayana, and N. K. Jha, “Fast high-level power estimation
for control-flow intensive design”, in Proceedings of the international symposium on
Low power electronics and design, 1998, pp. 299–304.
[19] A. Baghdadi, N.-E. Zergainoh, W. O. Cesario, and A. A. Jerraya, “Combining a performance estimation methodology with a hardware/software codesign flow supporting
multiprocessor systems”, IEEE Transactions on Software Engineering, vol. 28, no. 9,
pp. 822–831, 2002.
[20] S. Shekhar, Y. Barve, and A. Gokhale, “Understanding performance interference
benchmarking and application profiling techniques for cloud-hosted latency-sensitive
applications”, in Proceedings of the 10th International Conference on Utility and
Cloud Computing, 2017, pp. 187–188.
[21] Altera SDK for OpenCL Best Practices Guide. [Online]. Available:
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf.
[22] V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: A first
step towards software power minimization”, IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 2, no. 4, pp. 437–445, 1994.
[23] J. Liu, E. Ahmed, M. Shiraz, A. Gani, R. Buyya, and A. Qureshi, “Application
partitioning algorithms in mobile cloud computing: Taxonomy, review and future
directions”, Journal of Network and Computer Applications, vol. 48, pp. 99–117, 2015.

[24] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, “System level hardware/software
partitioning based on simulated annealing and tabu search”, Design automation for
embedded systems, vol. 2, no. 1, pp. 5–32, 1997.
[25] P. Arató, S. Juhász, Z. Á. Mann, A. Orbán, and D. Papp, “Hardware-software partitioning in embedded system design”, in Intelligent Signal Processing, IEEE International Symposium on, 2003, pp. 197–202.
[26] R. Niemann and P. Marwedel, “Hardware/software partitioning using integer programming”, in Proceedings of the European conference on Design and Test, 1996,
p. 473.
[27] ——, “An algorithm for hardware/software partitioning using mixed integer linear
programming”, Design Automation for Embedded Systems, vol. 2, no. 2, pp. 165–193,
1997.
[28] S.-R. Kuang, C.-Y. Chen, and R.-Z. Liao, “Partitioning and pipelined scheduling
of embedded system using integer linear programming”, in Parallel and Distributed
Systems, Proceedings 11th International Conference on, vol. 2, 2005, pp. 37–41.
[29] W. Zuo, L.-N. Pouchet, A. Ayupov, T. Kim, C.-W. Lin, S. Shiraishi, and D. Chen, “Accurate high-level modeling and automated hardware/software co-design for effective
soc design space exploration”, in Proceedings of the 54th Annual Design Automation
Conference, 2017, pp. 78–88.
[30] S.-A. Li, C.-C. Hsu, C.-C. Wong, and C.-J. Yu, “Hardware/software co-design for particle swarm optimization algorithm”, Information Sciences, vol. 181, no. 20, pp. 4582–
4596, 2011.
[31] X. Yan, F. He, N. Hou, and H. Ai, “An efficient particle swarm optimization for
large-scale hardware/software co-design system”, International Journal of Cooperative
Information Systems, p. 1741001, 2017.
[32] X.-H. Yan, F.-Z. He, and Y.-L. Chen, “A novel hardware/software partitioning method
based on position disturbed particle swarm optimization with invasive weed optimization”, Journal of Computer Science and Technology, vol. 32, no. 2, pp. 340–355, 2017.
[33] Y. M. Tavares, N. Nedjah, and L. de Macedo Mourelle, “Hardware/software co-design
system for template matching using particle swarm optimization and pearson’s correlation coefficient”, in Computational Intelligence (LA-CCI), IEEE Latin American
Conference on, 2016, pp. 1–6.
[34] E. T. Tan and Z. A. Halim, “Performance evaluation of genetic algorithm to solve
hardware-software partitioning design: A factorial design analysis”, in Region 10 Conference, TenCon IEEE, 2017, pp. 439–442.
[35] N. Hou, F. He, Y. Zhou, Y. Chen, and X. Yan, “A parallel genetic algorithm with
dispersion correction for hw/sw partitioning on multi-core cpu and many-core gpu”,
IEEE Access, vol. 6, pp. 883–898, 2018.

[36] Y. Jiang, H. Zhang, X. Jiao, X. Song, W. N. Hung, M. Gu, and J. Sun, “Uncertain
model and algorithm for hardware/software partitioning”, in VLSI (ISVLSI), IEEE
Computer Society Annual Symposium on, 2012, pp. 243–248.
[37] L. An, F. Jinfu, L. Xiaolong, and Y. Xiaotian, “Algorithm of hardware/software
partitioning based on genetic particle swarm optimization”, Journal of Computer-Aided Design & Computer Graphics, vol. 6, p. 005, 2010.
[38] I. Mhadhbi, S. Ben Othman, and S. Ben Saoud, “An efficient technique for hardware/software partitioning process in codesign”, Scientific Programming, vol. 2016,
p. 10, 2016.
[39] Y. Wu, H. Zhang, and H. Yang, “Research on parallel hw/sw partitioning based on
hybrid pso algorithm”, in International conference on algorithms and architectures
for parallel processing, 2009, pp. 449–459.
[40] T. Eimuri and S. Salehi, “Using dpso and b&b algorithms for hardware/software
partitioning in co-design”, in Second international conference on computer research
and development, 2010, pp. 416–420.
[41] X. Yan, F. He, N. Hou, and H. Ai, “An efficient particle swarm optimization for
large-scale hardware/software co-design system”, International Journal of Cooperative
Information Systems, vol. 27, no. 01, p. 1741001, 2018.
[42] G. Lin, W. Zhu, and M. M. Ali, “A tabu search-based memetic algorithm for hardware/software partitioning”, Mathematical Problems in Engineering, vol. 2014, 2014.
[43] C. Zhang, R. Chen, and V. Prasanna, “High throughput large scale sorting on a
cpu-fpga heterogeneous platform”, in Parallel and Distributed Processing Symposium
Workshops, IEEE International, 2016, pp. 148–155.
[44] G. Busonera, S. Carta, A. Marongiu, and L. Raffo, “Automatic application partitioning on fpga/cpu systems based on detailed low-level information”, in Digital System
Design: Architectures, Methods and Tools, 9th EUROMICRO Conference on, 2006,
pp. 265–268.
[45] K. Atasu, L. Pozzi, and P. Ienne, “Automatic application-specific instruction-set extensions under microarchitectural constraints”, International Journal of Parallel Programming, vol. 31, no. 6, pp. 411–428, 2003.
[46] M. Brogioli, P. Radosavljevic, and J. R. Cavallaro, “Hardware/software co-design
methodology and dsp/fpga partitioning: A case study for meeting real-time processing
deadlines in 3.5 g mobile receivers”, in IEEE International Midwest Symposium on
Circuits and Systems, 2006.
[47] M. Riabi, Y. Manai, and J. Haggege, “Hardware/software codesign approach for heterogeneous mpsoc system”, International Journal of Computer Science and Network
Security, vol. 18, no. 1, pp. 10–17, 2018.

[48] T. Zhang, X. Zhao, X. An, H. Quan, and Z. Lei, “Using blind optimization algorithm
for hardware/software partitioning”, IEEE Access, vol. 5, pp. 1353–1362, 2017.
[49] Z. Wang, B. He, and W. Zhang, “A study of data partitioning on opencl-based fpgas”,
in Field Programmable Logic and Applications (FPL) 25th International Conference
on, 2015, pp. 1–8.
[50] G. Stitt, F. Vahid, and S. Nematbakhsh, “Energy savings and speedups from partitioning critical software loops to hardware in embedded systems”, ACM Transactions
on Embedded Computing Systems (TECS), vol. 3, no. 1, pp. 218–232, 2004.
[51] J. Henkel, “A low power hardware/software partitioning approach for core-based embedded systems”, in Proceedings of the 36th annual ACM/IEEE Design Automation
Conference, 1999, pp. 122–127.
[52] I. Ghribi, R. B. Abdallah, M. Khalgui, Z. Li, K. Alnowibet, and M. Platzner, “R-codesign: Codesign methodology for real-time reconfigurable embedded systems under
energy constraints”, IEEE Access, vol. 6, pp. 14078–14092, 2018.
[53] W. Jiang and V. K. Prasanna, “A fpga-based parallel architecture for scalable high-speed packet classification”, in Application-specific Systems, Architectures and Processors, 20th IEEE International Conference on, 2009, pp. 24–31.
[54] J. L. Tripp, A. A. Hanson, M. Gokhale, and H. Mortveit, “Partitioning hardware and
software for reconfigurable supercomputing applications: A case study”, in Proceedings
of the ACM/IEEE conference on Supercomputing, 2005, p. 27.
[55] M. Brogioli, P. Radosavljevic, and J. R. Cavallaro, “Hardware/software co-design
methodology and dsp/fpga partitioning: A case study for meeting real-time processing
deadlines in 3.5 g mobile receivers”, in IEEE International Midwest Symposium on
Circuits and Systems, 2006.
[56] J. H. Kelm, I. Gelado, M. J. Murphy, N. Navarro, S. Lumetta, and W.-m. Hwu, “Cigar:
Application partitioning for a cpu/coprocessor architecture”, in Proceedings of the
16th International Conference on Parallel Architecture and Compilation Techniques,
2007, pp. 317–326.
[57] F. Ghaffari, M. Benjemaa, and M. Auguin, “Algorithms for the partitioning of applications containing variable duration tasks on reconfigurable architectures”, in Computer
Systems and Applications, Book of Abstracts. ACS/IEEE International Conference
on, 2003, p. 13.
[58] Y.-K. Kwok, A. A. Maciejewski, H. J. Siegel, I. Ahmad, and A. Ghafoor, “A semi-static approach to mapping dynamic iterative tasks onto heterogeneous computing
systems”, Journal of Parallel and Distributed Computing, vol. 66, no. 1, pp. 77–98,
2006.


[59] R. Lysecky and F. Vahid, “A study of the speedups and competitiveness of fpga soft
processor cores using dynamic hardware/software partitioning”, in Proceedings of the
conference on Design, Automation and Test in Europe, 2005, pp. 18–23.
[60] G. Stitt, R. Lysecky, and F. Vahid, “Dynamic hardware/software partitioning: A first
approach”, in Proceedings of the 40th annual Design Automation Conference, 2003,
pp. 250–255.
[61] R. Lysecky and F. Vahid, “A configurable logic architecture for dynamic hardware/software partitioning”, in Design, Automation and Test in Europe Conference and
Exhibition, Proceedings, vol. 1, 2004, pp. 480–485.
[62] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, “System level hardware/software
partitioning based on simulated annealing and tabu search”, Design automation for
embedded systems, vol. 2, no. 1, pp. 5–32, 1997.
[63] M. Gately, Y. Zhai, M. Yeary, E. Petrich, and L. Sawalha, “A three-dimensional
swept volume display based on led arrays”, Journal of Display Technology, vol. 7,
no. 9, pp. 503–514, 2011.
[64] G. Zhong, A. Dubey, T. Cheng, and T. Mitra, “Synergy: A hw/sw framework for high
throughput cnns on embedded heterogeneous soc”, arXiv preprint arXiv:1804.00706,
2018.
[65] N. Govil and S. R. Chowdhury, “High performance and low cost implementation of
fast fourier transform algorithm based on hardware software co-design”, in Region 10
Symposium, IEEE, 2014, pp. 403–407.
[66] S. Ossif, “Applications of fpgas in high-performance adaptive channel equalization”,
2016.
[67] F. B. Muslim, A. Demian, L. Ma, L. Lavagno, and A. Qamar, “Energy-efficient fpga
implementation of the k-nearest neighbors algorithm using opencl.”, in FedCSIS Position Papers, 2016, pp. 141–145.
[68] S. H. Gerez, Algorithms for VLSI design automation. Wiley New York, 1999, vol. 8.
[69] S. Yousuf and A. Gordon-Ross, “An automated hardware/software co-design flow for
partially reconfigurable fpgas”, in VLSI (ISVLSI), IEEE Computer Society Annual
Symposium on, 2016, pp. 30–35.
[70] A. Iguider, M. Chami, O. Elissati, and A. En-Nouaary, “Embedded systems hw/sw
partitioning based on lagrangian relaxation method”, in Proceedings of the Mediterranean Symposium on Smart City Applications, 2017, pp. 149–160.
[71] T. Streichert, C. Haubelt, and J. Teich, “Online hardware/software partitioning in
networked embedded systems”, in Proceedings of the Asia and South Pacific Design
Automation Conference, 2005, pp. 982–985.

[72] G. Stitt and F. Vahid, “Energy advantages of microprocessor platforms with on-chip
configurable logic”, IEEE Design & Test of Computers, vol. 19, no. 6, pp. 36–43, 2002.
[73] ——, “A decompilation approach to partitioning software for microprocessor/fpga
platforms”, in Proceedings of the conference on Design, Automation and Test in Europe, 2005, pp. 396–397.
[74] P. Waldeck and N. Bergmann, “Dynamic hardware-software partitioning on reconfigurable system-on-chip”, in System-on-Chip for Real-Time Applications, Proceedings.
The 3rd IEEE International Workshop on, 2003, pp. 102–105.
[75] N. N. Bình, M. Imai, A. Shiomi, and N. Hikichi, “A hardware/software partitioning
algorithm for designing pipelined asips with least gate counts”, in Proceedings of the
33rd annual Design Automation Conference, 1996, pp. 527–532.
[76] P. Eles, K. Kuchcinski, Z. Peng, and A. Doboli, “Hardware/software partitioning
of vhdl system specifications”, in Proceedings of the conference on European design
automation, 1996, pp. 434–439.
[77] R. R. Schaller, “Moore’s law: Past, present, and future”, IEEE Spectr., vol. 34, no. 6,
pp. 52–59, 1997.
[78] P. V. Knudsen and J. Madsen, “Pace: A dynamic programming algorithm for hardware/software partitioning”, in Proceedings of the 4th International Workshop on
Hardware/Software Co-Design, 1996, p. 85.
[79] J. Wu and T. Srikanthan, “Low-complex dynamic programming algorithm for hardware/software partitioning”, Information processing letters, vol. 98, no. 2, pp. 41–46,
2006.
[80] K. S. Chatha and R. Vemuri, “Hardware-software partitioning and pipelined scheduling of transformative applications”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 3, pp. 193–208, 2002.
[81] T. Wiangtong, P. Y. Cheung, and W. Luk, “Comparing three heuristic search methods
for functional partitioning in hardware–software codesign”, Design Automation for
Embedded Systems, vol. 6, no. 4, pp. 425–449, 2002.
[82] F. Vahid, D. D. Gajski, and J. Gong, “A binary-constraint search algorithm for minimizing hardware during hardware/software partitioning”, in Proceedings of the conference on European design automation, 1994, pp. 214–219.
[83] D. Wang, S. Li, and Y. Dou, “Collaborative hardware/software partition of coarse-grained reconfigurable system using evolutionary ant colony optimization”, in Proceedings of the Asia and South Pacific Design Automation Conference, 2008, pp. 679–684.
[84] V. Catania, M. Malgeri, and M. Russo, “Applying fuzzy logic to codesign partitioning”, IEEE Micro, vol. 17, no. 3, pp. 62–70, 1997.
[85] M. Koudil, K. Benatchba, A. Tarabet, and E. B. Sahraoui, “Using artificial bees to
solve partitioning and scheduling problems in codesign”, Applied Mathematics and
Computation, vol. 186, no. 2, pp. 1710–1722, 2007.
[86] G. Stitt and F. Vahid, “Hardware/software partitioning of software binaries”, in Proceedings of the IEEE/ACM international conference on Computer-aided design, 2002,
pp. 164–170.
[87] P. S. B. do Nascimento and M. E. de Lima, “Temporal partitioning for image processing based on time-space complexity in reconfigurable architectures”, in Design,
Automation and Test in Europe, DATE’06. Proceedings, vol. 1, 2006, pp. 1–6.
[88] N. Govil, R. Shrestha, and S. R. Chowdhury, “Pgma: An algorithmic approach for
multi-objective hardware software partitioning”, Microprocessors and Microsystems,
vol. 54, pp. 83–96, 2017.
[89] Q. Liu, W. Cai, J. Shen, Z. Fu, X. Liu, and N. Linge, “A speculative approach to
spatial-temporal efficiency with multi-objective optimization in a heterogeneous cloud
environment”, Security and Communication Networks, vol. 9, no. 17, pp. 4002–4012,
2016.
[90] Intel, Hardware Accelerator Research Program, en. [Online]. Available:
https://software.intel.com/en-us/hardware-accelerator-research-program (visited
on 12/01/2018).
[91] T. Faict, “Exploring OpenCL on a CPU-FPGA Heterogeneous Architecture Research
Platform (HARP)”, en,
[92] dlmulni1, Intel Xeon Processor Scalable Family Technical Overview, en. [Online].
Available: https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview (visited on 01/11/2019).
[93] Performance Characteristics of Common Transports and Buses, en-US. [Online]. Available: https://www.microway.com/knowledge-center-articles/performance-characteristics-of-common-transports-buses/ (visited on 01/14/2019).
[94] “Accelerator Functional Unit (AFU) Developer’s Guide for Intel”, en, p. 46,
[95] Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCIP) Reference Manual. [Online]. Available:
https://www.intel.com/content/www/us/en/programmable/documentation/buf1506187769663.html (visited
on 12/01/2018).
[96] Intel Acceleration Stack for Intel Xeon CPU with FPGAs 1.0 Errata. [Online]. Available:
https://www.intel.com/content/www/us/en/programmable/documentation/ffv1519794536166.html (visited on 12/01/2018).

[97] Intel® FPGA SDK for OpenCL™, en. [Online]. Available:
https://www.intel.com/content/www/us/en/software/programmable/sdk-for-opencl/overview.html
(visited on 02/12/2019).
[98] Hardware Accelerator Research Program, en. [Online]. Available:
https://software.intel.com/en-us/hardware-accelerator-research-program (visited on 02/14/2019).
[99] M. Almorsy, J. Grundy, and I. Müller, “An analysis of the cloud computing security
problem”, arXiv preprint arXiv:1609.01107, 2016.
[100] M. Babitha and K. R. Babu, “Secure cloud storage using aes encryption”, in International Conference on Automatic Control and Dynamic Optimization Techniques
(ICACDOT), 2016, pp. 859–864.
[101] B.-H. Lee, E. K. Dewi, and M. F. Wajdi, “Data security in cloud computing using
aes under heroku cloud”, in 27th Wireless and Optical Communication Conference
(WOCC), 2018, pp. 1–5.
[102] C.-C. Lu and S.-Y. Tseng, “Integrated design of aes (advanced encryption standard) encrypter and decrypter”, in Proceedings IEEE International Conference on
Application-Specific Systems, Architectures, and Processors, 2002, pp. 277–285.
[103] The State of Video Marketing in 2018 [Infographic]. [Online]. Available:
https://www.socialmediatoday.com/news/the-state-of-video-marketing-in-2018-infographic/518339/.
[104] S. Rahamneh and L. Sawalha, “An opencl-based acceleration for canny algorithm using a heterogeneous cpu-fpga platform”, in IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019, pp. 322–322.
[105] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory”, in MHS.
Proceedings of the Sixth International Symposium on Micro Machine and Human
Science, 1995, pp. 39–43.
[106] S. G. Li, F. J. Feng, H. J. Hu, C. Wang, and D. Qi, “Hardware/software partitioning
algorithm based on genetic algorithm.”, JCP, vol. 9, no. 6, pp. 1309–1315, 2014.
[107] 61 Social Media Statistics to Bookmark for 2018, en-US, 2018. [Online]. Available:
https://sproutsocial.com/insights/social-media-statistics/ (visited on
11/30/2018).
[108] L. Shao, R. Yan, X. Li, and Y. Liu, “From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms”, IEEE
Transactions on Cybernetics, vol. 44, no. 7, pp. 1001–1013, 2014.
[109] W. Cao, Y. Zhou, C. P. Chen, and L. Xia, “Medical image encryption using edge
maps”, Signal Processing, vol. 132, pp. 96–109, 2017.
[110] G. N. Chaple, R. Daruwala, and M. S. Gofane, “Comparisions of robert, prewitt, sobel
operator based edge detection methods for real time uses on fpga”, in Technologies
for Sustainable Development (ICTSD), International Conference on, 2015, pp. 1–4.
[111] Q. Xu, S. Varadarajan, C. Chakrabarti, and L. J. Karam, “A distributed canny edge
detector: Algorithm and fpga implementation”, IEEE Transactions on Image Processing, vol. 23, no. 7, pp. 2944–2960, 2014.
[112] F. A. Pellegrino, W. Vanzella, and V. Torre, “Edge detection revisited”, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 34, no. 3,
pp. 1500–1518, 2004.
[113] L. H. Lourenco, D. Weingaertner, and E. Todt, “Efficient implementation of canny
edge detection filter for itk using cuda”, in 13th Symposium on Computer Systems,
2012, pp. 33–40.
[114] S. Niu, J. Yang, S. Wang, and G. Chen, “Improvement and parallel implementation of
canny edge detection algorithm based on gpu”, in 9th IEEE International Conference
on ASIC, 2011, pp. 641–644.
[115] J. Lee, H. Tang, and J. Park, “Energy efficient canny edge detector for advanced
mobile vision applications”, IEEE Transactions on Circuits and Systems for Video
Technology, vol. 28, no. 4, pp. 1037–1046, 2016.
[116] X. Li, J. Jiang, and Q. Fan, “An improved real-time hardware architecture for canny
edge detection based on fpga”, in Third International Conference on Intelligent Control and Information Processing, 2012, pp. 445–449.
[117] W. He and K. Yuan, “An improved canny edge detector and its realization on fpga”,
in Intelligent Control and Automation, WCICA 2008. 7th World Congress on, 2008,
pp. 6561–6564.
[118] H. S. Neoh and A. Hazanchuk, “Adaptive edge detection for real-time video processing
using fpgas”, Global Signal Processing, vol. 7, no. 3, pp. 2–3, 2004.
[119] P. Musa and N. F. Irmawati, “Hardware software co-simulation and real-time video
processing for edge detection using matlab simulink model blockset”, Computer Engineering and Intelligent Systems, vol. 7, no. 1, pp. 43–56, 2016.
[120] H. Quinn, L. A. S. King, M. Leeser, and W. Meleis, “Runtime assignment of reconfigurable hardware components for image processing pipelines”, in Field-Programmable
Custom Computing Machines, FCCM 11th Annual IEEE Symposium on, 2003, pp. 173–182.
[121] C. Gentsos, C.-L. Sotiropoulou, S. Nikolaidis, and N. Vassiliadis, “Real-time canny
edge detection parallel implementation for fpgas”, in Electronics, Circuits, and Systems (ICECS), 17th IEEE International Conference on, 2010, pp. 499–502.
[122] P. S. Deokar, “Implementation of canny edge detector algorithm using fpga”, IJISET
Int. J. Innov. Sci. Eng. Technol, vol. 2, pp. 112–115, 2015.
[123] E. Lübbers, “Accelerating Data Center Workloads with FPGAs”, en, p. 26,
[124] Intel Xeon Processor E5-2680 v4 (35M Cache, 2.40 GHz) Product Specifications, en.
[Online]. Available: https://ark.intel.com/products/91754/Intel-Xeon-Processor-E5-2680-v4-35M-Cache-2-40-GHz- (visited on 01/15/2019).
[125] Early Power Estimators and Power Analyzer. [Online]. Available:
https://www.intel.com/content/www/us/en/programmable/support/support-resources/operation-and-testing/power/pow-powerplay.html (visited on 12/18/2018).
[126] Intel FPGA SDK for OpenCL Pro Edition: Programming Guide. [Online]. Available:
https://www.intel.com/content/www/us/en/programmable/documentation/mwh1391807965224.html (visited on 01/24/2020).
[127] Y. Luo and R. Duraiswami, “Canny edge detection on nvidia cuda”, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops,
2008, pp. 1–8.
[128] D. V. Rao and M. Venkatesan, “An efficient reconfigurable architecture and implementation of edge detection algorithm using handle-c”, in International Conference
on Information Technology: Coding and Computing, Proceedings. ITCC, vol. 2, 2004,
pp. 843–847.
[129] I. Mhadhbi, S. B. Othman, and S. B. Saoud, “A comprehensive survey on hardware/software partitioning process in co-design”, International Journal of Computer
Science and Information Security (IJCSIS), vol. 14, no. 3, 2016.
[130] I. C. Trelea, “The particle swarm optimization algorithm: Convergence analysis and
parameter selection”, Information processing letters, vol. 85, no. 6, pp. 317–325, 2003.
[131] M. Juneja and S. Nagar, “Particle swarm optimization algorithm and its parameters:
A review”, in International Conference on Control, Computing, Communication and
Materials (ICCCCM), 2016, pp. 1–5.
[132] Y. Shi and R. C. Eberhart, “Parameter selection in particle swarm optimization”, in
International conference on evolutionary programming, 1998, pp. 591–600.
[133] A. Farmahini-Farahani, M. Kamal, S. M. Fakhraie, and S. Safari, “Hw/sw partitioning using discrete particle swarm”, in Proceedings of the 17th ACM Great Lakes
symposium on VLSI, 2007, pp. 359–364.
[134] The LLVM Compiler Infrastructure Project. [Online]. Available: https://llvm.org/
(visited on 11/06/2019).
[135] Graphviz - Graph Visualization Software. [Online]. Available: https://www.graphviz.org/
(visited on 11/06/2019).
[136] A. Azizi-Mazreah, M. T. M. Shalmani, H. Barati, and A. Barati, “Delay and energy
consumption analysis of conventional sram”, in Proc. Of World Academy of Science,
Engineering and Technology, vol. 27, 2008.
[137] P. Bhattacharjee and A. Majumder, “A variation-aware robust gated flip-flop for
power-constrained fsm application”, Journal of Circuits, Systems and Computers,
p. 1950108, 2018.
[138] L. A. Montalvo, K. K. Parhi, and J. H. Satyanarayana, “Estimation of average energy
consumption of ripple-carry adder based on average length carry chains”, in VLSI
signal processing, IX, 1996, pp. 189–198.
[139] HP Labs : CACTI. [Online]. Available: https://www.hpl.hp.com/research/cacti/
(visited on 10/29/2019).
[140] V. Bushaev, Adam latest trends in deep learning optimization. Oct. 2018. [Online].
Available: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c (visited on 01/09/2020).
[141] P. Garg, “A comparison between memetic algorithm and genetic algorithm for the
cryptanalysis of simplified data encryption standard algorithm”, arXiv preprint arXiv:1004.0574, 2010.
[142] J. T. Feo, A comparative study of parallel programming languages: the Salishan problems. 2016.
[143] Y.-k. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei, “A quantitative
analysis on microarchitectures of modern cpu-fpga platforms”, in Proceedings of the
53rd Annual Design Automation Conference, 2016, p. 109.

.1. Partitioning Cost Results for the Different Benchmarks

Table 1: Partitioning cost of k-means benchmark.

Population size-itr 10  GA           MA           PSO        APSO       LPSO
10                      134606338.8  74551824.54  470510.6   444852.6   444852.6
20                      170686336.3  104587796    451267.1   451267.1   438438.1
30                      208681785.7  92606148.14  451267.1   432023.6   432023.6
40                      196691973.1  64564743.12  457681.6   444852.6   432023.6
50                      214707113.3  86580820.46  451267.1   438438.1   438438.1
60                      224700776.4  76587157.38  457681.6   444852.6   432023.6
70                      238701304    62586629.82  457681.6   438438.1   438438.1
80                      234725464.6  67232148.33  470510.6   451267.1   432023.6
90                      248727999.4  52559934.4   470510.6   432023.6   432023.6
100                     256720980.7  49454902.87  457681.6   444852.6   432023.6

Population size-itr 30  GA           MA           PSO        APSO       LPSO
10                      134606338.8  74551824.54  474786.93  457681.6   432023.6
20                      170686336.3  104587796    457681.6   457681.6   432023.6
30                      208681785.7  92606148.14  451267.1   432023.6   432023.6
40                      196691973.1  71681795.73  457681.6   444852.6   432023.6
50                      214707113.3  86580820.46  451267.1   432023.6   432023.6
60                      224700776.4  76587157.38  457681.6   438438.1   432023.6
70                      238701304    50577650.3   457681.6   432023.6   432023.6
80                      189711425.7  60579059.32  470510.6   438438.1   432023.6
90                      248727999.4  52559934.4   470510.6   432023.6   432023.6
100                     256720980.7  44566953.14  457681.6   438438.1   432023.6

Population size-itr 60  GA           MA           PSO        APSO       LPSO
10                      227826014.9  59393111.41  457681.6   457681.6   432023.6
20                      167356292.5  80580663.3   464096.1   438438.1   432023.6
30                      208681785.7  82603595.44  451267.1   438438.1   432023.6
40                      194022350.1  54562190.5   457681.6   432023.6   432023.6
50                      214707113.3  72575756.5   451267.1   438438.1   432023.6
60                      213795735.4  67726447.45  457681.6   432023.6   432023.6
70                      203497860    70200538.52  457681.6   439507.18  432023.6
80                      204398215.3  69573627.43  451267.1   438438.1   432023.6
90                      204285670.9  69651991.32  451267.1   438438.1   432023.6
100                     204285670.9  69651991.32  451267.1   438438.1   432023.6

Table 2: Partitioning cost of Canny benchmark.

Population size-itr 10  GA           MA           PSO        APSO       LPSO
10                      471747.36    386854.34    150075.88  97476.5    79451.02
20                      499765.44    395193.04    94668.2    50798.3    48327.9
30                      495949.06    400414.14    58737.1    36414.4    37603.3
40                      488711.34    406114.12    42410      24600      35276.2
50                      481758.18    412143.78    39915.9    24600      34046.4
60                      483643.72    407727.12    36739.7    24600      26942.4
70                      493925.44    399906.64    36418.6    26997.8    29310.4
80                      494200.24    402162.96    37603.3    24600      24600
90                      516109.14    380254.06    39107.7    24600      24600
100                     502146.4889  387241.7     35156.2    24600      29284.8

Population size-itr 30  GA           MA           PSO        APSO       LPSO
10                      476528       393035.8     127287.18  82537.7    76439.66
20                      499765.44    395193.04    133454.6   63503.02   72469.4
30                      495949.06    400414.14    57765.1    31733.8    34020.8
40                      488711.34    406114.12    41222.5    30495.1    29310.4
50                      481758.18    412143.78    43819.5    26942.4    29310.4
60                      483643.72    415267.17    24600      24600      24600
70                      493925.44    402437.76    35316.3    29336      29336
80                      494200.24    402162.96    37599.1    24600      26942.4
90                      516109.14    380254.06    26968      24600      24600

Population size-itr 60  GA           MA           PSO        APSO       LPSO
10                      476528       393035.8     123411.84  73315.36   55067.52
20                      499765.44    395193.04    85331.18   31678.4    58052.88
30                      495949.06    400414.14    59582.72   26968      47185.04
40                      488711.34    406114.12    44035.9    26968      30445.8
50                      481758.18    412143.78    43692.5    24600      26942.4
60                      483643.72    411936.3778  39895.5    26942.4    24600
70                      497374.3556  399906.64    31682.6    24600      29310.4
80                      494200.24    401952.5333  24600      24600      24600
90                      494200.24    401952.5333  24600      24600      24600
100                     494200.24    401952.5333  24600      24600      24600

Table 3: Partitioning cost of AES benchmark.

Population size-itr 10  GA           MA           PSO        APSO       LPSO
10                      373775.2     296211.32    279877.54  243368.18  227499.68
20                      413314.68    279059.02    218712.06  196378.52  192118
30                      420355.06    283520.08    218174.48  192118     198648.1
40                      419424.58    301508.5     199155.08  192118     194614.18
50                      425305.58    288333.34    196850.44  192118     192118
60                      442556.94    277191.06    194614.18  192118     192118
70                      413177.64    300461.28    192118     192118     192118
80                      444560.24    268930.24    198648.1   192118     192118
90                      443295.38    276452.62    194614.18  192118     192118
100                     447619.58    272128.42    192118     192118     192118

Population size-itr 30  GA           MA           PSO        APSO       LPSO
10                      371144.74    293028.48    249671.38  209890.58  194614.18
20                      413314.68    279059.02    254289.96  221355.38  194614.18
30                      419291.86    281483.4     211170.72  192118     196151.92
40                      418239.5     300323.42    210719.26  192118     192118
50                      425305.58    288333.34    196898.64  192118     197110.36
60                      442556.94    277191.06    194614.18  192118     192118
70                      423843.12    300461.28    192118     192118     192118
80                      444560.24    268930.24    192118     192118     192118
90                      443295.38    265787.14    192118     192118     192118
100                     447619.58    272128.42    192118     192118     192118

Population size-itr 60  GA           MA           PSO        APSO       LPSO
10                      371144.74    293028.48    252666.52  218393.42  200412.44
20                      413314.68    279059.02    233064.08  192118     198874.7
30                      419291.86    281483.4     201199.8   192118     192118
40                      418239.5     300323.42    192118     192118     192118
50                      425305.58    288333.34    192118     192118     192118
60                      442556.94    277191.06    192118     192118     192118
70                      423843.12    300461.28    192118     192118     192118
80                      444560.24    268930.24    192118     192118     192118
90                      443295.38    265787.14    192118     192118     192118
100                     447619.58    272128.42    192118     192118     192118

