Tsinghua Science and Technology
Volume 20

Issue 6

Article 12

2015

Simultaneous Accelerator Parallelization and Point-to-Point
Interconnect Insertion for Bus-Based Embedded SoCs
Daming Zhang
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.

Yongpan Liu
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.

Shuangchen Li
Department of Electronic and Computer Engineering, University of California, Santa Barbara, CA 93106,
USA.

Tongda Wu
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.

Huazhong Yang
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.


Recommended Citation
Daming Zhang, Yongpan Liu, Shuangchen Li et al. Simultaneous Accelerator Parallelization and Point-to-Point Interconnect Insertion for Bus-Based Embedded SoCs. Tsinghua Science and Technology 2015,
20(6): 644-660.


TSINGHUA SCIENCE AND TECHNOLOGY
ISSN 1007-0214 12/13 pp644-660
Volume 20, Number 6, December 2015

Simultaneous Accelerator Parallelization and Point-to-Point
Interconnect Insertion for Bus-Based Embedded SoCs
Daming Zhang, Yongpan Liu, Shuangchen Li, Tongda Wu, and Huazhong Yang
Abstract: As performance requirements for bus-based embedded System-on-Chips (SoCs) increase, more and
more on-chip application-specific hardware accelerators (e.g., filters, FFTs, JPEG encoders, GSMs, and AES
encoders) are being integrated into their designs. These accelerators require system-level tradeoffs among
performance, area, and scalability. Accelerator parallelization and Point-to-Point (P2P) interconnect insertion are
two effective system-level adjustments. The former boosts computing performance at the cost of area,
while the latter provides higher bandwidth at the cost of routability. Moreover, the two interact with each other. This
paper proposes a design flow to optimize accelerator parallelization and P2P interconnect insertion simultaneously.
To explore the huge optimization space, we develop an effective algorithm whose goal is to reduce total SoC latency
under the constraints of SoC area and total P2P wire length. Experimental results show that the performance
difference between our proposed algorithm and the optimal results is only 2.33% on average, while the running
time of the algorithm is less than 17 s.
Key words: accelerator parallelization; point-to-point interconnect insertion; bus-based embedded system-on-chips

1 Introduction

• Daming Zhang, Yongpan Liu, Tongda Wu, and Huazhong
Yang are with Department of Electronic Engineering,
Tsinghua University, Beijing 100084, China. E-mail:
zdm06@mails.tsinghua.edu.cn; ypliu@mail.tsinghua.edu.cn;
wutongda@163.com; yanghz@mail.tsinghua.edu.cn.
• Shuangchen Li is with Department of Electronic and Computer
Engineering, University of California, Santa Barbara, CA
93106, USA. E-mail: shuangchenli@ece.ucsb.edu.
• To whom correspondence should be addressed.
Manuscript received: 2015-11-08; accepted: 2015-11-16

Embedded System-on-Chips (SoCs) are widely used in
many fields, such as packet processing[1], multimedia
processing[2], and health monitoring[3, 4]. As
application workloads increase, designers are seeking
techniques to improve performance within acceptable
power budgets. Application-specific hardware
accelerators (e.g., filters, FFTs, JPEG encoders,
GSMs, and AES encoders[5]) seem promising[2, 4]. As
Ref. [6] shows, embedded SoCs with many specific
accelerators have attained up to an elevenfold
improvement in energy efficiency. It is no surprise
that modern embedded SoCs integrate more and more
accelerators. The large number of accelerators leads to
a huge design space, where accelerator parallelization
and interconnection are two key design adjustments.
Accelerator parallelization, which means one
hardware block contains multiple identical functional
units, is used to execute tasks simultaneously, and
improves system performance. Therefore, many
researchers investigate the parallel opportunity to
boost system performance within reasonable area
overhead. References [2, 7, 8] adopted multiple JPEG
accelerators in heterogeneous media SoCs, while Ref.
[9] developed a parallelization method for H264/AVC
on MPSoCs. References [10, 11] found optimal
parallel degrees based on the StreamIt language and
MILP formulations. Reference [12] developed a code
optimization framework for accelerator parallelization
within limited FPGA hardware resources. However,
all of these focused on streaming applications.
Their solutions are not applicable to bus-based
embedded SoCs, where communication conflicts
among accelerators need to be considered.
Interconnection of multiple accelerators on
embedded SoCs has been effected using two major
architectures: Network-on-Chip (NoC) and bus. NoC is
good at scalability, which is mandatory for connecting
hundreds of accelerators[13] . However, it suffers from
large power and area overheads. As most embedded
applications have dedicated communication patterns,
general-purpose NoC architectures[14, 15] are both
unnecessary and inefficient. Bus is more area- and
power-efficient[16, 17] , but communication conflicts
and delays are unacceptable when the number of
accelerators becomes large[1] . Researchers are seeking
interconnection architectures with hybrid techniques
for embedded SoCs.
Recently, bus-based architecture with Point-to-Point (P2P) channels is receiving attention in many
mainstream designs of embedded SoCs, such as Intel’s
Medfield smart phone[18] , Samsung’s Galaxy smart
phone[19] , the COOKOO smart watch (Bluetooth SIG
Inc.)[20] , and the credit-card sized computer Raspberry
Pi (Raspberry Pi Foundation)[21] . Reference [22]
introduced P2P channels to reduce bus conflicts and
achieve higher bandwidths. Reference [23] investigated
bus matrix synthesis considering P2P interconnect
insertion. Reference [24] developed an MPEG2 decoder
using a bus with P2P channels to relieve heavy data
transmission. Recently, Ref. [25] used NoC routers to
realize P2P interconnection for hardware accelerators
on a bus-based embedded SoC. However, too many P2P
channels cause routing and scaling challenges in layout
design. For example, Ref. [26] found that the total
wire length of $n$ fully P2P-connected square blocks
in a regular mesh is $(1/3) \times n\sqrt{n}\,(n-1)\,l$, where $l$ is
the side length of each block. Therefore, systematic
optimization of the bus-based architecture with P2P
channels is necessary. This also applies to multiple
bus-based architectures[27] , because these architectures
consist of several single-bus-based systems, where each
system contains a shared bus and some P2P channels.
In general, parallel accelerators boost computing-intensive applications, while P2P channels reduce
the memory access time for memory-intensive
applications. Both reduce performance bottlenecks
and interact with each other on bus-based embedded
SoCs, which implies the necessity of simultaneous
exploration. To our knowledge, this is the first study
of simultaneous accelerator parallelization and P2P
interconnect insertion on bus-based embedded SoCs.
Our contributions include:
• Proposing a general design flow to optimize both
accelerator parallel degrees and P2P channels
on bus-based embedded SoCs with an
architecture-level system model.
• Developing an effective algorithm to solve the
simultaneous optimization problem. The algorithm
contains three steps. Step 1 generates several
optimization subspaces, using a graph-based
method for latency estimation on bus-based
embedded SoCs. Step 2 uses a greedy
method to get the initial solution and the
optimization direction in each subspace. Step 3
gets the optimal solution from all subspaces by
an improved simulated annealing algorithm based
on the optimized initial solutions and optimization
directions.
• Validating the proposed method on bus-based
embedded SoCs with several benchmarks.
Experimental results show that the performance
difference between our proposed algorithm and the
optimal results is only 2.33% on average, while the
running time of the algorithm is less than 17 s.
The paper is organized as follows. Section 2
illustrates the proposed design flow and the motivation
for simultaneous optimization. Section 3 describes the
system model of the bus-based embedded SoCs. The
proposed algorithm is given in Section 4. Section
5 presents the experimental results and Section 6
concludes the paper.

2 Overview

This section describes the proposed design flow for bus-based
embedded SoCs with accelerator parallelization
and P2P interconnect insertion. After that, we explain
the motivation and show the challenges of simultaneous
optimization.

2.1 Proposed design flow

Figure 1 describes the proposed design flow. A
specific embedded SoC has several applications, and
each of them is modeled by data transfer subgraphs
($G^1$–$G^4$). The nodes in the subgraphs represent the
accelerators (1–9) with their local memories. The
edges denote data transmission between accelerators.
Generally, there are branches and feedback loops in the
subgraphs. Specific schedules define each application’s
execution priority. The optimization constraints, given
by designers, include both the area of the SoC and the
total wire length of the P2P channels.

Fig. 1 Proposed design flow. Input: application subgraphs ($G^1$–$G^4$), schedule, and constraints. Optimization: parallel degrees of accelerators and P2P channels. Output: optimized hardware system (processor, main memory, bus, paralleled accelerators with local memories, controllers, P2P channels, and interfaces).

Fig. 2 Optimization effects of accelerator parallelization and P2P interconnect insertion. Panels: ECG case and R2 case. X axis: area overheads and extra P2P channels (%). Y axis: performance improvement (%). Series: accelerator parallelization, P2P interconnect insertion, and joint optimizations.
Based on the application subgraphs and schedule
operating on the embedded SoC, an algorithm is
developed to optimize the accelerator parallel degrees
and P2P channels simultaneously under the constraints
(see Section 4). After that, the optimized hardware
system is built. It contains a processor, a main memory,
a bus, and all accelerators (ACCs). Each accelerator
has a local memory (LM). Small controllers and P2P
interfaces are used to support accelerator parallelization
and P2P interconnect insertion.
2.2 Motivation and challenges

Following the proposed design flow in Section 2.1,
Fig. 2 demonstrates the performance improvements
of two benchmarks (see Section 5.1) with either
accelerator parallelization or P2P interconnect insertion
under different constraints. The X axis shows
the area overheads and extra wire length of P2P
channels, which are used as optimization constraints.
The Electrocardiograph (ECG) case is computing-intensive, while the Random 2 (R2) case is memory-intensive. Each benchmark contains several application
subgraphs, and each subgraph has several accelerators.
As we can see, accelerator parallelization is more
effective than P2P interconnect insertion in the ECG
case, and vice versa in the R2 case. Furthermore, the two techniques
interact with each other in the joint optimization, which
implies the necessity of simultaneous exploration.
In order to find the optimal accelerator parallel
degrees and P2P channels on bus-based
embedded SoCs, we need to explore a
huge optimization space for simultaneous accelerator
parallelization and P2P interconnect insertion. All
possible combinations of accelerator parallel degrees
and P2P channels must be considered. Previous
studies[11] showed that the optimal accelerator parallel
degrees for streaming applications were difficult to
determine (the complexity is $O(N_b^4)$, where $N_b$ is
the number of blocks), even though streaming applications
allow the latency to be estimated analytically. In addition,
bus-based embedded SoCs introduce communication
conflicts, which make the latency estimation more
complex and the optimization slower.
We propose an algorithm to address these challenges in
Section 4.

3 System Modeling and Formulation

This section builds the system model and defines the
optimization variables. We then present the problem
formulation of simultaneous accelerator parallelization
and P2P interconnect insertion.
3.1 Application subgraph models (ACC related)

Table 1 summarizes the related parameters and
optimization variables in the system model. As most
embedded applications have dedicated communication
patterns[3, 4, 24] , we assume that the parameters of
hardware accelerators, the processor, the bus, and the
main memory are fixed.
The system has $M$ application subgraphs; each of
them contains several accelerators. The total number
of accelerator types in all subgraphs is $N$. We use
$G^m(V^m, E^m)$ to denote the $m$-th application subgraph,
where $V^m$ represents the accelerator set and $E^m$ is the
edge set.

$$V^m = \{\cdots, ACC_i, \cdots\}, \quad E^m = \{\cdots, d^m_{i,j}, \cdots, d^m_{0,j}\}, \quad i, j \in [1, N],\ m \in [1, M] \tag{1}$$

Table 1 Parameters and variables for system modeling.

Type          | Parameter                 | Description
ACC related   | $in_i$/$exe_i$/$out_i$    | Input/Execution/Output delay per execution
              | $n^m_i$                   | Execution times of $ACC_i$ in $G^m$
              | $a_i$/$ma_i$              | Area of $ACC_i$ and its local memory
              | $d^m_{i,j}$               | Data volume from $ACC_i$ to $ACC_j$ in $G^m$
SoC related   | pa/mma                    | Area of the processor/main memory
              | bt/pt                     | Transmission delay through the bus/a P2P channel
              | bset/ba                   | Setting delay/Area of the bus
              | Schedule                  | Schedule of application subgraphs
Opt. related  | $p_i$                     | Parallel degree of $ACC_i$
              | $l_{i,j}$                 | P2P interconnection from $ACC_i$ to $ACC_j$
              | $A_{const}$/$A_{sum}$     | Area constraint/Optimized area of the SoC
              | $L_{const}$/$L_{sum}$     | Constraint of/Optimized P2P wire length
              | $T_{sum}$                 | Execution delay of the SoC

In Table 1, an accelerator ($ACC_i$) contains six
parameters: the input/execution/output delay per
execution ($in_i$/$exe_i$/$out_i$), the execution times ($n^m_i$)
in $G^m$, the area of the accelerator ($a_i$) and its local
memory ($ma_i$). $d^m_{i,j}$ denotes the data volume from
$ACC_i$ to $ACC_j$ in $G^m$. If $d^m_{i,j} = 0$, there will be no
data transmission from $ACC_i$ to $ACC_j$ in $G^m$. Since
we use the index "0" to represent the main memory of
the SoC, $d^m_{0,j}$ represents the data volume from the main
memory to $ACC_j$ in $G^m$.
Figure 3 presents the workflow of each accelerator.
First, it reads data from the main memory or local
memories of other accelerators; then, it executes several
times to process data and writes results to its local
memory; finally, results are written to the main memory
or the local memories of other accelerators.

Fig. 3 Parameters of application subgraph models.

3.2 Bus-based embedded SoC model (SoC related)

We define the parameters of the main memory, the bus,
and the schedule of application subgraphs in the bus-based
embedded SoC model. pa/mma denotes the area
of the processor/main memory. bt/pt is the transmission
delay for an accelerator to read one data item through
the bus/a P2P channel. bset/ba is the setting delay/the
area of the bus.
Each accelerator’s priority is decided by two factors:
the subgraph priority associated with the accelerator,
and its level in a subgraph. The subgraph priority is
determined by the schedule of application subgraphs
(Schedule), and the accelerator level is decided by a
breadth-first search. In Fig. 4, if two accelerators are in
different subgraphs, the accelerator level will dominate
the accelerator priority. For example, ACC5 in Level 2
will execute after ACC4 in Level 1. If two accelerators
have the same level, the accelerator with a higher
subgraph priority will execute first. Furthermore, if two
accelerators have the same level and subgraph priority,
the accelerator with a smaller index will execute first.
For example, ACC3 has a higher priority than ACC4 in
$G^n$.
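
The three priority rules above (level first, then subgraph priority, then accelerator index) can be read as a lexicographic sort key. The following Python sketch illustrates this reading under stated assumptions: the tuple layout `(acc_index, subgraph_rank, level)`, the function names `bfs_levels` and `execution_order`, and the choice of root nodes for the breadth-first search are all hypothetical and are not taken from the paper.

```python
from collections import deque


def bfs_levels(edges, roots):
    """Assign level 1 to the roots and level k+1 to nodes first reached from level k.

    edges: list of (src, dst) accelerator indices in one subgraph.
    roots: accelerators assumed to start the subgraph (an assumption of this sketch).
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    level = {r: 1 for r in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in level:  # first visit fixes the level, so feedback loops terminate
                level[nxt] = level[node] + 1
                queue.append(nxt)
    return level


def execution_order(instances):
    """Sort (acc_index, subgraph_rank, level) tuples by level, then rank, then index.

    A smaller subgraph_rank means a higher subgraph priority in the schedule.
    """
    return sorted(instances, key=lambda t: (t[2], t[1], t[0]))


# A subgraph with a feedback edge 7 -> 5; levels are fixed on the first visit.
print(bfs_levels([(1, 5), (5, 7), (7, 5)], roots=[1]))
# ACC3 and ACC4 share a level and a subgraph, so the smaller index goes first.
print(execution_order([(5, 0, 2), (4, 1, 1), (3, 1, 1)]))
```

Encoding the rules as a tuple key keeps the tie-breaking order explicit and makes it easy to change if a different schedule policy is assumed.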
3.3 Optimization variables (Opt. related)

Fig. 4 Accelerator priorities in the system model. Schedule $= \{G^1, \cdots, G^m, \cdots, G^n\}$, $m, n \in [1, M]$; subgraphs are ordered from the highest to the lowest priority, and accelerators are arranged in Levels 1 to 5 along the data transmission direction.

We use a sequence $Var$ to define optimization variables
of accelerator parallelization and P2P interconnect
insertion as follows,

$$Var = \{p_1, p_2, \cdots, p_i, \cdots, p_N, \cdots, l_{i,j}, \cdots\}, \quad l_{i,j} \in Var \ \text{if} \ \sum_{m=1}^{M} d^m_{i,j} > 0; \ i, j \in [1, N] \tag{2}$$

where $p_i$ is the parallel degree of $ACC_i$. There is an
upper bound $p_{i\_max}$ for $p_i$ as follows, as $ACC_i$ has a
communication bottleneck $\max(in_i, out_i)$. It means no
more performance improvement is achieved, if $p_i > p_{i\_max}$.

$$1 \le p_i \le p_{i\_max}, \ i \in [1, N]; \quad p_{i\_max} = \lceil (in_i + exe_i + out_i) / \max(in_i, out_i) \rceil \tag{3}$$
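
As a small worked example of the bound in Eq. (3), the sketch below computes $p_{i\_max}$ from the three per-execution delays. It assumes the delays are positive integers (e.g., cycle counts); the function name `max_parallel_degree` and the sample values are illustrative only.

```python
def max_parallel_degree(in_delay, exe_delay, out_delay):
    """Upper bound on the parallel degree, following Eq. (3).

    Assumes positive integer delays (e.g., cycles), so max(in, out) is non-zero.
    """
    total = in_delay + exe_delay + out_delay
    bottleneck = max(in_delay, out_delay)
    return -(-total // bottleneck)  # integer ceiling of total / bottleneck


# in = 2, exe = 10, out = 3: ceil(15 / 3) = 5
print(max_parallel_degree(2, 10, 3))
```

Intuitively, once the number of copies exceeds this bound, the slower of the input and output stages keeps the copies waiting, so further copies add area without reducing delay.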

$l_{i,j}$ stands for the P2P interconnect insertion from $ACC_i$
to $ACC_j$. That is, $l_{i,j} = 1$ if there is a P2P channel
from $ACC_i$ to $ACC_j$; otherwise, $l_{i,j} = 0$. If there
is no data transmission from $ACC_i$ to $ACC_j$ in any
subgraph ($\sum_{m=1}^{M} d^m_{i,j} = 0$), $l_{i,j} = 0$ and it will not be
included in $Var$ for optimization. The total number of
variables in $Var$ is $N + D$. $N$ is the number of
accelerator parallelization variables, which is equal to the number
of accelerator types. $D$ is the number of P2P
interconnect insertion variables.
$A_{const}$ stands for the area constraint of the SoC. $L_{const}$
is the wire length constraint of the P2P channels, which
stands for the routability of the SoC. $A_{sum}$ and $L_{sum}$ are
the optimized area of the SoC and the total wire length
of the P2P channels. $T_{sum}$ is the total execution delay of
the SoC, which is the latency to execute a Schedule.
3.4 Problem formulations

The optimization problem is to find the optimal $Var$
($Var_{opt}$) to minimize $T_{sum}$, while meeting the constraints
of the area and P2P wire length:

$$\text{Minimize } T_{sum}, \quad \text{subject to } A_{sum} \le A_{const} \text{ and } L_{sum} \le L_{const} \tag{4}$$

$A_{sum}$ is calculated as follows,

$$A_{sum} = \sum_{i=1}^{N} (a_i \times p_i + ma_i) + pa + mma + ba \tag{5}$$

$A_{sum}$ is the total area of the optimized accelerators ($a_i \times p_i$)
with their local memories ($ma_i$), the processor (pa),
the main memory (mma), and the bus (ba).
To calculate $L_{sum}$, we consider the layout design
of the bus-based embedded SoC. First, we put each
accelerator into a fixed rectangular region based on
a certain order. The fixed region and the order are
defined by the designer. Then, we get the relative
center coordinates of $ACC_i$, as we assume $ACC_i$ is a
square and its area is calculated as $a_i \times p_{i\_max} + ma_i$.
It is the worst-case $L_{sum}$ calculation, as $ACC_i$ gets the
maximum parallel degree ($p_{i\_max}$). With the relative
center coordinates of $ACC_i$, we calculate the Manhattan
distance from $ACC_i$ to $ACC_j$ as the wire length of
the P2P channel ($l_{i,j}$)[28]. Thus, $L_{sum}$ is calculated as
follows,

$$L_{sum} = \sum_{i=1}^{N} \sum_{j=1}^{N} (|x_i - x_j| + |y_i - y_j|) \times l_{i,j} \tag{6}$$

where ($x_i$, $y_i$) and ($x_j$, $y_j$) are the relative center
coordinates of $ACC_i$ and $ACC_j$. Besides, the relative
center coordinates of accelerators change, if the order
changes. In this way, the optimization is executed
iteratively under different orders.
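
The constraint check defined by Eqs. (4) to (6) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the list and dictionary layouts, and the sample numbers are assumptions, and the center coordinates are taken as given rather than derived from a layout.

```python
def total_area(acc_area, lm_area, degree, pa, mma, ba):
    """Eq. (5): accelerator copies plus local memories, processor, main memory, and bus."""
    accelerators = sum(a * p + ma for a, ma, p in zip(acc_area, lm_area, degree))
    return accelerators + pa + mma + ba


def total_wire_length(centers, channels):
    """Eq. (6): sum of Manhattan distances over inserted P2P channels.

    centers maps an accelerator index to its (x, y) center;
    channels is the set of (i, j) pairs with l_ij = 1.
    """
    return sum(
        abs(centers[i][0] - centers[j][0]) + abs(centers[i][1] - centers[j][1])
        for i, j in channels
    )


def feasible(a_sum, l_sum, a_const, l_const):
    """Constraints of Eq. (4)."""
    return a_sum <= a_const and l_sum <= l_const


# Two accelerators; ACC1 is duplicated (p_1 = 2) and one P2P channel 1 -> 2 is inserted.
a_sum = total_area([2, 3], [1, 1], [2, 1], pa=5, mma=4, ba=1)   # (4+1) + (3+1) + 10 = 19
l_sum = total_wire_length({1: (0, 0), 2: (3, 4)}, {(1, 2)})      # 3 + 4 = 7
print(a_sum, l_sum, feasible(a_sum, l_sum, a_const=20, l_const=10))
```

Both quantities are cheap to evaluate, which is why the optimization effort in the remaining sections concentrates on estimating $T_{sum}$.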
However, there is no analytical expression for
$T_{sum}$, because the communication patterns of bus-based
embedded SoCs under multiple applications are
different from those of streaming ones[10, 11]:
communication conflicts have to be considered. A
direct method for estimation is to use a cycle-accurate
event simulator, which detects the state of each
accelerator in each cycle and records the current delay.
The complexity of the method is $T_{sum} \times N$, where
$N$ denotes the number of accelerator types in the
SoC. As $T_{sum}$ needs to be estimated many times in
the optimization, the simulator-based method is too
slow to be acceptable. Therefore, we propose a graph-based
method for effective $T_{sum}$ estimation. The method is
presented in the Appendix.

4 Proposed Algorithm

In this section, we develop an algorithm for
simultaneous accelerator parallelization and P2P
interconnect insertion on bus-based embedded
SoCs. The algorithm achieves optimal or suboptimal
solutions with low, linear complexity. First, we
give the framework of the proposed algorithm. Then
we present each step of the algorithm in detail. Finally,
we summarize the algorithm and show its complexity.

4.1 Overview of the proposed algorithm

According to Section 2.2, solutions with different
accelerator parallel degrees and P2P channels have
different optimization effects. A traversal algorithm
arrives at the optimal solution by testing all possible
solutions. Its complexity is $\left(\prod_{i=1}^{N} p_{i\_max}\right) \times 2^D$, where
$N$ and $D$ are the numbers of accelerator
parallelization and P2P interconnect insertion variables in $Var$.
Exponential complexity leads to excessive running
time. Moreover, this approach fails to find solutions
when the SoC size (accelerators and edges) and
constraints become large. To reduce the running time,
the traditional greedy algorithm has a small linear
complexity ($\sum_{i=1}^{N} p_{i\_max} + D$), but it only achieves
a local optimal solution; the traditional Simulated
Annealing Algorithm (SAA) achieves a better or even
optimal solution, given a certain number of iterations.
However, it is still far from effective. In particular, in
order to get close to the optimal solution, the number
of iterations becomes quite large as the SoC size and
constraints increase.
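
To make the gap between the two complexity expressions concrete, the following sketch evaluates both for a small, made-up configuration. The function names and the sample values of $p_{i\_max}$ and $D$ are assumptions for illustration, not data from the paper.

```python
from math import prod


def traversal_size(p_max, d):
    """Number of candidate solutions tested by exhaustive traversal."""
    return prod(p_max) * 2 ** d


def greedy_steps(p_max, d):
    """Upper bound on the one-step increases made by the greedy method."""
    return sum(p_max) + d


p_max = [3, 4, 2, 5]   # hypothetical upper bounds for four accelerator types
d = 6                  # hypothetical number of candidate P2P channels
print(traversal_size(p_max, d))  # 120 * 64 = 7680
print(greedy_steps(p_max, d))    # 14 + 6 = 20
```

Even for this toy case the traversal space is several hundred times larger than the greedy bound, and the ratio grows exponentially with $D$.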
Therefore, we propose an algorithm (see Fig. 5) to
reduce the complexity and achieve better solutions.
The algorithm contains three steps. Step 1 generates
$K$ optimization subspaces ($\{\Omega^k\}$, $k \in [1, K]$) with
corresponding solutions ($Var_{k\_sub}$), where a graph-based
method is used for $T_{sum}$ estimation; Step 2 presents
a greedy method to get an initial solution ($Var_{k\_ini}$)
and an optimization direction ($Od^k$) in each subspace;
Step 3 uses an improved SAA to get an optimized
solution ($Var_{k\_opt}$) based on its initial solution and
the optimization direction in each subspace. Finally,
the optimal solution ($Var_{opt}$) with the minimum $T_{sum}$
is achieved. We illustrate each step in the following
subsections.
4.2 Step 1 Optimization subspace generation

Fig. 5 Framework of the proposed algorithm. Initialization: $p_i = 1$, $l_{i,j} = 0$, $T_{sum} = 0$. Step 1: $T_{sum}$ estimation for the critical path and optimization subspace generation ($\Omega^1, \cdots, \Omega^K$ with $Var_{1\_sub}, \cdots, Var_{K\_sub}$). Step 2: greedy method in each subspace, giving $Var_{k\_ini}$ and $Od^k$. Step 3: improved SAA in each subspace, giving $Var_{k\_opt}$. Output: $Var_{opt}$ with minimum $T_{sum}$.

According to Section 4.1, traditional algorithms (e.g.,
traversal, greedy, and SAA) are not suitable to obtain
the optimal solution, as the optimization space is huge.
Therefore, we need to generate several small subspaces,
where the optimization executes more effectively,
especially on multi-core or multi-threading platforms.
However, it is a big challenge to find an approach
to optimization subspace generation, as accelerator
parallelization and P2P interconnect insertion need
to be considered simultaneously, while including the
optimal solution in the subspaces.
Observing the execution procedure of multiple
accelerators on bus-based embedded SoCs, we focus
on the critical path of the performance bottlenecks,
as the delay of the critical path is the total execution
delay of the SoC ($T_{sum}$). Therefore, $T_{sum}$ cannot be
reduced unless the bottlenecks on the critical path
have been optimized. In other words, one or more
variables ($p_i$ or $l_{i,j}$) related to the bottlenecks on
the critical path are optimized in the optimal solution
($Var_{opt}$). In order to generate optimization subspaces,
we increase each variable related to the bottleneck
on the critical path as follows: $p_i = 2$ or $l_{i,j} = 1$.
Thus, each increased variable ($p_i = 2$ or $l_{i,j} = 1$) and
other original variables ($p_g = 1$, $l_{g,h} = 0$) form a new
solution ($Var_{k\_sub}$) as follows:

$$Var_{k\_sub} = \{p_i = 2 \,\|\, l_{i,j} = 1;\ \text{other } p_g = 1,\ l_{g,h} = 0\}, \quad k \in [1, K];\ i \ne g,\ j \ne h;\ i, j, g, h \in [1, N] \tag{7}$$

$K$ is the number of the new solutions, which is equal
to the number of the variables related to the bottlenecks
on the critical path. The number of variables in each
new solution ($Var_{k\_sub}$) is $N + D$, where $N$ and $D$
are the variable numbers of accelerator parallelization
and P2P interconnect insertion. Thus, each solution
($Var_{k\_sub}$) stands for an optimization subspace ($\Omega^k$),
which means we continue to optimize the variables
based on the solution under the constraints. As one or
more increased variables in these solutions ($Var_{k\_sub}$)
are optimized in the optimal solution ($Var_{opt}$), we get
the optimal solution ($Var_{opt}$) by optimizing in these
subspaces. As some variables may not be related to
the bottlenecks on the critical path, the union of all
subspaces ($\Omega^1 \cup \cdots \cup \Omega^K$) is included in the optimization
space ($\Omega$). The relationships are presented as follows.

$$Var_{opt} \in (\Omega^1 \cup \cdots \cup \Omega^K) \subseteq \Omega \tag{8}$$
Figure 5 presents the workflow of optimization
subspace generation in Step 1. After initialization, Tsum
is estimated to find the critical path of performance
bottlenecks. Then, K optimization subspaces (the new
solutions) are generated based on the variables related
to the bottlenecks on the path.
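
A minimal Python sketch of the subspace generation in Step 1 is given below. It assumes that the variables related to the bottlenecks on the critical path have already been identified by the $T_{sum}$ estimation; the key encoding (`("p", i)` for a parallel degree and `("l", i, j)` for a P2P channel) and the function name are hypothetical. The sketch does not check $p_{i\_max}$, so it assumes every bottleneck accelerator can hold at least two copies.

```python
def generate_subspace_seeds(acc_ids, p2p_candidates, bottleneck_vars):
    """Build one seed solution per bottleneck variable, in the spirit of Eq. (7).

    acc_ids: accelerator indices (one p_i each).
    p2p_candidates: (i, j) pairs with data transmission (one l_ij each).
    bottleneck_vars: keys ("p", i) or ("l", i, j) on the critical path (assumed given).
    """
    base = {("p", i): 1 for i in acc_ids}
    base.update({("l", i, j): 0 for i, j in p2p_candidates})
    seeds = []
    for var in bottleneck_vars:
        seed = dict(base)                      # all other variables keep p = 1, l = 0
        seed[var] = 2 if var[0] == "p" else 1  # p_i = 2 or l_ij = 1
        seeds.append(seed)
    return seeds


seeds = generate_subspace_seeds([1, 2], [(1, 2)], [("p", 2), ("l", 1, 2)])
for seed in seeds:
    print(seed)
```

Each returned dictionary plays the role of one $Var_{k\_sub}$, so the number of seeds equals $K$, and the seeds can be handed to independent workers on a multi-core platform.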

4.3 Step 2 Greedy method in each subspace

With the increased variables gained in Step 1, a
traditional SAA in each subspace is still far from
effective, because its initial solution ($Var_{k\_ini}$) and
the optimization direction ($Od^k$) are chosen randomly,
which leads to local optimal solutions. Therefore, we
need to optimize the initial solution ($Var_{k\_ini}$) and the
optimization direction ($Od^k$) in each subspace before
executing the SAA. We adopt a greedy metric-based
method, where a metric stands for the potential
optimization effect of increasing a variable ($p_i$ or $l_{i,j}$)
by one step ($p_i = p_i + 1$ if $p_i < p_{i\_max}$, or $l_{i,j} = 1$ if
$l_{i,j} = 0$). We calculate the optimization direction and
increase the variable that has the maximum metric in the
initial solution ($Var_{k\_ini}$), until there are no optimization
effects under the constraints. Finally, the optimized
initial solution ($Var_{k\_ini}$) and the optimization direction
($Od^k$) are achieved. As it is a greedy method, the
optimized initial solution ($Var_{k\_ini}$) and the optimization
direction ($Od^k$) are local or even global optimal results,
which are much better than random ones.
Figure 6 presents the algorithm in the $k$-th subspace
($\Omega^k$). First, initial values are assigned to the initial
solution ($Var_{k\_ini}$), $A_{sum}$, $L_{sum}$, and $T_{sum}$. Second,
$T_{sum}$ estimation is done. Third, the metrics and the
optimization direction ($Od^k$) are calculated based on
the critical path of the performance bottlenecks. Then,
the variable with the maximum metric in the initial
solution ($Var_{k\_ini}$) is selected and increased by one step.
Fifth, $A_{sum}$ and $L_{sum}$ are calculated with the new initial
solution ($Var_{k\_ini}$). After that, $T_{sum}$ is estimated, if the
constraints are met ($A_{sum} \le A_{const}$ and $L_{sum} \le L_{const}$);
otherwise, the selected variable is reset ($p_i = p_i - 1$ or
$l_{i,j} = 0$) and its metric is invalidated. Then, we repeat
the steps above until all metrics are invalid. Finally, the
optimized initial solution ($Var_{k\_ini}$) and the optimization
direction ($Od^k$) are achieved. As a greedy method, the
complexity is effectively reduced to $\sum_{i=1}^{N} p_{i\_max} + D$,
where $N$ and $D$ are the variable numbers of accelerator
parallelization and P2P interconnect insertion in the
initial solution ($Var_{k\_ini}$).

Fig. 6 Greedy method in the $k$-th subspace ($\Omega^k$). Flow: initialization ($Var_{k\_ini} = Var_{k\_sub}$, $A_{sum} = L_{sum} = T_{sum} = 0$); $T_{sum}(Var_{k\_ini})$ estimation; metrics and $Od^k$ calculation; if all metrics are invalid, output the optimized $Var_{k\_ini}$ and $Od^k$; otherwise, variable selection and increase, then $A_{sum}$ and $L_{sum}$ calculation; if $A_{sum} \le A_{const}$ and $L_{sum} \le L_{const}$, return to $T_{sum}$ estimation; otherwise, variable reset and metric invalidation.
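
The loop described for Fig. 6 can be sketched as follows. This is a simplified illustration rather than the authors' implementation: `metric_fn` and `feasible_fn` are hypothetical callbacks standing in for the metric calculation (which depends on the $T_{sum}$ estimation) and the checks $A_{sum} \le A_{const}$ and $L_{sum} \le L_{const}$, and a non-positive metric is treated as "no optimization effect".

```python
def greedy_in_subspace(var, upper, metric_fn, feasible_fn):
    """Greedy one-step increases until every metric is invalid.

    var: dict mapping a variable name to its current value (the seed solution).
    upper: dict mapping a variable name to its maximum value (p_i_max, or 1 for l_ij).
    metric_fn(var): returns a dict of metrics for a one-step increase (hypothetical).
    feasible_fn(var): returns True if the area and wire-length constraints hold.
    """
    var = dict(var)
    invalid = set()
    while True:
        # Metrics are recomputed each round, mirroring the T_sum re-estimation.
        metrics = {
            name: m
            for name, m in metric_fn(var).items()
            if name not in invalid and var[name] < upper[name] and m > 0
        }
        if not metrics:              # all metrics invalid: stop
            return var
        best = max(metrics, key=metrics.get)
        var[best] += 1               # variable selection and increase
        if not feasible_fn(var):
            var[best] -= 1           # variable reset
            invalid.add(best)        # metric invalidation


# Toy run with constant metrics and a budget of four units in total.
result = greedy_in_subspace(
    {"p1": 1, "p2": 1},
    {"p1": 3, "p2": 2},
    metric_fn=lambda v: {"p1": 2.0, "p2": 1.0},
    feasible_fn=lambda v: v["p1"] + v["p2"] <= 4,
)
print(result)  # {'p1': 3, 'p2': 1}
```

In the toy run, `p1` is increased twice, then `p2` is tried, rejected by the budget, and invalidated, which ends the loop. Each variable can be increased at most up to its bound, which is the source of the $\sum_{i=1}^{N} p_{i\_max} + D$ bound on the number of steps.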
In the algorithm, $A_{sum}$ and $L_{sum}$ are easily calculated
by Eqs. (5) and (6). However, calculating the
metrics and the optimization direction ($Od^k$) is
more challenging. The metric needs to be calculated
carefully to reflect the architecture-specific tradeoffs
between the optimization effects and the overheads.
The optimization direction ($Od^k$) has a strong effect on
the quality of the solutions generated in the SAA. We
discuss them in the following subsections.
4.3.1 Cost-effectiveness-based metric calculation

To calculate metrics of variables, we make use of
the following observation on accelerator parallelization
and P2P interconnect insertion. When considering
accelerator parallelization, accelerator conflicts are
relieved with area overheads; when considering P2P
interconnect insertion, bus conflicts decrease with
greater P2P wire length, which brings more routability
challenges. Therefore, both these considerations reduce
the total execution delay of the SoC (Tsum ) with
different overheads.
Based on the observation, we define the
cost-effectiveness-based metrics of accelerator
parallelization and P2P interconnect insertion. In
order to compare two kinds of metrics simultaneously,
the effect is defined as $T_{sum}$ reduction on the critical
path. The cost is the overhead ratio of each constraint.
A metric sequence ($Ms^k_q$) stores all cost-effectiveness
metrics ($ms^k_{q,l}$) after the $q$-th iteration in $\Omega^k$ as follows:

$$Ms^k_q = \{ms^k_{q,1}, ms^k_{q,2}, \cdots, ms^k_{q,l}, \cdots\}, \quad l \in [1, N + D],\ q \in [1, Q^k],\ k \in [1, K] \tag{9}$$

where $Q^k$ is the total number of iterations in the $k$-th
subspace ($\Omega^k$) and $q$ stands for the $q$-th iteration.

We calculate the metrics ($ms^k_{q,l}$) of accelerator
parallelization ($p_i$) increased by one step in the $q$-th
iteration as follows:

$$ms^k_{q,l} = \frac{\sum_{m=1}^{M} [te^m_i(p_i) - te^m_i(p_i + 1)] \times flag^m_i}{a_i / (A_{const} - A_{non\text{-}opt})}, \quad i \in [1, N],\ q \in [1, Q^k],\ k \in [1, K] \tag{10}$$

where $te^m_i(p_i)$ stands for the execution delay of $ACC_i$
in $G^m$, which contains input, execution, and output
stages. The delay is related to accelerator parallelization
($p_i$) and it is calculated by Eq. (A2) in Appendix. In
Eq. (10), the numerator is $T_{sum}$ reduction with an extra
$ACC_i$. $flag^m_i = 1$, if $ACC_i$ in $G^m$ is on the critical path;
otherwise, $flag^m_i = 0$. The denominator is the overhead
ratio with an extra $ACC_i$, where $A_{non\text{-}opt}$ is the area of
the non-optimized SoC. It is calculated as follows:

$$A_{non\text{-}opt} = \sum_{i=1}^{N} (a_i + ma_i) + pa + mma + ba \tag{11}$$

$A_{non\text{-}opt}$ is the total area of non-optimized accelerators
($a_i$) with their local memories ($ma_i$), the processor (pa),
the main memory (mma), and the bus (ba).
We calculate the metric ($ms^k_{q,l}$) for increasing the P2P interconnect insertion ($l_{i,j}$) by one step in the $q$-th iteration as follows:

$$ ms^k_{q,l} = \frac{\sum_{m=1}^{M} \left[ tr_{i,j}^m(l_{i,j}) - tr_{i,j}^m(1) \right] \times flag_{i,j}^m}{(|x_i - x_j| + |y_i - y_j|) / L_{const}}, \quad i, j \in [1, N],\ q \in [1, Q^k],\ k \in [1, K] \tag{12} $$

where $tr_{i,j}^m(l_{i,j})$ denotes the input delay for $ACC_j$ to read data from a former accelerator $ACC_i$ in $G^m$. The delay depends on the P2P interconnect insertion ($l_{i,j}$) and is calculated by Eq. (A1) in the Appendix. In Eq. (12), the numerator is the $T_{sum}$ reduction obtained with an extra P2P channel from $ACC_i$ to $ACC_j$: $flag_{i,j}^m = 1$ if the data transmission from $ACC_i$ to $ACC_j$ in $G^m$ is on the critical path; otherwise, $flag_{i,j}^m = 0$. The denominator is the overhead ratio of an extra P2P channel from $ACC_i$ to $ACC_j$. If $l_{i,j} = 1$, then $ms^k_{q,l} = 0$: a P2P channel from $ACC_i$ to $ACC_j$ has already been connected, so no further $T_{sum}$ reduction is achieved.
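The two metrics in Eqs. (10) and (12) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, argument layout, and numeric values are hypothetical, and the per-subgraph delays are assumed to be supplied by the delay model of the Appendix rather than computed here.

```python
def parallelization_metric(te_cur, te_next, on_critical, a_i, a_const, a_non_opt):
    """Eq. (10): critical-path delay reduction per unit of area-overhead ratio.

    te_cur[m], te_next[m]: execution delay of ACC_i in subgraph m with p_i and p_i + 1.
    on_critical[m]: 1 if ACC_i in subgraph m is on the critical path, else 0.
    """
    reduction = sum((t0 - t1) * flag
                    for t0, t1, flag in zip(te_cur, te_next, on_critical))
    overhead_ratio = a_i / (a_const - a_non_opt)
    return reduction / overhead_ratio


def p2p_metric(tr_cur, tr_p2p, on_critical, pos_i, pos_j, l_const, l_ij):
    """Eq. (12): critical-path delay reduction per unit of wire-length ratio."""
    if l_ij == 1:
        return 0.0  # channel already inserted, no further reduction
    reduction = sum((t0 - t1) * flag
                    for t0, t1, flag in zip(tr_cur, tr_p2p, on_critical))
    wire = abs(pos_i[0] - pos_j[0]) + abs(pos_i[1] - pos_j[1])  # Manhattan length
    return reduction / (wire / l_const)


# Hypothetical numbers: two subgraphs, only the first one on the critical path.
m_par = parallelization_metric([120, 80], [70, 50], [1, 0],
                               a_i=0.125, a_const=1.5, a_non_opt=1.0)
m_p2p = p2p_metric([40, 30], [10, 8], [1, 1],
                   pos_i=(0, 0), pos_j=(3, 1), l_const=16, l_ij=0)
print(m_par, m_p2p)
```

Because both metrics divide a delay reduction by a dimensionless overhead ratio, a parallelization step and a P2P insertion step can be ranked in the same list.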
4.3.2 Optimization direction calculation

According to Section 4.1, a traditional SAA generates new solutions randomly, which leads to poor results under a fixed number of iterations. Therefore, we need an effective optimization direction ($Od^k$) for generating new solutions that are likely to be closer to the optimal solution. To calculate the optimization direction ($Od^k$), we make use of the following observations on the variables in the initial solution ($Var^{k\_ini}$) that have positive metrics during the iterations. (1) Increasing the variables with larger metrics helps us get closer to the optimal solution ($Var^{k\_opt}$). (2) As the number of iterations increases, the initial solution ($Var^{k\_ini}$) gets closer to a local optimal one. Thus, better solutions are likely to be achieved if the variables with larger metrics are increased in earlier iteration levels. Therefore, we calculate the optimization direction $Od^k$ as follows, where both the metrics of these variables and their iteration levels are considered:

$$ Od^k = \{od^k_1, od^k_2, \cdots, od^k_l, \cdots\}, \quad od^k_l = \sum_{q=1}^{Q^k} ms^k_{q,l} \times (Q^k - q + 1), \quad l \in [1, N+|L|],\ q \in [1, Q^k],\ k \in [1, K] \tag{13} $$

where $Q^k$ is the total number of iterations in the $k$-th subspace ($\Omega^k$) and $q$ stands for the $q$-th iteration. Each element ($od^k_l$) in the optimization direction ($Od^k$), which corresponds to a variable in the initial solution ($Var^{k\_ini}$), is the weighted sum of the variable's metric in each iteration ($ms^k_{q,l}$). To generate new solutions in the SAA, the elements in the optimization direction ($Od^k$) are transformed into selection probabilities as follows:

$$ od^k_l = \begin{cases} 0, & \text{if } od^k_l = 0; \\ 0.5 \times \left[ od^k_l / (\max(Od^k) + 1) + 1 \right], & \text{else;} \end{cases} \quad od^k_l \in Od^k,\ l \in [1, N+|L|] \tag{14} $$

where each positive element ($od^k_l$) has a high selection probability ($0.5 \times \{od^k_l / [\max(Od^k) + 1] + 1\}$), which is larger than 0.5. It means the corresponding variable has more chances to be chosen and increased for optimization.
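As a minimal sketch of Eqs. (13) and (14), the following Python code computes the optimization direction from a recorded metric history and converts it into selection probabilities. The function names and the sample metric values are hypothetical; the metrics are assumed to be non-negative, as produced by the greedy iterations.

```python
def optimization_direction(ms):
    """Eq. (13): ms[q][l] is the metric of variable l in iteration q (0-based).

    Earlier iterations get larger weights: iteration q (1-based) is weighted
    by Q - q + 1, which is Q - q for a 0-based index.
    """
    q_total = len(ms)
    num_vars = len(ms[0])
    return [sum(ms[q][l] * (q_total - q) for q in range(q_total))
            for l in range(num_vars)]


def selection_probabilities(od):
    """Eq. (14): zero stays zero, positive elements map into (0.5, 1)."""
    top = max(od)
    return [0.0 if v == 0 else 0.5 * (v / (top + 1) + 1) for v in od]


# Hypothetical metric history: 2 iterations, 3 variables.
history = [[4.0, 0.0, 2.0],
           [1.0, 0.0, 3.0]]
od = optimization_direction(history)   # [9.0, 0.0, 7.0]
probs = selection_probabilities(od)
print(od, probs)
```

The "+ 1" in the denominator keeps every probability strictly below 1, so even the variable with the largest weighted metric is not selected deterministically.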
4.4 Step 3 Improved SAA in each subspace

Based on the initial solution ($Var^{k\_ini}$) and the optimization direction ($Od^k$) obtained in Section 4.3, we propose an improved SAA. Compared with the traditional SAA, the proposed one has three improvements: (1) The current solution ($Var^{k\_cur}$) is initialized to the initial solution ($Var^{k\_ini}$), and the optimization direction ($Od^k$) is used to generate new solutions selectively. In this way, the optimal solution is reached rapidly. (2) $T_{sum}$ estimation executes only if the variables in the generated new solution ($Var^{k\_new}$) cannot be increased further under the constraints ($A_{sum} \le A_{const}$ and $L_{sum} \le L_{const}$). In this way, the new solution ($Var^{k\_new}$) gets closer to the optimal one ($Var^{opt}$). (3) The current optimal solution ($Var^{opt}$) generated during iterations is stored.

The improved SAA in the $k$-th subspace ($\Omega^k$) is presented in Fig. 7. First, initial values are assigned to the current solution ($Var^{k\_cur}$), the current optimal solution ($Var^{k\_opt}$), and the simulated annealing temperature (Tem); Tstart is the start temperature of simulated annealing. Second, a new solution ($Var^{k\_new}$) is generated based on the current solution ($Var^{k\_cur}$) and the optimization direction ($Od^k$). $T_{sum}(Var^{k\_new})$ and $T_{sum}(Var^{k\_cur})$ are estimated if the variables in the new solution cannot be increased under the constraints ($A_{sum} \le A_{const}$ and $L_{sum} \le L_{const}$); otherwise, the new solution is generated again. Third, the current solution $Var^{k\_cur}$ and the current optimal solution $Var^{k\_opt}$ are set to the new solution ($Var^{k\_new}$) if $T_{sum}(Var^{k\_new}) < T_{sum}(Var^{k\_cur})$; otherwise, the new solution is accepted conditionally based on a rule, while the current optimal solution ($Var^{k\_opt}$) remains unchanged. Fourth, the simulated annealing temperature (Tem) decreases if the iteration number is reached; otherwise, it remains. These steps are repeated until Tem reaches the stop temperature of simulated annealing (Tstop). Finally, we get the current optimal solution ($Var^{k\_opt}$) in the $k$-th subspace ($\Omega^k$), and the optimal solution ($Var^{opt}$) of all subspaces is also achieved.

Fig. 7 Improved SAA in the $k$-th subspace ($\Omega^k$). Flowchart steps: initialization (Tem set to Tstart; $Var^{k\_opt}$ and $Var^{k\_cur}$ set to $Var^{k\_ini}$); $Var^{k\_new}$ generation based on $Var^{k\_cur}$ and $Od^k$; check whether variables cannot be increased under constraints; $T_{sum}(Var^{k\_new})$ and $T_{sum}(Var^{k\_cur})$ estimation; comparison of the two delays; update of $Var^{k\_opt}$ and $Var^{k\_cur}$ or conditional acceptance of $Var^{k\_new}$; iteration-number check; Tem decrease; comparison of Tem with Tstop; output of the optimized $Var^{k\_opt}$.

In the algorithm, the new solution ($Var^{k\_new}$) generation, the conditional acceptance rule, and the simulated annealing function are presented as follows. Based on the current solution ($Var^{k\_cur}$) and the optimization direction ($Od^k$), the new variable of accelerator parallelization ($p_i^{k\_new}$) in the new solution ($Var^{k\_new}$) is generated as follows:

$$ p_i^{k\_new} = \begin{cases} \min(p_i^{k\_cur} + 1,\ p_i^{max}), & \text{if } rand(1) \le od_i^k; \\ p_i^{k\_cur}, & \text{if } od_i^k \le rand(1) < 0.5 \times (1 + od_i^k); \\ \max(p_i^{k\_cur} - 1,\ 1), & \text{else;} \end{cases} \quad od_i^k \in Od^k,\ i \in [1, N],\ k \in [1, K] \tag{15} $$

where $p_i^{k\_cur}$ is the current variable of accelerator parallelization in the current solution ($Var^{k\_cur}$) and rand(1) is a random probability function ($rand(1) \in (0, 1)$). That is, $p_i^{k\_new} = p_i^{k\_cur}$ or $\max(p_i^{k\_cur} - 1, 1)$ if $od_i^k = 0$, which means $p_i$ in the $k$-th subspace ($\Omega^k$) does not need to be increased. The new variable of P2P interconnect insertion ($l_{i,j}^{k\_new}$) in the new solution ($Var^{k\_new}$) is generated as follows:

$$ l_{i,j}^{k\_new} = \begin{cases} 1, & \text{if } rand(1) \le od_i^k; \\ 0, & \text{else;} \end{cases} \quad od_i^k \in Od^k,\ i, j \in [1, N],\ k \in [1, K] \tag{16} $$

where $l_{i,j}^{k\_new} \equiv 0$ if $od_i^k = 0$, which means $l_{i,j}$ in the $k$-th subspace ($\Omega^k$) does not need to be optimized.
The conditional acceptance rule is presented as follows:

$$ Var^{k\_cur} = \begin{cases} Var^{k\_new}, & \text{if } \exp(-\Delta T_{sum} / Tem) > P; \\ Var^{k\_cur}, & \text{else;} \end{cases} \quad \Delta T_{sum} = T_{sum}(Var^{k\_new}) - T_{sum}(Var^{k\_cur}),\ k \in [1, K] \tag{17} $$

where $P$ is the defined conditional acceptance probability ($P \in (0, 1)$) and $\Delta T_{sum}$ is the performance difference. In Eq. (17), the new solution ($Var^{k\_new}$) becomes harder to accept as the simulated annealing temperature (Tem) decreases.

The simulated annealing function is presented as follows:

$$ Tem = Tem \times \alpha, \quad \alpha \in (0, 1) \tag{18} $$

where $\alpha$ stands for the speed of simulated annealing.
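The generation, acceptance, and cooling rules of Eqs. (15) to (18) can be sketched as small Python functions. This is an illustrative sketch only: the function names are hypothetical, and the random draw rand(1) is passed in as the argument `r` so that each branch can be exercised deterministically.

```python
import math


def next_parallel_degree(p_cur, p_max, od, r):
    """Eq. (15): increase, keep, or decrease p_i depending on the draw r in (0, 1)."""
    if r <= od:
        return min(p_cur + 1, p_max)
    if r < 0.5 * (1 + od):
        return p_cur
    return max(p_cur - 1, 1)


def next_p2p_link(od, r):
    """Eq. (16): insert the P2P channel with probability od."""
    return 1 if r <= od else 0


def accept_worse(t_new, t_cur, tem, p_accept):
    """Eq. (17): conditional acceptance of a new solution that is not better."""
    return math.exp(-(t_new - t_cur) / tem) > p_accept


def cool(tem, alpha):
    """Eq. (18): geometric cooling."""
    return tem * alpha


# A variable with selection probability 0.85 and three different draws.
print(next_parallel_degree(2, 4, 0.85, r=0.30))  # increased
print(next_parallel_degree(2, 4, 0.85, r=0.90))  # kept
print(next_parallel_degree(2, 4, 0.85, r=0.95))  # decreased
# A solution that is 50 cycles slower is accepted at Tem = 100 but not at Tem = 55.
print(accept_worse(1050, 1000, 100.0, 0.5), accept_worse(1050, 1000, cool(100.0, 0.55), 0.5))
```

The last line illustrates the remark after Eq. (17): for the same delay increase, a lower temperature makes acceptance less likely.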

4.5 Summary of the proposed algorithm

In general, the proposed algorithm, which searches all generated subspaces in parallel based on the optimized initial solutions and the optimization directions, achieves or gets close to the optimal solution. Based on the three steps introduced above, the algorithm has a linear complexity:

$$ \left[ 1 + \left( \sum_{i=1}^{N} p_i^{max} + D \right) + Num_K \right] \times K $$

Here, "1" is the $T_{sum}$ estimation for optimization subspace generation; $\left( \sum_{i=1}^{N} p_i^{max} + D \right)$ is the iteration number of the greedy method; $Num_K$ is that of the improved SAA in each subspace. The performance and complexities of the proposed algorithm are evaluated in Section 5.
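To make the iteration count concrete, the expression above can be evaluated directly. The sketch below is only a worked example: the function name and all input values (maximum parallel degrees, the value of D, and the number of subspaces K) are hypothetical, while 16 iterations per subspace matches the setting used later in Section 5.3.

```python
def total_iterations(p_max, d, num_k, k):
    """[1 + (sum of p_i^max + D) + Num_K] * K from the complexity expression."""
    greedy_iterations = sum(p_max) + d
    return (1 + greedy_iterations + num_k) * k


# Hypothetical instance: three accelerators, D = 5, 10 subspaces.
count = total_iterations(p_max=[4, 4, 2], d=5, num_k=16, k=10)
print(count)  # (1 + 15 + 16) * 10
```

Each term grows only linearly with the number of accelerators, the value of D, and the number of subspaces, which is what the linear-complexity claim refers to.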

5 Experimental Evaluation

In this section, we first explain our experimental
configurations and show test results aimed at validating
the system model. Then, we present the effectiveness
of the proposed algorithm under different constraints
and benchmarks. Finally, we analyze the simultaneous
optimization effects with the proposed algorithm under
two factors: Schedule and benchmark patterns.
5.1 Experimental setup

Real application subgraphs are selected from
multimedia processing, digital signal processing,
and so on. The C codes are provided for C2RTL
tools, which can obtain the HDL files of hardware
accelerators. Modelsim and DC compiler acquire each
accelerator’s timing and area information under SMIC
130 nm technology. The accelerator information and
constraints are given to the optimizing algorithms,
which are implemented by Matlab running on an Intel
3 GHz notebook with 4 GB RAM.
The benchmarks include an ECG case[29], a Media case[30], and three random cases (R1–R3). The ECG case contains six types of accelerators: low-pass filter, high-pass filter 1, high-pass filter 2, 64 tps FFT, QRS wave detection, and 128-AES encoder. These accelerators form three application subgraphs: heart-rate data filtering, QRS wave detection, and frequency detection. The Media case contains eight types of accelerators: RGB-to-YCBCR, DCT-ZigZag, quantization, RLE encoder, 64 tps FFT, median filter, JPEG decoder, and YCBCR-to-RGB. These accelerators form three application subgraphs: JPEG encoder, JPEG decoder, and frame-noise reduction. These benchmarks are constructed from over thirty types of accelerators. There are different branches and feedback loops in the benchmarks. The numbers of accelerator types, interconnect edges, and application subgraphs in the random benchmarks are as follows:
• R1: 6 accelerator types and 6 edges in 3 subgraphs;
• R2: 9 accelerator types and 10 edges in 4 subgraphs;
• R3: 20 accelerator types and 23 edges in 6 subgraphs.
5.2 System model validation

Before experimental evaluation, we validate the execution delay and the chip area of the system model on three benchmarks. Modeling results (Model) are compared with RTL results (RTL). Modeling results are simulated in Matlab; RTL results are simulated with Modelsim and synthesized with DC compiler. Table 2 shows that the two kinds of results match well. The average error of execution delay is only 2.38%, while the area difference is less than 4%. We estimate $T_{sum}$ with the worst-case input delays in branch pattern (A), which are presented in the Appendix. Thus, the modeling results of execution time are larger than those achieved with RTL. We calculate $A_{sum}$ by Eq. (5) without the area of small controllers and P2P interfaces. Therefore, the RTL results for chip area are larger than the modeling results.
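The reported average delay error can be checked from the delay columns of Table 2. The sketch below is a hedged reading of how the figure is obtained: normalizing each difference by the model value is an assumption (the text does not state the reference), and the dictionary keys are hypothetical labels for the six table rows.

```python
# (model, rtl) execution delays in cycles, taken from Table 2.
delays = {
    "ECG non-optimized":   (1.69e3, 1.65e3),
    "ECG optimized":       (1.53e3, 1.49e3),
    "Media non-optimized": (4.48e6, 4.39e6),
    "Media optimized":     (3.59e6, 3.51e6),
    "R2 non-optimized":    (2.15e4, 2.09e4),
    "R2 optimized":        (1.77e4, 1.73e4),
}


def relative_error_percent(model, rtl):
    """Difference between model and RTL, relative to the model value (assumption)."""
    return abs(model - rtl) / model * 100.0


errors = [relative_error_percent(m, r) for m, r in delays.values()]
average_error = sum(errors) / len(errors)
print(round(average_error, 2))
```

Under this normalization the six rows average to about 2.38%, in line with the value quoted above, and every model delay exceeds its RTL counterpart, as expected from the worst-case input delays.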
5.3 Proposed algorithm evaluation

Table 2 System model validation.

Benchmark  Optimized variables   Delay (cycle)             Area (mm^2)
                                 Model       RTL           Model    RTL
ECG        Non-optimized         1.69×10^3   1.65×10^3     0.871    0.892
           p1 = 2, l1,2 = 1      1.53×10^3   1.49×10^3     0.907    0.943
Media      Non-optimized         4.48×10^6   4.39×10^6     1.41     1.48
           p1 = 2, l2,3 = 1      3.59×10^6   3.51×10^6     1.47     1.55
R2         Non-optimized         2.15×10^4   2.09×10^4     1.13     1.16
           p2 = 3, l1,4 = 1      1.77×10^4   1.73×10^4     1.27     1.32

The proposed algorithm is compared with the traditional ones: the greedy algorithm, the SAA, and the traversal algorithm. The total iteration number of the traditional SAA is 160, and the iteration number in each subspace of the proposed algorithm ($Num_K$) is 16. The start and stop temperatures of simulated annealing (Tstart and Tstop) are 100 and 10. The defined acceptance probability ($P$) in the conditional acceptance rule is 0.5, and the speed of simulated annealing ($\alpha$) is 0.55. The regions for layout and the order in which to put the accelerators into these regions are fixed.
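With these settings, the number of temperature levels visited by the annealing loop follows directly from Eq. (18). The sketch below is a small worked example; the function name is hypothetical, and stopping as soon as Tem is no longer above Tstop is an assumption about the stop test.

```python
def temperature_levels(t_start, t_stop, alpha):
    """Temperatures visited before the stop temperature is reached (Eq. (18))."""
    levels = []
    tem = t_start
    while tem > t_stop:  # assumed stop test: finish once Tem <= Tstop
        levels.append(tem)
        tem *= alpha
    return levels


levels = temperature_levels(t_start=100.0, t_stop=10.0, alpha=0.55)
print(len(levels), [round(t, 2) for t in levels])
```

Starting from 100, the temperature drops to roughly 55, 30.25, and 16.64 before falling below 10 on the fourth cooling step, so only a handful of temperature levels are explored in each subspace.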
First, we test the performance improvement and complexities of these algorithms in two situations: a single benchmark (R1) under different constraints, and different benchmarks under fixed constraints. Figure 8 shows the performance comparison in the R1 case. The X axis shows the area overheads and the extra wire length of P2P channels, which are used as the optimization constraints. The largest performance difference between the proposed algorithm and the traversal one is 4.60%. On the other hand, the greedy algorithm and the SAA are quite unstable, as they can only achieve local optimal results by random optimization. Compared with the traversal algorithm, their largest performance differences are 16.6% and 8.50%, respectively. Traversal fails at the point of 53% area overhead and 83% extra P2P wire length. Table 3 presents the running time in the R1 case. The constraints are the percentages of area overhead and extra wire length of P2P channels. The longest running time of the proposed algorithm is 13 s, which is shorter than that of the SAA. Moreover, traversal suffers from exponential complexity and fails, as shown in the last row of Table 3, when the constraints become larger.
Fig. 8 Performance improvement under different constraints. (Y axis: performance improvement (%); X axis: area overheads and extra wire length of P2P channels (%); series: Greedy, SAA, Proposed, Traversal.)

Table 3 Running time under different constraints.

Constraint (%)     Running time (s)
Area     P2P       Greedy    SAA     Proposed   Traversal
17       33        0.133     19.2    12.7       103
26       50        0.142     19.2    12.8       785
35       50        0.156     19.2    12.8       4.49×10^3
44       67        0.173     19.3    12.9       3.76×10^4
53       83        0.192     19.2    13.0       Failed

We also compare the performance improvement and complexities of these algorithms on different benchmarks under fixed constraints (area overhead is 35% and extra wire length of P2P channels is 50%). Figure 9 shows the performance comparison in five benchmarks. The performance difference between the proposed algorithm and traversal is 2.33% on average. On the other hand, those of the greedy algorithm and the SAA are 9.93% and 6.83%, compared with traversal. The traversal algorithm fails in the Random3 (R3) and Media cases. Table 4 shows the running time in five benchmarks. The running time of traversal increases exponentially, while that of the proposed algorithm remains stable (less than 17 s) as the benchmark size (accelerators and edges) increases.

Fig. 9 Performance improvement in different benchmarks. (Y axis: performance improvement (%); X axis: benchmarks R1, R2, R3, ECG, Media; series: Greedy, SAA, Proposed, Traversal.)

Table 4 Running time in different benchmarks.

Benchmark   Running time (s)
            Greedy    SAA     Proposed   Traversal
R1          0.156     19.2    12.8       4.49×10^3
R2          0.287     21.9    14.7       6.64×10^4
R3          0.418     23.9    16.3       Failed
ECG         0.201     21.4    14.1       4.73×10^4
Media       0.372     22.8    15.8       Failed
We now compare the proposed algorithm with the SAA under different iteration numbers, as shown in Fig. 10. For the proposed algorithm, the iteration number is the total iteration number of the improved SAA in all subspaces ($Num_K \times K$). The benchmark is the R1 case, and the constraints are fixed as follows: area overhead is 35% and extra wire length of P2P channels is 50%. Compared with the SAA, the proposed algorithm gets the same results with fewer iterations. For example, it gets the optimal result with just 160 iterations, while the SAA needs 640 iterations. With the same iteration numbers, the proposed algorithm usually gets much better results. The performance difference between the two algorithms is 8.50% at most. In general, the proposed algorithm, which searches all subspaces in parallel based on the optimized initial solutions and the optimization directions, is much better than the SAA under limited iteration numbers.

Fig. 10 Performance improvement with different iteration numbers. (Y axis: performance improvement (%); X axis: iteration number, from 20 to 640; series: SAA, Proposed.)

From the validations, we conclude that our proposed algorithm is almost as good as traversal and remains effective across the tested constraints and benchmarks with a small running time.

5.4 Analysis of simultaneous optimization effects under two factors

Finally, we analyze the simultaneous optimization effects with the proposed algorithm under two factors: Schedule and benchmark patterns.

5.4.1 Analysis of Schedule patterns in the R1 case

First, we analyze Schedule patterns. According to the system model in Section 2.2, Schedule contains all application subgraphs and determines the subgraph priorities, which affect the accelerators' priorities and hence the optimization effects. Thus, we use the R1 case to analyze the optimization effects with five kinds of Schedule patterns, as follows. Schedule1 is the periodic pattern, in which all application subgraphs execute in turn repeatedly; Schedule5 is the burst pattern, in which one application subgraph executes continuously. From Schedule1 to Schedule5, the intervals between executions of the same subgraph decrease.

$$ \begin{aligned} Schedule_1 &= \{G^1, G^2, G^3, G^1, G^2, G^3, G^1, G^2, G^3\}; \\ Schedule_2 &= \{G^1, G^2, G^3, G^2, G^1, G^3, G^3, G^1, G^2\}; \\ Schedule_3 &= \{G^1, G^2, G^2, G^2, G^1, G^3, G^1, G^3, G^3\}; \\ Schedule_4 &= \{G^1, G^2, G^2, G^2, G^1, G^1, G^3, G^3, G^3\}; \\ Schedule_5 &= \{G^1, G^1, G^1, G^2, G^2, G^2, G^3, G^3, G^3\}. \end{aligned} $$

Figure 11 presents the simultaneous optimization effects of different patterns under fixed constraints (area overhead is 10% and extra wire length of P2P channels is 20%). The left Y axis shows the total execution delay ($T_{sum}$) and the right Y axis presents the performance improvement. Schedule1 (periodic pattern) has the smallest execution delay (1701 cycles) before optimization (Non-opt.), while Schedule5 (burst pattern) suffers from the worst execution delay (1889 cycles). Moreover, as the intervals of the same subgraphs in different patterns decrease, the performance improvement increases (from 7.59% to 14.8%). Schedule patterns with smaller intervals of the same subgraphs bring worse conflicts, as more accelerators with the same input and execution patterns execute continuously. In this situation, other accelerators need to wait until these accelerators finish. Therefore, the simultaneous optimization that we proposed for conflict reduction is more effective with Schedule patterns in which the same subgraphs have smaller intervals.

Fig. 11 Performance improvement under different Schedule patterns. (Left Y axis: total execution delay (cycle); right Y axis: performance improvement (%); X axis: Schedule pattern, from 1 (periodic) to 5 (burst); series: Non-opt., Optimized, Perf. imp.)

5.4.2 Analysis of benchmark patterns in the Media cases

Now we analyze the benchmark patterns. We use $cm$ to define benchmark patterns as follows:

$$ cm = \left[ \sum_{m=1}^{M} \sum_{i=1}^{N} te_i^m(p_i) \right] \Bigg/ \left[ \sum_{m=1}^{M} \sum_{i=0}^{N} \sum_{j=1}^{N} tr_{i,j}^m(l_{i,j}) \right] \tag{19} $$

where $cm$ is the ratio of the total accelerator execution delays (computing related) to the total data transmission delays (memory related). $te_i^m(p_i)$ stands for the execution delay of $ACC_i$ in $G^m$, which contains input, execution, and output stages. The delay is related to accelerator parallelization ($p_i$). $tr_{i,j}^m(l_{i,j})$ denotes the input delay for $ACC_j$ to read data from a former accelerator $ACC_i$ in $G^m$. The delay is related to P2P interconnect insertion ($l_{i,j}$). The two delays are calculated by Eqs. (A1) and (A2) in the Appendix. Thus, the benchmark is computing-intensive if $cm > 1$, which means that the accelerator execution delay dominates the total execution delay of the SoC ($T_{sum}$); otherwise, the benchmark is memory-intensive, which means that the data transmission delay dominates $T_{sum}$.
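The ratio in Eq. (19) is simple to compute once the per-subgraph delays are available. The sketch below assumes the execution and transmission delays have already been evaluated (for example with Eqs. (A1) and (A2)); the function names and all delay values are hypothetical.

```python
def computation_to_memory_ratio(exec_delays, transfer_delays):
    """Eq. (19): total execution delay divided by total transmission delay.

    exec_delays[m] lists te_i^m for the accelerators of subgraph m;
    transfer_delays[m] lists tr_{i,j}^m for the edges of subgraph m.
    """
    total_exec = sum(sum(per_graph) for per_graph in exec_delays)
    total_transfer = sum(sum(per_graph) for per_graph in transfer_delays)
    return total_exec / total_transfer


def benchmark_pattern(cm):
    """Classification rule stated after Eq. (19)."""
    return "computing-intensive" if cm > 1 else "memory-intensive"


# Hypothetical delays (cycles) for two subgraphs.
cm = computation_to_memory_ratio(exec_delays=[[300, 200], [250]],
                                 transfer_delays=[[60, 40], [50]])
print(cm, benchmark_pattern(cm))
```

Since both delay sets depend on the optimization variables, $cm$ is not a fixed property of a benchmark: adding parallel accelerators lowers the numerator, while adding P2P channels lowers the denominator.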


We analyze the benchmark patterns in the Media case (computing-intensive), using the decomposed performance improvement, which is defined as follows. Because accelerator parallelization and P2P interconnect insertion interact with each other, we first measure the performance improvement obtained with each single technique (accelerator parallelization or P2P interconnect insertion). Then, we normalize the sum of the two performance improvements and take the percentage contributed by each technique as its decomposed performance improvement. Thus, a technique with a larger decomposed performance improvement is more effective for optimization.
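The normalization described above amounts to a two-way percentage split. The following sketch shows one way to compute it; the function name and the two single-technique improvements are hypothetical values chosen for illustration.

```python
def decomposed_improvement(imp_parallelization, imp_p2p):
    """Share (%) of each single-technique improvement in their normalized sum."""
    total = imp_parallelization + imp_p2p
    return (100.0 * imp_parallelization / total,
            100.0 * imp_p2p / total)


# Hypothetical single-technique improvements of 17% and 3%.
share_parallel, share_p2p = decomposed_improvement(17.0, 3.0)
print(share_parallel, share_p2p)
```

By construction the two shares always sum to 100%, so the decomposed values indicate only the relative usefulness of the techniques, not the absolute improvement.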
In Fig. 12, the percentage of area overhead is shown on the X axis, while the extra wire length of P2P channels is fixed at 100%. The left Y axis shows the decomposed performance improvement, and the right Y axis presents the computing-to-memory ratio of the benchmark. At the point of 10% area overhead, $cm$ of the Media case is as high as 5.58, and the accelerator execution delay is the main bottleneck. In this situation, accelerator parallelization is much more useful than P2P interconnect insertion. As the area overhead increases, $cm$ of the Media case decreases and the decomposed performance improvement of P2P interconnect insertion becomes larger. However, at the largest area overheads (50% and 60%), the decomposed performance improvements no longer change, as all the accelerators have reached their maximum parallel degrees and all the edges on the critical path have P2P channels. The memory-intensive benchmarks (R2 case, etc.) behave similarly.

Fig. 12 Decomposed performance improvement in the Media case (the extra wire length of P2P channels is fixed to 100%). (Left Y axis: decomposed performance improvement (%); right Y axis: computation-to-memory ratio ($cm$); X axis: area overhead (%); series: accelerator parallelization, P2P interconnect insertion, $cm$.)

For further exploration, we analyze the constraint of P2P wire length under a fixed area overhead. Figure 13 presents the performance improvement of the Media case under 30% area overhead. The percentages of extra P2P wire length are shown on the X axis. As the percentage of P2P wire length increases (from 10% to 40%), the performance improvement increases. The optimal value is 43.2%, at the 40% point. The performance cannot be improved any more when the P2P wire length is larger than 40%, which means the additional P2P wire length is wasted. The benchmarks behave similarly when the P2P wire length constraint is fixed instead. Thus, when one constraint (area overhead or extra wire length of P2P channels) is fixed, there exists a smallest value of the other at which the optimal performance improvement is achieved. In other words, we achieve the optimal performance improvement with the smallest overheads. Therefore, designers can choose suitable constraints for specific simultaneous optimization effects on bus-based embedded SoCs.

Fig. 13 Performance improvement in the Media case (the area overhead is fixed to 30%). (Y axis: performance improvement (%); X axis: extra wire length of P2P channels (%); regions: Sub-optimal, Optimal, Wasted.)

6 Conclusions

To meet the performance challenges of low-power embedded bus-based SoCs, this paper focuses on developing techniques for simultaneous accelerator parallelization and P2P interconnect insertion. Both provide architecture-level design controls to enhance performance. Since joint optimization leads to a prohibitively large design space to explore, we propose a system model, and an algorithm that searches all optimization subspaces in parallel based on optimized initial solutions and optimization directions. Experimental results show that the performance difference between our proposed algorithm and the optimal result is only 2.33% on average, while the running time of the algorithm is less than 17 s.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61271269), the National High-Tech Research and Development (863) Program (No. 2013AA01320), and the Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (No. YETP0102).

References
[1] P. Ma, P. Liu, K. Li, Y. Zou, A. An, Y. Wang, and Y. Hao,
A parallel low latency bus on chip for packet processing
mpsoc, in International Conference on Solid-State and
Integrated Circuit Technology (ICSICT), 2010, pp. 545–547.
[2] S. Ahmedy, Z. Wangy, M.Klaibery, S. Ahl, M. Roblewskiy,
and S. Simon, Parallel hardware architecture for jpeg-ls
based on domain decomposition, Proc. SPIE, Applications
of Digital Image Processing, vol. 8499, no. 14, pp. 1–11,
2012.
[3] S. R. Sridhara, M. DiRenzo, S. Lingam, S. J. Lee,
R. Blázquez, J. Maxey, S. Ghanem, Y. H. Lee, R. Abdallah,
P. Singh, et al., Microwatt processor platform for medical
system-on-chip applications, IEEE Journal of Solid-State
Circuits (JSSC), vol. 46, no. 4, pp. 721–730, 2011.
[4] J. Kwong and A. P. Chandrakasan, An energy-efficient
biomedical signal processing platform, IEEE Journal of
Solid-State Circuits (JSSC), vol. 46, no. 7, pp. 1742–1753,
2011.
[5] F. Zhang, Y. Zhang, J. Silver, Y. Shakhsheer, M. Nagaraju,
A. Klinefelter, J. N. Pandey, J. Boley, E. J. Carlson,
A. Shrivastava, et al., A batteryless 19 µW mics/ism-band
energy harvesting body area sensor node soc, in IEEE
International Solid-state Circuits Conference (ISSCC),
2012, pp. 298–300.
[6] N. Goulding-Hotta, J. Sampson, Q. Zheng, V. Bhatt,
J. Auricchio, S. Swanson, and M. B. Taylor, Greendroid:
An architecture for the dark silicon age, in Asia and South
Pacific Design Automation Conference (ASP-DAC), 2012,
pp. 100–105.
[7] R. Corvino, E. Diken, A. Gamatie, and L. Jozwiak,
Transformation-based exploration of data parallel
architecture for customizable hardware: A jpeg encoder
case study, in Euromicro Conference on Digital System
Design (DSD), 2012, pp. 774–781.
[8] J. Haris and P. Sri, Synthesis of heterogeneous pipelined
multiprocessor systems using ilp: Jpeg case study, in
International Conference on Hardware-Software Codesign
and System Synthesis (CODES+ISSS), 2008, pp. 1–6.
[9] N. Belhadj, N. Bahri, M. B. Ayed, Z. Marrakchi, and
H. Mehrez, Data level parallelism for h264/avc baseline
intra-prediction chain on mpsoc, in Multi-Conference on
Systems, Signals and Devices (SSD), 2013, pp. 1–4.
[10] A. Hagiescu, W. F. Wong, D. F. Bacon, and R. Rabbah, A
computing origami: Folding streams in fpgas, in Design
Automation Conference (DAC), 2009, pp. 282–287.
[11] S. Li, Y. Liu, X. Hu, X. He, Y. Zhang, P. Zhang, and
H. Yang, Optimal partition with block-level parallelization
in c-to-rtl synthesis for streaming applications, in Asia and
South Pacific Design Automation Conference (ASP-DAC),
2013, pp. 225–230.

[12] W. Zuo, Y. Liang, P. Li, K. Rupnow, D. Chen, and J. Cong,
Improving high level synthesis optimization opportunity
through polyhedral transformations, in Proceedings of
the ACM/SIGDA International Symposium on Field
Programmable Gate Arrays, 2013, pp. 92–97.
[13] D. Vainbrand and R. Ginosar, Network-on-chip
architectures for neural networks, in International
Symposium on Networks-on-chip (NOCS), 2007, pp.
135–144.
[14] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar,
S. Stergiou, L. Benini, and G. D. Micheli, Noc synthesis
flow for customized domain specific multiprocessor
systems-on-chip, IEEE Transactions on Parallel and
Distributed Systems (TPDS), vol. 16, no. 2, pp. 113–129,
2005.
[15] H. G. Lee, U. Y. Ogras, R. Marculescu, and N. Chang,
Design space exploration and prototyping for on-chip multimedia applications, in Design Automation
Conference (DAC), 2006, pp. 137–142.
[16] J. Gladigau, A. Gerstlauer, C. Haubelt, M. Streubühr,
and J. Teich, A system-level synthesis approach from
formal application models to generic bus-based mpsocs, in
International Conference on Embedded Computer Systems
(SAMOS), 2010, pp. 118–125.
[17] M. Hempstead, G. Y. Wei, and D. Brooks, An accelerator-based wireless sensor network processor in 130 nm cmos,
IEEE Journal on Emerging and Selected Topics in Circuits
and Systems (JETCAS), vol. 1, no. 2, pp. 193–202, 2011.
[18] R. Zahir, M. Ewert, and H. Seshadri, The medfield
smartphone: Intel architecture in a handheld form factor,
IEEE Micro, vol. 33, no. 6, pp. 38–46, 2013.
[19] B. Rose, Samsung’s 8-core exynos 5 octa processor:
Your next phone will be fast, http://gizmodo.com/
5974528/samsungs-new-exynos-processor-just-went-octa,
2013.
[20] P. Hauser and H. Olivier, Connected device platform, US
Patent US20130303087A1, Nov. 14, 2013.
[21] R. Bassam and M. Toni, Home automation system:
A cheap and open-source alternative to control
household appliances, http://www.diva-portal.org/smash/
get/diva2:679674/FULLTEXT01.pdf, 2013.
[22] H. G. Lee, N. Chang, U. Y. Ogras, and R. Marculescu,
On-chip communication architecture exploration: A
quantitative evaluation of point-to-point, bus, and network-on-chip approaches, ACM Transactions on Design
Automation of Electronic Systems (TODAES), vol. 12, no.
3, 2007.
[23] S. Pasricha, N. Dutt, and M. Ben-Romdhane, Constraint-driven bus matrix synthesis for mpsoc, in Asia and South
Pacific Design Automation Conference (ASP-DAC), 2006,
pp. 30–35.
[24] S. Tan, F. Qiao, B. Xia, H. Yang, and H. Wang, A
functional model of systemc-based mpeg-2 decoder with
heterogeneous multi-ip-cores and hybrid-interconnections
architecture, in International Congress on Image and
Signal Processing (CISP), 2009, pp. 1–5.

[25] C. Pham-Quoc, J. Heisswolf, S. Werner, Z. Al-Ars,
J. Becker, and K. Bertels, Hybrid interconnect design
for heterogeneous hardware accelerators, in Design,
Automation and Test in Europe Conference and Exhibition
(DATE), 2013, pp. 843–846.
[26] D. Vainbrand and R. Ginosar, Network-on-chip
architectures for neural networks, in Symposium on
Networks-on-Chip (NOCS), 2010, pp. 135–144.
[27] W. Zhu, L. Liu, S. Yin, Y. Dong, S. Wei, E. Y. Tang,
J. Song, and J. Peng, A 65 nm uneven-dual-core soc based
platform for multi-device collaborative computing, in
International Symposium on Circuits and Systems (ISCAS),
2014, pp. 2527–2530.
[28] Y. Wei, C. Sze, N. Viswanathan, Z. Li, C. J. Alpert,
L. Reddy, A. D. Huber, G. E. Tellez, D. Keller, and
S. S. Sapatnekar, Glare: Global and local wiring aware
routability evaluation, in Design Automation Conference
(DAC), 2012, pp. 768–773.
[29] MIT, 48 half-hour excerpts of two-channel ambulatory
ecg recordings, http://www.physionet.org/physiobank/
database/mitdb/, 2013.
[30] Y. Zhang, Image Engineering (I) Image Processing (2nd
ed.). Beijing, China: Tsinghua University Press, 2009.

Appendix: Graph-based $T_{sum}$ estimation

A graph-based method for effective $T_{sum}$ estimation is presented as follows.

We use a directed acyclic graph $G^{time}(V^{time}, E^{time})$ to describe the delays of data transmission and accelerator execution, as shown in Fig. A1. $G^{time}$ is obtained by merging all subgraphs according to Schedule. $V^{time}$ contains all accelerators. $E^{time}$ is the edge set, where each edge stands for the delay between the accelerators it connects. $E^{time}$ contains three kinds of delays: data transmission delays within each subgraph, and bus and accelerator conflict delays among subgraphs. Based on $G^{time}$, $T_{sum}$ is the delay of the critical path (from Start to End), that is, the moment at which all accelerators in Schedule have finished. The problem of finding the critical path in a directed acyclic graph ($G^{time}$) is solved by a topological sorting algorithm, whose complexity is linear: $N_{num} + L_{num}$, where $N_{num}$ is the number of accelerators in $V^{time}$ and $L_{num}$ is the number of edges in $E^{time}$.

To estimate $T_{sum}$, we need to calculate all the delays in $E^{time}$. In the following, we first define two basic delays and then calculate the three kinds of delays.

Fig. A2 presents the two basic delays. One is the input delay and the other is the execution delay.

$tr^m_{h,i}$ denotes the input delay for ACC$_i$ to read data from the main memory (if $h = 0$) or from a former accelerator ACC$_h$ in $G^m$. The delay is related to P2P interconnect insertion ($l_{i,j}$) and is calculated as follows:

$$
tr^m_{h,i}(l_{h,i}) =
\begin{cases}
bt \times d^m_{h,i} + bset, & \text{if } l_{h,i} = 0 \,\|\, h = 0; \\
pt \times d^m_{h,i}, & \text{else};
\end{cases}
\qquad i \in [1, N],\ h \in \{0, [1, N]\},\ m \in [1, M]
\tag{A1}
$$

The delay includes the transmission delay ($bt \times d^m_{h,i}$) and the setting delay ($bset$) of the bus if there is no P2P channel ($l_{h,i} = 0$) or if ACC$_i$ reads data from the main memory ($h = 0$). Otherwise, the delay is equal to the transmission delay of a P2P channel ($pt \times d^m_{h,i}$).
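As a minimal sketch of Eq. (A1), the case distinction can be written as a small Python function. The function name, the argument names, and the numeric values are illustrative assumptions and do not come from the paper; only the formula itself follows Eq. (A1).

```python
def input_delay(d, has_p2p, from_memory, bt, pt, bset):
    """Input delay tr^m_{h,i} following Eq. (A1).

    d           -- data volume d^m_{h,i} to be read
    has_p2p     -- True if a P2P channel is inserted (l_{h,i} = 1)
    from_memory -- True if the source is the main memory (h = 0)
    bt, pt      -- per-unit transmission delay of the bus / P2P channel
    bset        -- setting delay of the bus
    """
    if from_memory or not has_p2p:
        # Bus case: transmission delay plus the bus setting delay.
        return bt * d + bset
    # P2P case: only the channel transmission delay.
    return pt * d


# Hypothetical parameters, chosen only to exercise both branches.
bus_read = input_delay(d=100, has_p2p=False, from_memory=False, bt=2, pt=1, bset=10)
p2p_read = input_delay(d=100, has_p2p=True, from_memory=False, bt=2, pt=1, bset=10)
mem_read = input_delay(d=100, has_p2p=True, from_memory=True, bt=2, pt=1, bset=10)
```

With these made-up values the bus read costs 210 time units and the P2P read costs 100. A read from main memory always takes the bus branch, regardless of the P2P flag.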
Fig. A1 Graph-based $T_{sum}$ estimation. (A) Separate application subgraphs; (B) Merged delay graph ($G^{time}$), with Schedule = $\{G^1, G^2, G^3, G^4\}$. Legend: bus path, P2P path, feedback loop, data transmission delay, bus conflict delay, accelerator conflict delay, critical path.

Fig. A2 Two basic delays. (A) Input delay; (B) Execution delay.

$te^m_i$ stands for the execution delay of ACC$_i$ in $G^m$, which contains input, execution, and output stages. The delay is related to accelerator parallelization ($p_i$) and is calculated as follows:

$$
te^m_i(p_i) = (in_i + exe_i + out_i) \times \lceil n^m_i / p_i \rceil + \max(in_i, out_i) \times [(n^m_i - 1) \,\%\, p_i],
\qquad p_i \in [1, p_i^{max}],\ i \in [1, N],\ m \in [1, M]
\tag{A2}
$$

The delay consists of two parts. The first part is the total execution delay of ACC$_i$ with parallel degree ($p_i$), where $\lceil n^m_i / p_i \rceil$ is the number of execution times. The second part is the communication conflict delays, where $(n^m_i - 1) \,\%\, p_i$ is the number of conflicts.
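The two parts of Eq. (A2) can be checked numerically with a short sketch. The function and argument names below are assumptions made for illustration, and the stage delays are made-up values; the arithmetic follows Eq. (A2) as written above.

```python
def execution_delay(n, p, t_in, t_exe, t_out):
    """Execution delay te^m_i(p_i) following Eq. (A2).

    n     -- n^m_i in Eq. (A2)
    p     -- parallel degree p_i (p >= 1)
    t_in, t_exe, t_out -- delays of the input, execution, and output stages
    """
    if p < 1:
        raise ValueError("parallel degree must be at least 1")
    rounds = -(-n // p)        # integer ceiling of n / p
    conflicts = (n - 1) % p    # number of communication conflicts
    return (t_in + t_exe + t_out) * rounds + max(t_in, t_out) * conflicts


# Hypothetical stage delays: in = 2, exe = 10, out = 3, with n = 8.
serial = execution_delay(8, 1, 2, 10, 3)     # 8 rounds, 0 conflicts
three_way = execution_delay(8, 3, 2, 10, 3)  # 3 rounds, 1 conflict
four_way = execution_delay(8, 4, 2, 10, 3)   # 2 rounds, 3 conflicts
```

In this example the delay falls from 120 (serial) to 48 (three-way) and 39 (four-way). The conflict term grows with the parallel degree, so a higher degree gives diminishing returns.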
According to the basic delays, we calculate three kinds of delays: data transmission delays, and bus and accelerator conflict delays. $w^{m,m}_{i,j}$ denotes the data transmission delay for ACC$_j$ to read data from ACC$_i$ in $G^m$. In general, it is calculated as follows:

$$
w^{m,m}_{i,j} = te^m_i + tr^m_{i,j}, \qquad i, j \in [1, N],\ m \in [1, M]
\tag{A3}
$$

which includes the execution delay of ACC$_i$ ($te^m_i$) and the input delay for ACC$_j$ to read data from ACC$_i$ ($tr^m_{i,j}$).
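To illustrate how edge weights of the form of Eq. (A3) feed the critical-path computation described at the start of this appendix, the following sketch computes the longest Start-to-End path of a small delay graph with a topological sort (Kahn's algorithm), which visits each node and edge once. The graph, the delay values, and the weights chosen for the Start and End edges are assumptions made for illustration. For simplicity every internal edge uses the general form of Eq. (A3), without the branch-pattern or conflict adjustments.

```python
from collections import defaultdict, deque


def critical_path_delay(edges, start, end):
    """Return the longest-path delay from start to end in a DAG.

    edges is an iterable of (src, dst, delay) triples.
    """
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = {start, end}
    for src, dst, delay in edges:
        succ[src].append((dst, delay))
        indegree[dst] += 1
        nodes.update((src, dst))

    # Kahn's topological sort: each node and edge is handled once.
    ready = deque(node for node in nodes if indegree[node] == 0)
    longest = {node: float("-inf") for node in nodes}
    longest[start] = 0
    processed = 0
    while ready:
        node = ready.popleft()
        processed += 1
        for nxt, delay in succ[node]:
            longest[nxt] = max(longest[nxt], longest[node] + delay)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if processed != len(nodes):
        raise ValueError("delay graph must be acyclic")
    return longest[end]


# Hypothetical basic delays for three accelerators in one subgraph.
te = {1: 5, 2: 4, 3: 6}                       # execution delays
tr = {(0, 1): 3, (1, 2): 2, (1, 3): 1}        # input delays (0 = main memory)

edges = [
    ("Start", 1, tr[(0, 1)]),         # assumed: Start edge carries the memory read
    (1, 2, te[1] + tr[(1, 2)]),       # Eq. (A3)
    (1, 3, te[1] + tr[(1, 3)]),       # Eq. (A3)
    (2, "End", te[2]),                # assumed: End edge carries the last execution
    (3, "End", te[3]),
]
t_sum = critical_path_delay(edges, "Start", "End")
```

Here the path Start, ACC$_1$, ACC$_3$, End is critical, with a total delay of 15.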
Data transmission delays are calculated in different ways if the edges are in branch patterns. Figure A3 presents two kinds of branch patterns. In pattern (A), ACC$_h$ reads data from both ACC$_i$ ($w^{m,m}_{i,h}$) and ACC$_j$ ($w^{m,m}_{j,h}$). We calculate the delays as follows:

$$
\begin{aligned}
w^{m,m}_{i,h} &= te^m_i + (tr^m_{i,h} + tr^m_{j,h}), \\
w^{m,m}_{j,h} &= te^m_j + (tr^m_{i,h} + tr^m_{j,h}),
\end{aligned}
\qquad i, j, h \in [1, N],\ m \in [1, M]
\tag{A4}
$$

Each delay contains two parts. One is the execution delay of the former accelerator ($te^m_i$ or $te^m_j$). The other is the input delay for ACC$_h$ to read data from both ACC$_i$ and ACC$_j$ ($tr^m_{i,h} + tr^m_{j,h}$). It is the worst-case input delay, as the reading conflict between ACC$_i$ and ACC$_j$ is considered.

In pattern (B), ACC$_j$ and ACC$_h$ read data from the same accelerator ACC$_i$. ACC$_j$ has a higher graph priority than ACC$_h$ ($j < h$). Therefore, the data transmission delay ($w^{m,m}_{i,h}$) for ACC$_h$ to read data from ACC$_i$ contains an extra part: the input delay for ACC$_j$ to read data from ACC$_i$ ($tr^m_{i,j}$). It is calculated as follows:

$$
w^{m,m}_{i,h} = te^m_i + tr^m_{i,j} + tr^m_{i,h}, \qquad i, j, h \in [1, N],\ m \in [1, M]
\tag{A5}
$$

The delay calculation in branch patterns is also suitable for application subgraphs with N branches (N > 2).

Bus conflict delay: $w^{m,n}_{i,j}$ denotes the bus conflict delay for both ACC$_i$ and ACC$_j$ to read data through the bus. As ACC$_i$ has a higher graph priority than ACC$_j$, ACC$_j$ reads data through the bus after ACC$_i$ has finished. The delay is equal to $tr^m_{h,j}$ ($h = 0$) in Eq. (A1), which is the input delay for ACC$_j$ to read data through the bus.

Accelerator conflict delay: $w^{m,n}_{i,i}$ stands for the accelerator conflict delay for ACC$_i$ to execute in both $G^m$ and $G^n$. As ACC$_i$ in $G^m$ has a higher graph priority than it has in $G^n$, the delay is calculated as follows:

$$
w^{m,n}_{i,i} = te^m_i + \sum_{j=1}^{N} tr^m_{i,j} + \sum_{h=0}^{N} tr^n_{h,i}, \qquad i \in [1, N],\ m, n \in [1, M]
\tag{A6}
$$

The delay consists of three parts. The first part is the execution delay of ACC$_i$ ($te^m_i$). The second part is the input delay for all other accelerators to read data from ACC$_i$ in $G^m$ ($\sum_{j=1}^{N} tr^m_{i,j}$). The third part is the input delay for ACC$_i$ to read data from the main memory or other accelerators in $G^n$ ($\sum_{h=0}^{N} tr^n_{h,i}$).

Fig. A3 Delay calculation in branch patterns. Panels (A) and (B).

Daming Zhang received his BEng degree from Tsinghua University in 2010 and is now a PhD candidate at Tsinghua University. His research interests are high-level synthesis and architecture design of embedded systems. He is now working on a project on self-powered WSN sensor platform design.

Shuangchen Li received his BEng and MEng degrees from Tsinghua University in 2011 and 2014, respectively. He is a PhD candidate at the University of California, Santa Barbara. His research interests are system-level synthesis, nonvolatile processor-based energy harvesting systems, and nonvolatile processors.

Huazhong Yang received his BEng degree
in 1989, and his MEng and PhD degrees in
1993 and 1998, respectively, all from
Tsinghua University. In 1993, he joined
the Department of Electronic Engineering,
Tsinghua University, where he has been a
full professor since 1998. Dr. Yang is
a specially-appointed professor of the
Cheung Kong Scholars Program. His current interests include
wireless sensor networks, data converters, parallel circuit
simulation algorithms, and nonvolatile processors and energy-harvesting circuits. Dr. Yang has authored and co-authored over
300 technical papers and 70 granted patents.


Yongpan Liu received his BEng, MEng,
and PhD degrees from Tsinghua University
in 1999, 2002, and 2007, respectively. He is
now an associate professor in the Department
of Electronic Engineering, Tsinghua
University. His main research interests
include low-power VLSI design, emerging
device-based circuits and systems, and
design automation. He has published over 50 peer-reviewed
conference and journal papers and led more than six SoC design projects
for sensor applications.

Tongda Wu received his BEng degree
from Tsinghua University in 2014 and
is now a master's student at Tsinghua
University. His research interests are
low-power VLSI designs and self-powered
SoC designs.

