Design Methodology for Energy Efficient Unmanned Aerial Vehicles by He, Jingyu et al.
Jingyu He1, Yao Xiao1, Paul Bogdan1, and Corina Bogdan2
1Department of Electrical Engineering, University of Southern California
2Department of Electrical and Computer Engineering, Northeastern University
Abstract: In this paper, we present a load-balancing approach that analyzes and partitions the UAV perception and navigation intelligence (PNI) code for parallel execution and assigns each parallel computational task to a processing element in a network-on-chip (NoC) architecture such that the total communication energy is minimized and congestion is reduced. First, we construct a data dependency graph (DDG) by converting the PNI high-level program into Low Level Virtual Machine (LLVM) Intermediate Representation (IR). Second, we propose a scheduling algorithm that partitions the PNI application into clusters such that (1) inter-cluster communication is minimized, (2) NoC energy is reduced, and (3) the workloads of different cores are balanced for maximum parallel execution. Finally, an energy-aware mapping scheme assigns clusters onto tile-based NoCs. We validate this approach with a drone self-navigation application; the experimental results show up to 8.4x energy reduction and 10.5x performance speedup.
I. Introduction
Unmanned aerial vehicles (UAVs) are emerging as critical tools for mapping large areas, patrolling, and search and rescue. These tasks are usually dangerous or repetitive and must be carried out in extreme conditions, making them ideal for autonomous drones. Self-navigation and collision-avoidance applications are key for UAVs to operate on their own, and they rely on high-performance, low-power edge computing. The performance of flight control applications is hard to overstate. In a recent investigation [12], which followed two Boeing 737 Max crashes in 2018 and 2019 that killed 346 people, the Federal Aviation Administration found that a specific flight control computer chip lacked sufficient data-processing speed. At the same time, low-power design is critical for UAVs as well. One reason is that high power dissipation makes it very challenging to keep the hardware at a suitable temperature. Another is that batteries are the only energy source for drones, which limits their running time.
To push the performance and energy boundary of systems-on-chip, Dally and Towles [7] proposed the tile-based network-on-chip (NoC) as an architecture for scalable and low-power on-chip communication. Such chips use tiles such as CPUs, GPUs, ASICs, and memory as building blocks. A standard interface embedded into each tile routes flits for communication. There have been many previous studies on energy-aware NoC designs. In contrast to prior NoC work, the goal of this paper is to investigate the parallelization of the UAV perception and navigation intelligence while taking the computation and communication power consumption into consideration. As shown in Fig. 1, we first compile the navigation program into LLVM IR and construct the DDG, where each node denotes a useful instruction with its power consumption and each edge represents a data dependency whose weight is the data size times the latency. Second, based on the DDG, we propose a scheduling algorithm that partitions the PNI application into clusters such that (1) inter-cluster communication is minimized, (2) NoC energy is reduced, and (3) the workloads of different cores are balanced for maximum parallel execution. Finally, we incorporate topological sort into our energy-aware mapping scheme to further reduce the static power consumption caused by congestion.
To this end, the main contributions of this paper are as follows:
• To the best of our knowledge, our work is the first to incorporate the static energy consumption analysis of an application into compiler-based task partitioning.
• We propose a mapping strategy that considers the timing of inter-core communications in addition to their volume, reducing the congestion time and the static energy consumption of hardware resources.
The rest of the paper is organized as follows: Section II
discusses the related work. Section III introduces the ba-
sics of UAV control. Section IV illustrates the energy
model for NoCs, the load-balancing and energy-aware
community detection algorithm, and the low-power map-
ping. Section V validates the framework and shows ex-
perimental results compared to the baseline model.
arXiv:1909.11238v1 [cs.DC] 25 Sep 2019
Fig. 1.: Overview of the UAV intelligent processing architecture workflow. (A): A UAV and its control basics. (B): The perception and navigation intelligence application (a high-level program) is compiled into an LLVM IR trace through compiler analysis. This removes the unnecessary computation and communication overhead of high-level programs. (C): We transform the trace into the DDG and detect communities. (D): Each processing community is mapped onto an NoC processing element in such a way that its communication energy is minimized and congestion is reduced. The unused cores are clock-gated to save energy, indicated by the blue tiles.
II. Related Work
There has been a significant amount of previous re-
search on energy-aware and load-balancing scheduling and
mapping on multicore embedded systems. From a math-
ematical and control perspective, Bogdan et al. in [4, 5]
provide a complex approach to dynamically character-
ize the workload of multicore systems for performance
and power optimization. Xiao et al. propose a complex
network-inspired application partitioning tool to improve
multicore parallelization [15]. Tan et al. develop a low-
power customizable manycore architecture for wearables
using a lightweight message-passing scheme [14]. Suleiman et al. [13] design Navion, an energy-efficient accelerator that fully integrates a visual-inertial odometry system on chip while eliminating expensive off-chip processing and storage for autonomous navigation of drones. In terms of mapping and routing,
an efficient branch-and-bound algorithm proposed by Hu
et al. [9] automatically maps the IPs onto a generic NoC
so that the communication cost is minimized while the
timing constraint is met. In contrast to prior work, we
present an energy-aware load-balancing community de-
tection algorithm together with a mapping strategy and
test it using a UAV self-navigation application.
III. Brief Overview of the Basics of the UAV
Navigation Controller
Fig. 1(A) shows a UAV with six degrees of freedom. Three degrees of freedom describe the translational motions (x, y, z) and the other three describe the rotational motions (r, p, q). Each of the four propellers is driven by a rotor that sets its angular velocity. These four angular velocities are the inputs of the quadrotor, ω = [ω1, ω2, ω3, ω4]. The quadrotor generates twelve outputs, X = [x, y, z, r, p, q, ẋ, ẏ, ż, ṙ, ṗ, q̇], corresponding to the translational and rotational positions and their corresponding velocities [6].
For real-time applications, the error between the actual
UAV position, estimated by a navigation system, and the
desired position is fed into a PD-controller to determine
the required control inputs. The required rotor speeds are
then calculated from the respective torques using:
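\[
\begin{pmatrix} T \\ \Gamma \end{pmatrix}
=
\begin{pmatrix}
-b & -b & -b & -b \\
0 & -db & 0 & db \\
-db & 0 & db & 0 \\
k & -k & k & -k
\end{pmatrix}
\begin{pmatrix} \omega_1^2 \\ \omega_2^2 \\ \omega_3^2 \\ \omega_4^2 \end{pmatrix}
\tag{1}
\]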
where T is the total thrust generated by the four propellers, Γ is the torque vector applied to the airframe, b represents the lift constant, d is the distance from the rotor to the center of mass, and k is a secondary lift constant. The control structure employed to fly the quadrotor can be found in [6, 2], and is based on Proportional Derivative action to control the quadrotor's attitude (roll, pitch, yaw) and altitude.
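As a minimal sketch of how Equation (1) can be used in both directions, the Python below evaluates the matrix product and its closed-form inverse, which recovers the squared rotor speeds from a commanded thrust and torque. The function names and the numeric constants in the usage note are illustrative assumptions, not values from the paper.

```python
def thrust_and_torque(w_sq, b, d, k):
    """Forward map of Equation (1): squared rotor speeds -> (T, tau_1, tau_2, tau_3)."""
    w1, w2, w3, w4 = w_sq
    thrust = -b * (w1 + w2 + w3 + w4)   # first matrix row
    tau_1 = d * b * (w4 - w2)           # second matrix row
    tau_2 = d * b * (w3 - w1)           # third matrix row
    tau_3 = k * (w1 - w2 + w3 - w4)     # fourth matrix row
    return thrust, tau_1, tau_2, tau_3


def rotor_speeds_squared(thrust, tau_1, tau_2, tau_3, b, d, k):
    """Closed-form inverse of Equation (1): (T, torques) -> squared rotor speeds."""
    total = -thrust / b        # w1 + w2 + w3 + w4
    alt = tau_3 / k            # (w1 + w3) - (w2 + w4)
    diff_42 = tau_1 / (d * b)  # w4 - w2
    diff_31 = tau_2 / (d * b)  # w3 - w1
    w1 = (total + alt) / 4 - diff_31 / 2
    w3 = (total + alt) / 4 + diff_31 / 2
    w2 = (total - alt) / 4 - diff_42 / 2
    w4 = (total - alt) / 4 + diff_42 / 2
    return w1, w2, w3, w4
```

For instance, calling `thrust_and_torque((1.0, 2.0, 3.0, 4.0), b=2.0, d=0.5, k=0.1)` and feeding the result to `rotor_speeds_squared` with the same constants returns the original squared speeds up to rounding. The constants here are placeholders chosen for easy arithmetic.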
IV. Parallelization Discovery and Energy
Optimization Approach
A. Energy Model
Both IP cores and interconnection consume energy.
While most of the mapping algorithms based on the one
in [9] only compute dynamic energy, our model consid-
ers both static and dynamic power dissipation. N. Grech
et al. [8] propose an application static energy analysis
technique to determine the instruction energy model di-
rectly at the LLVM IR level. Through analysis and mea-
surement of a large set of target ISA instructions, it was
found that LLVM IR instructions can be divided roughly
into four groups: memory, M , program flow, B, division,
D, and all other instructions, G. This yields an energy
model EN of a program executed sequentially in a com-
puting node:
\[
E_N = \sum_{i \in \{M, B, D, G\}}^{n} E_i N_i \tag{2}
\]
where Ei is the energy cost of a single instruction in group
i, Ni is the number of the instructions executed in that
group, and n denotes the number of instructions.
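The following sketch illustrates Equation (2) in Python. The per-group energy costs and the assignment of individual opcodes to the groups M, B, D and G are hypothetical placeholders; the paper relies on the measured model of [8] and does not list these values.

```python
from collections import Counter

# Hypothetical energy cost per instruction, in nanojoules, for each group.
ENERGY_NJ = {"M": 2.0, "B": 1.0, "D": 4.0, "G": 0.5}

# Assumed opcode-to-group assignment (not taken from the paper).
MEMORY_OPS = {"load", "store"}
BRANCH_OPS = {"br", "switch", "call", "ret"}
DIVISION_OPS = {"udiv", "sdiv", "fdiv", "urem", "srem", "frem"}


def group_of(opcode):
    """Classify an LLVM IR opcode into memory (M), program flow (B), division (D) or other (G)."""
    if opcode in MEMORY_OPS:
        return "M"
    if opcode in BRANCH_OPS:
        return "B"
    if opcode in DIVISION_OPS:
        return "D"
    return "G"


def node_energy(opcodes, energy=ENERGY_NJ):
    """Equation (2): sum over groups of E_i * N_i for a sequentially executed trace."""
    counts = Counter(group_of(op) for op in opcodes)
    return sum(energy[group] * n for group, n in counts.items())
```

With these placeholder costs, `node_energy(["store", "load", "load", "fcmp"])` counts three memory instructions and one instruction in the general group.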
Using the bit energy concept proposed by Ye et al. in
[16], the total dynamic energy consumption can be com-
puted by:
\[
E_{NoC}^{Dy} = \sum_{i=1}^{a} \sum_{j=1}^{b} w_{ij} \left( \eta_{ij} E_{S_{bit}} + (\eta_{ij} - 1) \times E_{L_{bit}} \right) \tag{3}
\]
where ESbit and ELbit represent the energy consumed per bit by a switch and by a link, respectively; ηij is the number of routers that the packet from tile τi to tile τj passes through; wij is the size of the packet; and a and b denote the number of tiles along the x and y dimensions, respectively.
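A short Python sketch of Equation (3) follows. It assumes a 2D mesh with XY routing, so that the number of routers on a path is the Manhattan distance between the tiles plus one; the tile coordinates and the packet list format are illustrative choices rather than part of the paper.

```python
def routers_on_path(src, dst):
    """Routers traversed under XY routing on a mesh: Manhattan distance + 1."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1]) + 1


def dynamic_energy(packets, e_sbit=1e-12, e_lbit=1e-12):
    """Equation (3): sum of w * (eta * E_Sbit + (eta - 1) * E_Lbit) over all packets.

    packets is an iterable of (source_tile, destination_tile, size_in_bits).
    """
    total = 0.0
    for src, dst, bits in packets:
        eta = routers_on_path(src, dst)
        total += bits * (eta * e_sbit + (eta - 1) * e_lbit)
    return total
```

A 6-bit packet that crosses three routers, for example from an assumed tile (0, 0) to tile (1, 1), costs 6 × (3 + 2) × 10^-12 J = 30 × 10^-12 J, the same arithmetic as the cluster A to cluster C example in Section IV.D.1.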
The static energy characterizes the energy consumed when packets are congested in the buffers. For simplicity, it is defined as:
\[
E_{NoC}^{St} = \sum_{i=1}^{n} P_{St} \times w_i \times t_i \tag{4}
\]
where n is the number of times that congestion occurs;
PSt is the energy consumption of one bit of data stored
in the buffer for one unit of time; wi is the data size of
the ith congestion; and ti is time of the ith congestion.
Equation (5) gives the total energy consumption for the interconnect.
\[
E_{NoC} = E_{NoC}^{St} + E_{NoC}^{Dy} \tag{5}
\]
Finally, given the total number of tiles n, the energy consumption of the entire chip is computed as:
\[
E = \sum_{i=1}^{n} E_{N_i} + E_{NoC} \tag{6}
\]
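Equations (4) to (6) compose directly, as the following Python sketch shows. The list-of-tuples format for congestion events and the numbers in the usage note are assumptions made for illustration.

```python
def static_energy(congestions, p_st=1e-12):
    """Equation (4): sum of P_St * w_i * t_i over congestion events.

    congestions is an iterable of (data_size_in_bits, congestion_time),
    with time expressed in the same unit that P_St is defined for.
    """
    return sum(p_st * bits * duration for bits, duration in congestions)


def noc_energy(e_static, e_dynamic):
    """Equation (5): interconnect energy is static plus dynamic energy."""
    return e_static + e_dynamic


def chip_energy(node_energies, e_noc):
    """Equation (6): per-tile computation energy plus interconnect energy."""
    return sum(node_energies) + e_noc
```

For example, one congestion event that holds 4 bits in a buffer for 10 time units contributes 40 × 10^-12 J of static energy when PSt = 1 × 10^-12 J; adding a dynamic energy term and the per-tile energies then gives the chip total.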
B. Compiler Analysis and Model of Computation Extrac-
tion
In order to generate the data dependency graph (DDG), we adopt the LLVM IR [10]. The rationale is that LLVM is a language-independent system that exposes the commonly used primitives needed to implement high-level language features, which makes it easy to generate a back end for any target platform.
With the help of Clang, C/C++ applications are com-
piled into a dynamic IR execution trace. We developed
a parser to construct a data dependency graph from the
IR trace. The parser analyzes memory operations to ob-
tain latency and data sizes. Because the execution times
and energy vary on data sizes and where the data re-
sides, taking those values into account could potentially
reduce inter-core communications by grouping the source
and destination instructions of a register into one cluster.
Three hash tables are created and updated when parsing:
the source table, the destination table and the depen-
dency table. The source/destination tables are used to
keep track of source/destination registers with keys be-
ing source or destination registers and values being the
corresponding line number. The dependency table stores dependencies between nodes, with keys being the line number of the current instruction, and values being clock cycles, data sizes and line numbers of previous instructions dependent on the same virtual register.

TABLE I: The source, destination and dependency tables

LLVM IR trace
store double %5, double* %1, align 8
%2 = load double, double* %1, align 8
%3 = load double, double* %6, align 8
%4 = fcmp oeq double %2, %3

Src Table        Dest Table       Dependency Table
Key      Value   Key      Value   Key      Value
%5       1       %1       1       2        1
%1       2       %2       2       4        2,3
%6       3       %3       3
%2, %3   4       %4       4
For example, Table I shows an LLVM IR snippet extracted from an application compiled by the Clang front end. As the parser reads the first line, a source table and a destination table are created. The source table is updated with the key being %5 and the value being 1, and the destination register is hashed into the destination table with the key being %1 and the value being 1. When line 2 is read, the source register %1 happens to be the destination register in line 1. A dependency table is created and updated with the key being 2 (the line number of the current instruction) and the value being 1 (the line number of the instruction it depends on). Following the same procedure, the three hash tables end up as shown in Table I.
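The table-building procedure can be sketched in a few lines of Python. This is a simplified illustration, not the parser used in the paper: it identifies registers with a regular expression, treats the pointer operand of a store as the destination, and records only line numbers in the dependency table (the clock cycles and data sizes mentioned above are omitted).

```python
import re

REGISTER = re.compile(r"%[\w.]+")


def parse_trace(lines):
    """Build the source, destination and dependency tables from an LLVM IR trace."""
    src_table, dest_table, dep_table = {}, {}, {}
    for line_no, text in enumerate(lines, start=1):
        regs = REGISTER.findall(text)
        if text.lstrip().startswith("store"):
            # store <ty> <value>, <ty>* <pointer>: the pointer is written to.
            sources, dest = regs[:-1], regs[-1]
        elif "=" in text:
            # %dest = op ...: the first register is the result.
            dest, sources = regs[0], regs[1:]
        else:
            dest, sources = None, regs
        # A source register that an earlier line wrote creates a dependency.
        deps = [dest_table[r] for r in sources if r in dest_table]
        if deps:
            dep_table[line_no] = deps
        if sources:
            src_table[", ".join(sources)] = line_no
        if dest is not None:
            dest_table[dest] = line_no
    return src_table, dest_table, dep_table


TRACE = [
    "store double %5, double* %1, align 8",
    "%2 = load double, double* %1, align 8",
    "%3 = load double, double* %6, align 8",
    "%4 = fcmp oeq double %2, %3",
]
src, dest, dep = parse_trace(TRACE)
```

Running the sketch on the trace of Table I reproduces the three tables shown there, with the dependency table mapping line 2 to line 1 and line 4 to lines 2 and 3.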
C. Discovering the Processing Community Structure
To formulate this problem, we introduce the following
concepts:
Definition 1. A data dependency graph (DDG) is a
weighted directed graph G = G(ai, bij, ei, wij | i, j ∈ N)
where each vertex ai represents one LLVM IR instruction;
each edge bij with weights wij characterizes either the de-
pendency from ai to aj or the control flow such as jumps
or branches from one block to another; and ei stands for
the estimated energy of the vertex given in Section IV.A.
Definition 2. A weight wij between ai and aj is calcu-
lated by latency times data size. Latency characterizes
the delay from ai to aj based on the timing information.
Data size represents the number of bytes transferred.
Definition 3. A quality function determines how efficiently the LLVM IR instructions are grouped together in terms of energy consumption, parallelism, load balancing, hardware utilization and inter-cluster data movements.
The discovery of the processing community structure
problem can now be formulated as follows: Given
a DDG, find non-overlapping processing communities
which maximize the quality function:
\[
Q = \sum_{c=1}^{n_c} \left( \frac{W_c - S_c}{W} - \frac{(W_c - \overline{W})^2}{W} \right) - \frac{\sum_{c=1}^{n_c} E_{N_c} + E_L}{E} \tag{7}
\]
and satisfy:
\[
N \geq n_c \tag{8}
\]
The first term in equation (7) confines the data flow within each cluster as much as possible. It is the difference between the sum of the weights inside a cluster (Wc) and the sum of the weights of the edges connecting the cluster to other clusters (Sc). The greater this term is, the fewer inter-cluster data movements occur and the more energy is saved.
The second term in equation (7) measures the squared deviation between the sum of weights in cluster c and the average sum of weights over all clusters. Minimizing this term balances the load and takes full advantage of parallel execution.
The third term in equation (7) characterizes the en-
ergy model of the application, where ENc calculates the
energy consumed at each node using Equation (2) and
EL computes the energy consumption for communication
transactions. To maximize quality Q, this term needs to
be minimized in order to save energy.
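To make the three terms concrete, the sketch below evaluates the quality function for a given partition in Python. Several readings are assumptions, since the text does not define every symbol: W is taken as the total edge weight of the DDG, the average in the second term is the mean of the intra-cluster weight sums, and the communication energy EL and the normalizing energy E are supplied by the caller.

```python
def quality(edges, cluster_of, node_energy, comm_energy, total_energy):
    """Evaluate Q for a partition.

    edges: iterable of (u, v, weight); cluster_of: {node: cluster id};
    node_energy: {node: energy}; comm_energy plays the role of E_L and
    total_energy the role of E (both assumed to be given).
    """
    clusters = sorted(set(cluster_of.values()))
    w_total = sum(w for _, _, w in edges)
    w_in = {c: 0.0 for c in clusters}    # W_c: weight kept inside cluster c
    s_cut = {c: 0.0 for c in clusters}   # S_c: weight crossing the boundary of c
    for u, v, w in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            w_in[cu] += w
        else:
            s_cut[cu] += w
            s_cut[cv] += w
    w_mean = sum(w_in.values()) / len(clusters)
    structure = sum(
        (w_in[c] - s_cut[c]) / w_total - (w_in[c] - w_mean) ** 2 / w_total
        for c in clusters
    )
    energy = (sum(node_energy[n] for n in cluster_of) + comm_energy) / total_energy
    return structure - energy
```

On a small graph with two heavy edges and one light edge, a partition that keeps each heavy edge inside a cluster scores higher than one that cuts every edge, which matches the intent of the first two terms.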
D. Compact Intelligence Mapping into Constrained Hard-
ware
The tile to which each cluster is mapped significantly affects the power consumption of the application since it determines the dynamic and static communication cost. We therefore propose an approach similar to the one in [9] that also takes cluster ordering into consideration, which reduces the static energy consumption caused by congestion and contention for hardware resources.
Definition 4. A task graph (TG) is a weighted directed
acyclic graph TG = G(ci, aij, v(aij), b(aij) | i, j ∈ N)
where each vertex ci represents a cluster of LLVM IR
instructions that are grouped together by our community
detection algorithm, and each edge aij represents commu-
nication from node ci to node cj .
• v(aij): data size from ci to cj .
• b(aij): bandwidth requirement from ci to cj .
Definition 5. An architecture graph (AG) is a directed
graph AG = G(ti, pij, e(pij) | i, j ∈ N) where each vertex
ti represents a tile, and each edge pij represents a routing
path from ti to tj .
• e(pij): energy consumption from ti to tj .
• L(pij): set of links that makes up pij
Fig. 2.: Application of a topological sort to task graph.
In order to exploit parallelism and pipelining, we apply topological sort to the task graph before mapping. The depth of cluster ci is defined as the maximum number of edges from the root to ci. In Fig. 2, cluster D cannot execute before clusters B and C because it needs data from both of them. However, clusters B and C can execute in parallel because they are at the same depth.
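The depth defined above can be computed during a topological sort, as in the following Python sketch based on Kahn's algorithm. The edge list in the usage note is inferred from the discussion of Fig. 2 in the text (A feeds B, C and D; B and C feed D) and should be read as an assumption about the figure.

```python
from collections import defaultdict, deque


def cluster_depths(edges):
    """Return {cluster: depth}, where depth is the longest edge count from a root."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    depth = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    while ready:
        u = ready.popleft()
        for v in successors[u]:
            # Keep the longest path seen so far to v.
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return depth


FIG2_EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D")]
depths = cluster_depths(FIG2_EDGES)
```

Here B and C share depth 1 and can run in parallel, while D sits at depth 2 even though it also has a direct edge from A.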
Algorithm 1: Compact Intelligence Mapping Algorithm
Input: TG and AG
Output: Mapping from TG to AG
count = 0
while TG is not empty do
    if count == 0 then
        Get the cluster with depth of zero and map it to (0,0)
    else
        Create a set S_count of all clusters with depth of count
        Map S_count to the available tiles in AG so that:
            min{E = Σ_{∀ a_ij} v(a_ij) · e(p_{map(ci),map(cj)})}
    count++
if any idle tile t is left in AG then
    Power gate t
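One possible reading of Algorithm 1 is sketched below in Python. It is an illustration under stated assumptions rather than the implementation evaluated in this paper: each depth level is placed by exhaustive search over the free tiles, the path energy e(p) follows Equation (3) with XY routing on a mesh, the task graph has a single root, and there are at least as many tiles as clusters. The data volumes in the usage note are hypothetical.

```python
from itertools import permutations


def path_energy_per_bit(src, dst, e_sbit=1e-12, e_lbit=1e-12):
    """e(p_ij) under XY routing: eta routers and eta - 1 links, as in Equation (3)."""
    eta = abs(src[0] - dst[0]) + abs(src[1] - dst[1]) + 1
    return eta * e_sbit + (eta - 1) * e_lbit


def map_clusters(volumes, depth, rows, cols):
    """Map clusters to mesh tiles one depth level at a time.

    volumes: {(ci, cj): bits sent from ci to cj}; depth: {cluster: depth}.
    Returns (placement, idle_tiles); the idle tiles are the ones to power gate.
    """
    free = [(x, y) for x in range(rows) for y in range(cols)]
    placement = {}
    for level in range(max(depth.values()) + 1):
        group = sorted(c for c, d in depth.items() if d == level)
        if level == 0:
            # Algorithm 1 pins the depth-zero cluster to tile (0, 0).
            placement[group[0]] = (0, 0)
            free.remove((0, 0))
            group = group[1:]
        if not group:
            continue

        def cost(tiles):
            # Communication energy of edges that touch the clusters being placed.
            trial = {**placement, **dict(zip(group, tiles))}
            return sum(
                bits * path_energy_per_bit(trial[ci], trial[cj])
                for (ci, cj), bits in volumes.items()
                if ci in trial and cj in trial and (ci in group or cj in group)
            )

        best = min(permutations(free, len(group)), key=cost)
        for cluster, tile in zip(group, best):
            placement[cluster] = tile
            free.remove(tile)
    return placement, free


VOLUMES = {("A", "B"): 6, ("A", "C"): 6, ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
DEPTH = {"A": 0, "B": 1, "C": 1, "D": 2}
placement, idle = map_clusters(VOLUMES, DEPTH, rows=2, cols=3)
```

On this 2 × 3 mesh, B and C land on the two tiles adjacent to A, D is placed next to both of them, and the two remaining tiles are returned as idle. The exhaustive search is only practical for small levels; a larger task graph would need a heuristic such as the branch-and-bound approach of [9].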
D.1 Energy and Congestion Analysis
The energy-aware mapping proposed in [9] (we refer to it as H) does not consider the order of the clusters, which can lead to significant congestion and static energy consumption in NoCs. This section shows how our algorithm mitigates this problem.
For illustration purposes, we assume ESbit = ELbit = 1 × 10^-12 J/bit. Applying H's mapping to the TG in Fig. 2 may yield the two different mappings shown in Table II. For instance, using Equation (3) in H's mapping, EDyAC = 6 × (3 × ESbit + (3 − 1) × ELbit) = 30 × 10^-12 J. Both mappings have a dynamic energy cost of 109 × 10^-12 J.

TABLE II: Mapping comparison: dynamic energy (H's mapping vs. our mapping; dynamic energy = 109 × 10^-12 J for both)

In terms of static energy, we assume PSt = 1 × 10^-12 J and an execution time of 10 ns for all clusters. We also assume that one packet flit is 1 bit, that a flit takes ts = 2 ns to pass through a switch, and that it takes tl = 1 ns to pass through a link. Fig. 3 shows the timing diagram of all computations and all packet deliveries for both mappings. For instance, in H's mapping, the first flit of the packet from cluster A to B takes 2 × ts + tl = 5 ns to arrive (routing delay), while the rest of the packet needs another 9 ns (packet delay).
Fig. 3.: Mapping comparison: static energy
In H's mapping, when cluster B finishes execution and is about to route the packet to D, D's input buffer is busy because of the A → D and A → C packet transmissions. Thus, B must wait until A → C is done. While the two mappings yield the same execution time of 67 ns, the packet from B to D in H's mapping experiences a 10 ns longer congestion delay and hence consumes more static energy. Applying Equations (4) and (5), H's mapping consumes 17% more energy in the interconnect.
V. Experimental Results
We use gem5 [3] together with McPAT [11] for architectural and power simulation. Our baseline model is a 2-core ARM processor connected by a 2D mesh NoC [1] with the MESI cache coherence protocol. Detailed parameters are listed in Table III.
TABLE III: Simulation parameters of baseline processor

Parameter          Value
Cores              2 in-order ARM cores at 500MHz
L1 Private Cache   32KB, 4-way, 32-byte block
L2 Shared Cache    128KB, 8-way
Topology           2D Mesh with XY routing
First, we examine the run time of our processing community discovery algorithm (Fig. 4) as the number of cores grows. The processing community discovery is done offline (only once), so a run time of around two minutes does not affect the controller speed during UAV navigation. For system sizes under 256 cores, the run time depends mainly on the map size and remains roughly constant as the core count increases. Once the core count passes a threshold of 256, the run time rises significantly.
Fig. 4.: Run-time of community detection algorithm.
TABLE IV: TGs of different core count

Row   #Core   Inst/core (SD)   (Inter-core) flits   Avg degree   Avg weight
1     1       16637            24324                2.31         13.87
2     BL      514.9            12023                2932         15.87
3     2       415.9            8497                 2646.5       11.48
4     4       199.6            6391                 1932.2       12.64
5     8       104.8            4213                 748.5        11.88
6     16      53.2             3531                 593.5        10.67
7     32      29.1             2919                 713.8        9.73
8     64      12.5             1823                 293.3        10.58
9     128     13.4             3769                 252.9        7.84
10    256     7.3              5322                 120.3        12.39
11    512     4.3              14483                45.9         16.99
Fig. 5.: DDG of UAV navigation application.
The statistics of the generated clusters are shown in Table IV. In row 1, the Inst/core (SD) column gives the total number of instructions of the application; from row 2 onward, it records the standard deviation of the number of instructions partitioned into each cluster. In row 1, the (Inter-core) flits column gives the total number of edges in the application; from row 2 onward, it records the total number of flits that must be transported between cores. The baseline (BL) is randomly partitioned. As the number of cores increases, the inter-core communication first drops to 913 edges at 64 cores (an 86.4% reduction compared to the baseline) and then soars to 7482 at 512 cores (11.3% more than the baseline). With the same 2 cores as the baseline, our algorithm reduces the edges by 21.2%. The reason is that our algorithm effectively lowers the inter-core communication for core counts up to 64. Beyond 64 cores, as fewer and fewer instructions run on each core, the inter-core message passing increases dramatically.

Fig. 6.: Speedup, power and PDP of different core counts.
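Statistics of the kind reported in Table IV can be derived from a partition with a few lines of Python. This sketch assumes a population standard deviation and counts DDG edges that cross cluster boundaries as a proxy for inter-core traffic; the paper does not state which standard deviation it uses or how edges translate into flits.

```python
from collections import Counter
from statistics import pstdev


def partition_stats(cluster_of, edges):
    """Return (SD of instructions per cluster, number of inter-cluster edges).

    cluster_of: {instruction: cluster id}; edges: iterable of (u, v) pairs.
    """
    sizes = Counter(cluster_of.values())
    inter_cluster = sum(1 for u, v in edges if cluster_of[u] != cluster_of[v])
    return pstdev(list(sizes.values())), inter_cluster
```

For example, `partition_stats({1: 0, 2: 0, 3: 1, 4: 1, 5: 1}, [(1, 2), (2, 3), (4, 5)])` describes two clusters of sizes 2 and 3 with a single edge crossing between them.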
Next, we evaluate the speedup and power consumption of our design (Fig. 6). The power values are collected by feeding the outputs of gem5 to McPAT. By taking full advantage of parallel execution, load balancing, and optimized inter-community communication, our design achieves a maximum speedup of around 10.5x on the 64-core architecture and energy savings of 8.4x at 32 cores. The application scales up to roughly 64 cores because of its relatively small number of instructions. Mapping to 512 cores even yields longer run times and higher energy consumption because more flits need to be routed between cores. The delay in Fig. 6 refers to the time to run one iteration of the next target position calculation. The minimum power-delay product (PDP) is achieved by the 32-core configuration at 5.56 µs·mW, 39.3x lower than the baseline PDP of 219.6 µs·mW. Map size hardly affects the run time and power: simulations run on three different map sizes give approximately the same results.
TABLE V: Power consumption of DJI flight controllers

Model              Max Power   Normal Power
DJI ACE ONE        5W          3.2W
DJI NAZA-H         3.2W        1.5W
DJI NAZA-M LITE    1.5W        0.6W
DJI NAZA-M V2      1.5W        0.6W
Finally, we illustrate the potential of our design by comparing it with the state-of-the-art flight controllers used in DJI drones. As shown in Table V, the NAZA-M LITE and NAZA-M V2 have the lowest power consumption among these controllers, with a max power of 1.5W and a normal power of 0.6W. Our design consumes significantly less energy than DJI's controllers.
VI. Conclusion
In this paper, we first develop an LLVM IR parser to construct the DDG for a UAV autonomous navigation application. Next, we analyze the DDG structure and discover its best degree of parallelization by applying our load-balancing and energy-aware processing community discovery algorithm, so that data movement is confined within clusters and static energy consumption is minimized. Finally, we propose a congestion-aware mapping scheme based on topological sort to map clusters onto the NoC for parallel execution. Simulations show that our optimal 32-core design achieves an average 8.4x energy savings and that the 64-core configuration achieves a 10.5x performance speedup.
References
[1] N. Agarwal, T. Krishna, L.-S. Peh, and N. K. Jha. Garnet: A detailed on-chip network model inside a full-system simulator. In 2009 ISPASS, pages 33–42. IEEE, 2009.
[2] S. Armah, S. Yi, W. Choi, and D. Shin. Feedback control of quad-rotors with a MATLAB-based simulator. American Journal of Applied Sciences, 2016.
[3] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, 2011.
[4] P. Bogdan. Mathematical modeling and control of multifractal workloads for data-center-on-a-chip optimization. In Proceedings of the 9th NOCS, page 21. ACM, 2015.
[5] P. Bogdan and Y. Xue. Mathematical models and control algorithms for dynamic optimization of multicore platforms: A complex dynamics approach. In Proceedings of the ICCAD, pages 170–175. IEEE Press, 2015.
[6] P. Corke. Flying Robots Book: Robotics, Vision and Control. Springer, 2017.
[7] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In Proceedings of the 38th DAC, pages 684–689. ACM, 2001.
[8] N. Grech, K. Georgiou, J. Pallister, S. Kerrison, J. Morse, and K. Eder. Static analysis of energy consumption for LLVM IR programs. In Proceedings of the 18th SCOPES, pages 12–21. ACM, 2015.
[9] J. Hu and R. Marculescu. Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures. In 2003 DATE, pages 688–693. IEEE, 2003.
[10] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the CGO'04, page 75. IEEE Computer Society, 2004.
[11] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd MICRO, pages 469–480. ACM, 2009.
[12] T. Hsu and N. Kitroeff. Boeing's 737 Max suffers setback in flight simulator test, 2019.
[13] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze. Navion: A 2-mW fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones. IEEE Journal of Solid-State Circuits, 2019.
[14] C. Tan, A. Kulkarni, V. Venkataramani, M. Karunaratne, T. Mitra, and L.-S. Peh. Locus: Low-power customizable many-core architecture for wearables. TECS, 17(1):16, 2018.
[15] Y. Xiao, Y. Xue, S. Nazarian, and P. Bogdan. A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach. In Proceedings of the 36th ICCAD, pages 217–224. IEEE Press, 2017.
[16] T. T. Ye, L. Benini, and G. De Micheli. Analysis of power consumption on switch fabrics in network routers. In Proceedings 2002 DAC, pages 524–529. IEEE, 2002.
