Exploring Computation-Communication Tradeoffs in Camera Systems by Mazumdar, Amrita et al.
Exploring Computation-Communication Tradeoffs
in Camera Systems
Amrita Mazumdar∗, Thierry Moreau∗, Sung Kim†, Meghan Cowan∗,
Armin Alaghi∗, Luis Ceze∗, Mark Oskin∗, and Visvesh Sathe†
∗Paul G. Allen School of Computer Science & Engineering, University of Washington
†Department of Electrical Engineering, University of Washington
{amrita,moreau,cowanmeg}@cs.washington.edu, sungk9@uw.edu, {armin,luisceze,oskin}@cs.washington.edu, sathe@uw.edu
Abstract—Cameras are the de facto sensor. The growing demand
for real-time and low-power computer vision, coupled
with trends towards high-efficiency heterogeneous systems, has
given rise to a wide range of image processing acceleration
techniques at the camera node and in the cloud. In this paper,
we characterize two novel camera systems that use acceleration
techniques to push the extremes of energy and performance
scaling, and explore the computation-communication tradeoffs
in their design. The first case study targets a camera system
designed to detect and authenticate individual faces, running
solely on energy harvested from RFID readers. We design a
multi-accelerator SoC operating in the sub-mW range,
and evaluate it with real-world workloads to show performance
and energy efficiency improvements over a general purpose
microprocessor. The second camera system supports a 16-camera
rig processing over 32 Gb/s of data to produce real-time 3D-360°
virtual reality video. We design a multi-FPGA processing pipeline
that outperforms CPU and GPU configurations by up to 10×
in computation time, producing panoramic stereo video directly
from the camera rig at 30 frames per second. We find that an
early data reduction step, either before complex processing or
offloading, is the most critical optimization for in-camera systems.
I. INTRODUCTION
Cameras are the backbone of data processing for applications
ranging from social media and entertainment, to surveillance,
biomedical devices, and autonomous vehicles. As these systems
continue to specialize and diversify, the traditional interface
between camera sensors and general-purpose processors limits
optimization for extreme visual computing applications. Typi-
cally, architects employ one of two solutions to enable compute-
heavy vision: on-device hardware acceleration, or cloud offload.
Hardware accelerators achieve improved performance and
efficiency at the cost of fixed functionality, while offloading
data to the cloud relaxes computational constraints at the
cost of data communication. The design tradeoff reduces to
balancing computation and communication constraints for a
given visual computing workload. In this paper, we investigate
two end-to-end camera systems that push the boundaries of
energy efficiency and performance, under these lenses of
computation and communication costs. For each system, we
focus on holistically evaluating the full system and computation-
communication tradeoffs across parts of the processing system
via “in-camera processing pipelines.”
The first camera system is an ultra-low-power camera system
that recognizes specific users’ faces while running on harvested
radio frequency (RF) energy. The ability to run untethered from
a power source makes deployment simple, but pushes the design
constraints to the extreme end of ultra-low-power design.
[Figure 1: pipeline blocks B1 to B5, with core and optional blocks marked.
Implementation and cost per block: B1 ASIC (C1), B2 none (0), B3 FPGA (C3),
B4 CPU (C4), B5 cloud (0). B1 to B4 are in camera; the output of B4 is
offloaded at communication cost Cc.]
Fig. 1. Hypothetical in-camera pipeline with opportunities for acceleration.
This pipeline uses core blocks and some optional blocks, and offloads
computation to the cloud.
The second camera system assembles immersive stereoscopic
virtual reality (VR) video processing in real time, requiring
significantly more compute and communication performance.
The system consists of 16 4K-resolution cameras, processing
hardware, and a network link. To deliver real-time VR video,
the system processes up to 32 Gb/s, making it impractical to
transmit the sensor data to a data center for real-time stereo
processing and stitching.
These systems push the bounds of camera system engineer-
ing, at opposing ends of the design space of camera systems—
extreme low power, and extreme performance. Both systems
decompose to pipelines of application-specific computation
blocks, like the generic in-camera processing pipeline of
Figure 1. By characterizing two extreme design points in a
common framework, we highlight how these data movement
considerations are common across the spectrum of camera
applications. Many other camera systems are likely to exist
between the power-constrained and high-bandwidth case studies
we investigate.
This paper makes the following contributions:
• A low-power face authentication accelerator for energy
harvesting cameras, with an ASIC evaluation on real-world
workloads.
• A real-time stereoscopic video assembly accelerator for
virtual reality, with a CPU, GPU and FPGA comparison.
• A joint evaluation of computation and communication
costs, demonstrating how adding more computation can
reduce the overall cost of accelerator-based image pro-
cessing architectures.
arXiv:1706.03864v2 [cs.AR] 16 Oct 2017
II. BACKGROUND AND RELATED WORK
In-camera processing is not new, and prior work has intro-
duced many in-camera processors [17]. Our analysis applies
in-camera processing pipelines to two highly-constrained appli-
cations: low-power face authentication and real-time VR video
streaming. In this section, we discuss our general approach to
analyzing image processing pipelines and review notable related
work in computation offloading, image processing hardware,
and similar accelerator designs.
In-camera processing pipelines. To characterize in-camera
systems in a holistic way, we decompose camera applications
into processing pipelines and evaluate the system at the level
of functional blocks, as shown in Figure 1. Considering camera
systems at the block granularity helps us gain insight into
deciding what processing steps should be included at the camera
node, and what implementations (e.g., ASIC, FPGA, GPU)
meet an application’s requirements. In the hypothetical pipeline
of Figure 1, blocks B1, B3, and B4 may be processed in-camera
while the output of B4 is offloaded to a central processor such
as a multicore or cloud processor. The block B2 is shown
excluded from the pipeline because it does not improve the
overall cost. We define the total cost of the pipeline as the sum
of computation costs for in-camera blocks (C1, C3, and C4)
and the communication cost (Cc) of offloading the output of
B4. We treat the cost of computing in the cloud as “free”
(relative to computation in the camera), but the cost of getting
data to the cloud is not (e.g., the camera expends energy to send
data). Hence, the main objective of computing in-camera is
to minimize both the data communicated and the
computational cost.
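The cost model above can be written down in a few lines. The following is a minimal sketch under stated assumptions: the `Block` type, the `total_cost` helper, the linear communication cost (`comm_cost_per_unit` times output size), and all numeric values are hypothetical and chosen only to mirror the B1 to B4 example of Figure 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    name: str
    compute_cost: float    # cost of running the block in-camera (e.g., energy)
    output_size: float     # amount of data the block emits (arbitrary units)
    included: bool = True  # optional blocks may be left out of the pipeline


def total_cost(blocks: List[Block], offload_after: str,
               comm_cost_per_unit: float) -> float:
    """Sum of in-camera compute costs plus the cost of offloading the
    output of the last in-camera block. Cloud computation is treated as free."""
    in_camera = []
    for block in blocks:
        if block.included:
            in_camera.append(block)
        if block.name == offload_after:
            break
    else:
        raise ValueError(f"unknown offload point: {offload_after}")
    if not in_camera:
        raise ValueError("no in-camera block produces data to offload")
    compute = sum(b.compute_cost for b in in_camera)
    communication = in_camera[-1].output_size * comm_cost_per_unit  # Cc
    return compute + communication


# Hypothetical numbers: B2 is excluded because it does not improve the total.
pipeline = [
    Block("B1", compute_cost=2.0, output_size=50.0),
    Block("B2", compute_cost=4.0, output_size=45.0, included=False),
    Block("B3", compute_cost=3.0, output_size=20.0),
    Block("B4", compute_cost=1.5, output_size=4.0),
]
cost = total_cost(pipeline, offload_after="B4", comm_cost_per_unit=0.5)
# cost = C1 + C3 + C4 + Cc = 2.0 + 3.0 + 1.5 + 4.0 * 0.5
```

Evaluating `total_cost` for each candidate offload point and each subset of optional blocks is one way to compare pipeline configurations within this model.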
In-camera processing pipelines can include core blocks
essential to the application, and optional blocks, which may not
directly affect results but can improve efficiency by filtering or
pre-processing data. One optional block is the motion detection
block we use in our face authentication pipeline. While the
core block of the pipeline, face authentication, operates on
every input frame, an optional motion detection block can
reduce the bandwidth and ensuing power consumption of core
blocks.
Computation offload. Offloading image processing
computation from mobile devices to the cloud is well-explored
in mobile systems [30]. The opposing case for “onloading”
computation, or keeping computation at the sensor, has grown
more popular due to increased image processing demand and
privacy concerns [16, 22]. Our approach explores the tradeoff
space between offload and onload for two constrained camera
systems.
Vision-centric architectures. The rise of computer vision
and computational photography has inspired a number of
computer architectures for efficient image processing. Flexible
vision architectures [5, 7, 10] provide higher performance for
image processing and vision applications while maintaining
programmability. Mobile SoCs like Qualcomm’s Snapdragon
provide image processing functionality for mobile cameras [27].
Vasilyev et al. [35] argue for programmable image
processing solutions, but find that custom ASICs are still
more energy efficient. Consequently, we choose to explore
fixed-function hardware to meet the constraints of our ultra-
low-power or high-performance application targets.
We consider different classes of image processing accelera-
tors for the computational blocks in our case studies; we now
detail related work in each class.
In-sensor processing. Image sensor data is typically cap-
tured as an analog signal and converted to a digital signal
for processing. Recent work investigated how to improve
application efficiency by moving some preliminary processing
into the analog domain at the sensor node. Centeye, for instance,
executes analog computation on image sensor signals [1]. Other
work computed early layers of convolutional NNs at the pixel
level [8, 21]. Processing can also be performed in the mixed-
signal domain [2].
Face detection accelerators. We investigate the use of a
face detection accelerator as an optional block to filter data in
a face authentication pipeline. Hardware acceleration for the
Viola-Jones face detection algorithm has been well-explored for
FPGAs and GPUs [9, 18, 19]. While Bong et al. also present
a neural network design using Haar filters as a first step, our
work performs a more holistic characterization to optimize the
full camera pipeline [6].
Neural network accelerators. NNs have been studied
extensively for accomplishing face detection and recogni-
tion [14, 29, 32]. Researchers are actively working to improve
NN performance with custom hardware [12, 13]. ShiDian-
Nao [11], specifically, is a CNN accelerator executed in-camera,
where the accelerator is placed on the same chip as the image
sensor processor, achieving 320mW power consumption.
Depth from stereo accelerators. Depth from stereo algo-
rithms and their implementations have been well-explored [31].
Stereo vision has been accelerated to real-time with GPUs and
FPGAs, but application targets are either very low resolution
or perform badly on defocusing workloads [34, 40].
In-camera compression. Compressing sensor data incurs
computation–communication tradeoffs related to this paper’s
analyses. In our VR pipeline, for instance, the output of some
blocks might have a better data locality than the previous step,
facilitating high compression rates, but lossy compression at the
early stages of the pipeline could result in quality degradations.
While we do not explicitly consider compression in our study,
compression can be treated as an optional block in in-camera
processing pipelines.
III. CASE STUDY: LOW-POWER FACE AUTHENTICATION
In this section, we characterize a continuous vision pipeline
for face authentication based on the WISPCam platform [25], a
battery-free camera powered by harvested energy. Face
authentication (FA) is a core workload in user-centric continuous
mobile vision systems. In these systems, a camera captures image
frames at a continuous frame rate, and an on-node processor
performs face recognition on each frame to identify a single
user. We define the core FA function as: given a test face and
a reference, decide if the test face matches the reference face.
[Figure 2: WISPCam pipeline with blocks B1 motion detection, B2 face
detection, and B3 face authentication.]
Fig. 2. Face authentication with battery-free cameras.
[Figure 3: SNNAP accelerator with a DMA master, bus, scheduler, and a
processing unit (PU) containing control, SRAM, processing elements (PEs),
and a sigmoid unit. The datapath detail shows PE0 to PE3, each with a
weight memory, a multiplier, and an adder, fed by d_in and an offset, with
accumulator and sigmoid FIFOs and output d_out; signal widths of 8, 16, and
26 bits are labeled.]
Fig. 3. NN microarchitecture and processing element details.
The WISPCam-based system captures an image at 1 frame
per second (FPS) and transmits it over RF, powered by an
internal capacitor with harvested RF energy. We examine
how leveraging progressive filtering hardware can dramatically
reduce the power consumption of such a system and enable
continuous face authentication at low cost. We construct our
FA pipeline around NN-based face authentication, as shown in
Figure 2. The pipeline has one core block, the NN, and several
optional blocks. We evaluate a low-power NN accelerator
design, as well as the benefits of including motion detection
and a pre-processing face detection accelerator to reduce input
bandwidth to the NN. Because energy efficiency is a primary
concern, we design the accelerators to be integrated on-chip
with the camera sensor and to process data streamed through the
CSI-2 camera serial interface.
We first discuss each accelerator design individually, present-
ing their microarchitectures and the tradeoffs we investigated
in each design’s algorithm and hardware implementation. We
then evaluate them together on a real-world face authentication
workload using real video we collected.
A. Neural network face authentication
For our face authentication task, we investigate a systolic
NN design, based on SNNAP, and explore tradeoffs in neural
network (NN) topology, accelerator geometry, and datapath
width reduction [24].
NN algorithmic tradeoffs. We first examine how modifying
NN topology affects both classification accuracy and energy
dissipation. We explore the search space by training NNs with
Fast Artificial Neural Network Library [26] and measuring the
achieved accuracy and energy cost.
Increasing the number of layers and neurons directly impacts
the memory and computational requirements of the NN.
Varying the input size to the NN has a direct impact on
performance and accuracy. Using a 5×5 low-resolution input
window for face detection will lead to a cheap 25-neuron
input layer, but results in poor accuracy. The largest input
size our NN supports, 20×20 pixels, preserves more details,
improving the accuracy of the NN classifier significantly. This
comes at a cost: halving classification error incurs an
order-of-magnitude increase in energy. From this exploration, we
select the topology that gives us the best accuracy/energy
compromise: a 400-8-1 NN topology with 400 input
neurons, 8 hidden neurons, and 1 output neuron.
(a) Algorithm pseudocode.
    while True:
        for x in range(0, image_width):
            for y in range(0, image_height):
                faces += classify(x, y, window)
        window *= scale_factor
        if window > image_size:
            return
(b) Cascading classifier. [Stages 1 through 20 of rectangular features; a
window proceeds to the next stage only on a "yes"; feature counts of 3, 15,
and 53 are labeled.]
(c) Impact of VJ parameters on relative accuracy. [Panels: Scale Factor
(1.25 to 2.00), Step Size (static) (4 to 16), Step Size (adaptive) (0.0 to
0.4). x-axis: Algorithm Parameter; y-axis: Relative Accuracy (0% to 100%);
series: F1 Score, Precision, Recall.]
Fig. 4. The face detection algorithm slides a window across an image and
repeatedly executes a classifier with stages of rectangular features.
To evaluate accuracy tradeoffs, we trained a 400-8-1 NN
on 90% of LFW [20], a popular face recognition benchmark,
and tested its accuracy at recognizing a single person’s face
from the remaining 10%. Our evaluation indicates that with a
400-8-1 topology, we can achieve a 5.9% classification error
overall. As we discuss in our real-world evaluation, however,
our multi-stage approach and real-data workload lower the
true miss rate to 0%, as the security workload presents many
less-challenging lighting and orientation scenarios.
NN microarchitecture. Our NN microarchitecture uses
a single processing unit with multiple processing elements.
Because our face authentication pipeline has wide layers, we
found that this design presented enough data parallelism to
keep functional unit utilization high for a single processing unit.
Figure 3 shows the datapath of a processing unit composed
of four 8-bit processing elements (PEs). A bus connects the
chain of processing elements to a sigmoid unit—a hardware
LUT-based approximation of a neuron’s activation function.
Each PE has its own weight memory that stores the synaptic
weights of the NN locally. The processing elements perform
multiply-add operations in a systolic fashion to evaluate the
matrix multiplication that composes each NN hidden and output
layer. A vertically micro-coded sequencer sends commands
to each processing element as inputs arrive and outputs are
produced to control data movement.
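The multiply-add flow described above can be illustrated with a small functional model. This is a sketch only, not a cycle-accurate description of the accelerator: the function name `layer_mac`, the assignment of one neuron per PE per pass, and the integer example values are assumptions made for illustration.

```python
def layer_mac(inputs, weights, num_pes):
    """Compute acc[n] = sum_i weights[n][i] * inputs[i] for every neuron n.

    Neurons are mapped onto `num_pes` processing elements, one neuron per PE
    per pass. Within a pass, input values stream past the PE chain one at a
    time; each PE multiplies the value by a weight from its local weight
    memory and adds the product to its accumulator.
    """
    num_neurons = len(weights)
    acc = [0] * num_neurons
    for base in range(0, num_neurons, num_pes):
        active = range(base, min(base + num_pes, num_neurons))
        for i, x in enumerate(inputs):      # one input value enters the chain
            for n in active:                # every active PE does one multiply-add
                acc[n] += weights[n][i] * x
    return acc


# Tiny integer example: three inputs, three neurons, two PEs (two passes).
example_acc = layer_mac(
    inputs=[1, 2, 3],
    weights=[[1, 0, 0], [0, 1, 0], [1, 1, 1]],
    num_pes=2,
)
```

In this model the accumulator values would then be passed through the sigmoid unit to produce the layer outputs.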
The NN hardware accelerator has a configurable number of
PEs, which we use to optimize the geometry of our accelerator.
We fix the frequency and voltage to 30MHz and 0.9V, and
explore the design tradeoffs between energy and throughput
using post-synthesis physical simulations. We find an
energy-optimal point at 8 PEs: fewer PEs introduce
scheduling inefficiencies, increasing energy consumption, while
more PEs result in underutilized resources and reduced
parallelism for the narrow network.
NN numerical accuracy tradeoffs. Power dissipation in the
memory and the PEs can be reduced by bit-width reduction.
We used fixed-point functional units and LUT-based
approximations of mathematical functions to minimize power and area.
We study the impact of two precision knobs on application
accuracy: (1) sigmoid approximation and (2) data bit-width. We
evaluate error as absolute classification accuracy loss relative
to a NN implemented with floating-point arithmetic and precise
mathematical functions. We then evaluate fixed-point precision,
limiting ourselves to powers of two for memory alignment.
[Figure 5: a camera rig feeds per-camera-pair pipelines of B1 pre-processing,
B2 image alignment, and B3 depth estimation, followed by B4 image stitching,
with cloud offload to a VR viewer.]
Fig. 5. 3D-360° virtual reality video generation, capture and viewing devices.
After examining the effect of approximating the sigmoid
function with a simple 256-entry look-up table (LUT), we
conclude that hardware approximation of the sigmoid function
has a negligible effect on accuracy. For datapath width,
both 16-bit and 8-bit implementations of the NN accelerator
result in a small 0.4% accuracy loss relative to a precise
floating-point implementation. The 4-bit datapath, however,
displays a significant accuracy loss on average (over 1%).
The reduction in datapath width from 16-bit to 8-bit leads to a
41% power reduction for an 8-PE configuration, so we select
8-bit datapaths as the optimal energy-accuracy point for our
NN implementation.
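To make the two precision knobs concrete, the sketch below builds a 256-entry sigmoid look-up table and a signed fixed-point quantizer. It is illustrative only: the table's input range of [-8, 8], nearest-entry indexing, and the split of 8 bits into 6 fractional bits are assumptions, since the text does not specify them.

```python
import math

LUT_SIZE = 256
X_MIN, X_MAX = -8.0, 8.0  # assumed input range covered by the table

# Precompute sigmoid at LUT_SIZE evenly spaced points in [X_MIN, X_MAX].
SIGMOID_LUT = [
    1.0 / (1.0 + math.exp(-(X_MIN + (X_MAX - X_MIN) * k / (LUT_SIZE - 1))))
    for k in range(LUT_SIZE)
]


def sigmoid_lut(x):
    """Approximate sigmoid(x) by the nearest table entry (inputs are clamped)."""
    x = min(max(x, X_MIN), X_MAX)
    index = round((x - X_MIN) / (X_MAX - X_MIN) * (LUT_SIZE - 1))
    return SIGMOID_LUT[index]


def quantize(value, total_bits=8, frac_bits=6):
    """Round to signed fixed point with saturation; returns the represented value."""
    scale = 1 << frac_bits
    lowest, highest = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale


# Worst-case LUT error over a sweep of inputs, including out-of-range ones.
sweep = [-10.0 + 0.01 * k for k in range(2001)]
max_lut_error = max(abs(sigmoid_lut(x) - 1.0 / (1.0 + math.exp(-x))) for x in sweep)
```

Under these assumptions the table error stays below 0.01 across the sweep, which is in line with the observation that approximating the sigmoid has a negligible effect on accuracy; the quantizer shows how narrower datapaths trade range and resolution.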
B. In-camera face detection
The Viola-Jones (VJ) face detection algorithm is a popular
computer vision algorithm for fast, accurate face detection [37].
It is widely used in face authentication and other situations
where frontal faces are expected and speed is preferred. The
algorithm detects faces by scanning a window across the image,
evaluating simple rectangular features within the window at
each window position. If enough of these features are found
at a single window position, then that window is identified as
a face. To account for faces of different sizes in an image, the
scanning window is scaled and passed over the scene multiple
times. The VJ algorithm is well-known because of its simplicity
and efficiency, and continues to perform well against more
complex algorithms including deformable parts models and
convolutional NNs on face detection [23].
The VJ algorithm is popular specifically because of its high
efficiency in non-face windows – the algorithm optimizes to
spend more computation on windows where there is likely to
be a face, rather than executing a uniform computation at every
window. This optimization is encoded in the cascade classifier
structure illustrated in Figure 4b, a nested decision tree where
progressive levels have increasingly more features to evaluate,
and the simple stages must be evaluated positively first before
continuing on. The cascading computational style makes VJ a
good fit for a pre-filtering accelerator.
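The scan-and-cascade structure can be sketched in a few lines of Python. This is a toy illustration, not a trained Viola-Jones detector: the two stage functions, the 6×6 test image, and the default parameters are hypothetical stand-ins for trained rectangular-feature stages.

```python
def run_cascade(stages, patch):
    """Accept a window only if every stage accepts it; reject at the first failure,
    so most non-face windows cost only the cheap early stages."""
    for stage in stages:
        if not stage(patch):
            return False
    return True


def detect(image, stages, window=4, scale_factor=1.25, step=1):
    """Slide a square window over the image at increasing window sizes."""
    height, width = len(image), len(image[0])
    faces = []
    while window <= min(width, height):
        for y in range(0, height - window + 1, step):
            for x in range(0, width - window + 1, step):
                patch = [row[x:x + window] for row in image[y:y + window]]
                if run_cascade(stages, patch):
                    faces.append((x, y, window))
        # Grow the window; max() guarantees progress for small windows.
        window = max(window + 1, int(window * scale_factor))
    return faces


# Toy stages (hypothetical): a cheap brightness test, then a top-vs-bottom contrast test.
def bright_enough(patch):
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels) > 0.5


def top_brighter_than_bottom(patch):
    half = len(patch) // 2
    return sum(map(sum, patch[:half])) > sum(map(sum, patch[half:]))


toy_stages = [bright_enough, top_brighter_than_bottom]
image = [[1.0] * 6, [1.0] * 6, [0.2] * 6, [0.2] * 6, [0.0] * 6, [0.0] * 6]
faces = detect(image, toy_stages)
```

Ordering the stages from cheapest to most expensive is what lets the cascade act as a pre-filter: dark or uniform windows are discarded by the first test and never reach later stages.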
IV. CASE STUDY: REAL-TIME VIRTUAL REALITY VIDEO
In this section, we investigate the use of in-camera processing
for a high-performance, real-time panoramic stereo video
rendering application. As shown in Figure 5, the pipeline we
consider takes as input the high-resolution camera feeds from
a rig of cameras, like Google Jump [3], and processes the
images into a 360° stereo pair viewed on a VR viewer, such
as Google Cardboard [15]. The goal is to produce high-quality
video streams at a frame rate of 30 frames/sec or more.
Many VR video pipelines require users to upload
camera streams to a cloud service or high-performance com-
puting system—this workflow prevents real-time applications
such as live VR video streaming. While real-time hardware
systems for processing VR video are becoming commercially
available [33, 36], these solutions provide either live panorama
processing or stereoscopic 3D, not both. In our design, we
evaluate the performance constraints of this multi-step pipeline
and investigate how much in-camera processing is required
to achieve real-time VR video generation. We evaluate how
processing at the camera node reduces the bandwidth required
for offloading, and how hardware acceleration facilitates a
real-time VR system.
Camera rigs for recording stereoscopic panorama videos
capture a multi-camera scene and compute a depth map for each
pair of cameras in the rig. These depth maps are composited
together from multiple pairwise-camera pipelines into a single
3D-360° video. For our application, we seek to meet a real-time
frame rate of 30 frames/sec, so we treat throughput as the
cost to optimize. We define the communication cost as
the bandwidth in and out of each block. Since all the pipeline
blocks and offloading can be pipelined, the slowest step will
dominate overall throughput. Among the blocks shown in
Figure 5, the depth estimation step has the lowest bandwidth
and throughput. In this section, we describe the depth estimation
algorithm used for this block, how we map the algorithm
to a high-throughput accelerator, and evaluate the system’s
computation-communication tradeoffs towards real-time results.
[Figure 6: four panels: a) input signal, b) after smoothing, c) input mapped
to a bilateral grid, d) after smoothing in the bilateral grid.]
Fig. 6. The bilateral filter is an edge-aware filter.
[Figure 7: Quality (MS-SSIM), 60% to 100%, versus Bilateral Grid Size (GB),
0 to 500, for 5 MP, 7 MP, and 8 MP images.]
Fig. 7. Using a smaller bilateral grid is cheaper to compute but degrades the
quality of the output depth map, even at high image resolutions.
A. Depth maps from bilateral-space stereo
We base our design for fast and accurate stereo processing on
the state-of-the-art bilateral-space stereo algorithm (BSSA) [4].
Typically, global stereo algorithms generate a depth map from
a pair of images by computing a rough disparity, or difference
in space, between pixels, and then refining that disparity until
a cost function has been minimized [31]. Instead of computing
disparities per-pixel, BSSA resamples the problem into a
different representation, bilateral-space, before computing the
disparity. In the bilateral domain, simple local filters are
equivalent to costly, global edge-aware filters in pixel-space—
consequently, disparity refinement is much faster in bilateral
space. We perform BSSA in a bilateral grid data structure,
where pixels are mapped to a grid vertex, or bin, in bilateral-
space. Filtering in the bilateral grid results in faster, higher-
quality output than comparable techniques [4].
We illustrate the operation of a bilateral filter in Figure 6. For
simplicity, we demonstrate a 1D signal, instead of a 2D image
signal. Our stereo algorithm seeks to smooth the noisy signal
of Figure 6a, which has a sharp edge. Applying a 1D moving
average on Figure 6a results in Figure 6b, which has less noise
and a smoothed-out edge. A bilateral filter performs the same
smoothing operation while preserving the edge of Figure 6a.
The signal is mapped to bilateral space as in Figure 6c, where
neighboring pixels with significantly different intensity values
will have a large distance in 2D-space. Smoothing this signal
in the bilateral domain with a 2D moving average allows the
signal to maintain edges. Figure 6d shows the result after
filtering in bilateral-space.
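The 1D example can be reproduced with a short NumPy sketch. It follows the generic bilateral-grid recipe (splat samples into a position-by-intensity grid, blur the grid, read the result back at each sample) and is not the BSSA solver; the function names, the grid cell sizes `s_space` and `s_range`, the [1, 2, 1] blur kernel, and the synthetic step signal are all assumptions made for illustration.

```python
import numpy as np


def _blur(grid):
    """Apply a [1, 2, 1] / 4 kernel along both grid axes (zero padding)."""
    p = np.pad(grid, ((1, 1), (0, 0)))
    grid = (p[:-2] + 2 * p[1:-1] + p[2:]) / 4
    p = np.pad(grid, ((0, 0), (1, 1)))
    return (p[:, :-2] + 2 * p[:, 1:-1] + p[:, 2:]) / 4


def bilateral_grid_smooth(signal, s_space=4, s_range=0.25, passes=2):
    """Edge-aware smoothing of a 1D signal via a 2D (position, intensity) grid."""
    signal = np.asarray(signal, dtype=float)
    # Grid coordinates: samples far apart in intensity land in distant cells.
    gx = np.round(np.arange(signal.size) / s_space).astype(int)
    gv = np.round((signal - signal.min()) / s_range).astype(int)
    values = np.zeros((gx.max() + 1, gv.max() + 1))
    weights = np.zeros_like(values)
    np.add.at(values, (gx, gv), signal)   # splat: accumulate sample values
    np.add.at(weights, (gx, gv), 1.0)     # and sample counts per cell
    for _ in range(passes):               # simple local blur in grid space
        values, weights = _blur(values), _blur(weights)
    return values[gx, gv] / weights[gx, gv]  # slice: normalized read-back


# Noisy step edge, as in panel (a) of the figure.
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(50), np.ones(50)]) + rng.uniform(-0.05, 0.05, 100)
boxed = np.convolve(signal, np.ones(9) / 9, mode="same")  # plain moving average
smoothed = bilateral_grid_smooth(signal)                  # edge-aware result
```

With these settings the moving average smears the edge around sample 50, while the grid-filtered signal stays near 0 on the left and near 1 on the right, because the two sides occupy intensity cells that the blur never mixes.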
Instead of a simple filter like moving average, BSSA maps
a noisy depth map to a bilateral grid, refines the depth map by
solving an optimization problem, and remaps the bilateral-grid
result to pixel-space. Varying the number of pixels that map to
a grid vertex impacts the time to compute the stereo refinement
for a frame, and also the quality of the depth map. Figure 7
demonstrates the tradeoff between stereo image quality and
bilateral grid size to be processed for high-resolution input
images. Here, we scaled bilateral grid sizes from 4
pixels-per-grid-vertex to 64 in each of three dimensions in a bilateral
grid and evaluated the resulting impact on quality using
MS-SSIM [38]. We find the resolution of the input images is less
impactful than choosing an appropriate grid size to balance
quality and computational complexity.
[Figure 8: Zynq SoC block diagram with an ARM core (CPU) and controller, CPU
interconnect, HP interconnect, memory interconnect, DMA, two HDMI cores (L
and R inputs), an Ethernet core, and compute units built from adders,
multipliers, and a shift.]
Fig. 8. VR accelerator architecture on a Xilinx Zynq SoC.
[Figure 9: share of computation time per block: B1 pre-processing 5%, B2
image alignment 20%, B3 depth estimation 70%, B4 image stitching 5%. Plot:
image output from each block (MB), 100 to 500, for Sensor, B1, B2, B3, B4.]
Fig. 9. Computation distribution and output data size for blocks in a VR
video pipeline (2 of 16 cameras).
B. BSSA accelerator design on FPGAs
We design and implement our processing flow in Verilog
on the Xilinx Zynq-7020 SoC [39]. Figure 8 depicts the high-
level architecture of our system. We implement the initial
full pipeline in software to run on the Zynq’s CPU, and then
design an AXI-Stream-compliant FPGA accelerator for depth
refinement that can be invoked by the software. The CPU
prepares the bilateral-grid data structure with pixels mapped
to grid vertices, and transfers them via DMA to the FPGA
fabric. The hardware accelerator processes the vertices with
the bilateral-space filtering and streams them back to the CPU,
where the bilateral-grid-filtered result is converted into the
fully-processed depth map.
Figure 9 shows the processing break-down for our pipeline
in time consumption and the image data size produced by each
block. We find that the depth estimation block, B3, consumes
the most computation time and also takes in the largest amount of
data, produced by B2. We thus focus on applying FPGA acceleration
to this block, and then evaluate the impact of accelerating this
block on pipeline throughput.
Applying the computation of B3 to a high-resolution video
is equivalent to applying millions of blurs to the bilateral grid
representation of the video frames. Across a single frame, most
of these filters can run in parallel, so we designed streaming
compute units to run bilateral filters on a stream of grid
vertices. We find that BSSA requires at least 32-bit floating-
point precision to produce high-quality depth maps, and use
DSP units on the FPGA fabric to compute efficient floating-
point operations. Each compute unit requires 18 DSP units in
our design, so we can scale up to 12 parallel compute units on
the ZC702. However, we project that if we scale up to a
top-of-the-line Xilinx Virtex UltraScale+ FPGA, we can parallelize
up to 682 compute units, which are more than enough for
real-time operation. Table I summarizes the setup we use in our
evaluation and resource requirements for real-time performance
with a 16-camera system.
[Figure 10: FPS uploaded (0 to 30), shown as compute, communication, and
total, for nine pipeline configurations: sensor; sensor + B1; sensor + B1 +
B2; sensor + B1 + B2 + B3 (CPU, GPU, or FPGA); sensor + B1 + B2 + B3 + B4
(CPU, GPU, or FPGA).]
Fig. 10. Pipeline configurations with different bilateral smoothing implementations (CPU, GPU, FPGA), and resulting upload rates (frames per second). Only
the full pipeline with FPGA acceleration can meet a 30 FPS upload requirement.
TABLE I
REQUIREMENTS FOR FPGA ACCELERATION PLATFORM.
          Resource      Evaluation   Target
System    FPGA Model    Zynq-7000    Virtex UltraScale+
          FPGA (#)      1            16
          Cameras       2            16
Per FPGA  Logic         45.91%       67.10%
          RAM           6.70%        17.60%
          DSP           94.09%       99.98%
          Clock (MHz)   125          125
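The compute-unit counts quoted above follow from a simple DSP budget. The sketch below reproduces the arithmetic; the figure of 18 DSP units per compute unit comes from the text, while the DSP slice totals (220 for the Zynq-7020 on the ZC702, 12,288 for a large Virtex UltraScale+ part such as the VU13P) are assumptions recalled from vendor product tables and should be checked against the datasheets.

```python
DSP_PER_COMPUTE_UNIT = 18  # stated in the text

# Assumed DSP slice totals per device (not given in the text; verify before use).
ASSUMED_DSP_SLICES = {
    "Zynq-7020 (ZC702)": 220,
    "Virtex UltraScale+ (VU13P)": 12288,
}


def max_compute_units(dsp_slices, dsp_per_unit=DSP_PER_COMPUTE_UNIT):
    """Largest number of whole compute units that fit in the DSP budget."""
    return dsp_slices // dsp_per_unit


units = {name: max_compute_units(n) for name, n in ASSUMED_DSP_SLICES.items()}
# Under these assumptions: 220 // 18 = 12 and 12288 // 18 = 682.
```

This treats DSP slices as the only limiting resource, which is consistent with Table I, where DSP utilization is far higher than logic or RAM utilization.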
C. Evaluation
Experimental setup. We compare our FPGA results on the
Zynq platform to CPU and GPU baselines. The Zynq includes
a dual-core ARM Cortex-A9 and a Xilinx FPGA, all fabricated
in TSMC 28 nm technology. We implement the CPU baseline
on the Zynq’s dual-core ARM Cortex-A9 as a proxy for a
mobile-grade CPU, and evaluate the GPU on an NVIDIA Quadro
K2200. Both baselines execute optimized BSSA code written
and tuned with Halide [28].
Methodology. We consider the throughput of the data output
as the “communication cost” for offloading, and the cost to
compute the pipeline block as the “computation cost”. We treat
the communication cost as fixed for each block; it is simply
the cost of offloading the data from each block, as shown in
Figure 9. For all blocks except disparity refinement, we assume
the computation cost to be the compute time evaluated using
the ARM CPU baseline’s performance numbers. We average
the compute time for the disparity refinement block over five
executions of the kernel over a frame. Because this processing
flow can be pipelined across frames in a video stream, the
“total cost” of the system can be considered to be dominated
by the lowest-throughput block of the system.
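Because every stage is pipelined, the end-to-end rate is the minimum over the per-block compute rates and the upload rate. The sketch below states that rule directly; the function names and all frame rates in the example are hypothetical and are not measurements from this evaluation.

```python
def pipeline_fps(compute_fps_per_block, upload_fps):
    """Throughput of a fully pipelined system: the slowest stage sets the rate."""
    return min(min(compute_fps_per_block), upload_fps)


def is_real_time(compute_fps_per_block, upload_fps, target_fps=30.0):
    """True only if both computation and communication sustain the target rate."""
    return pipeline_fps(compute_fps_per_block, upload_fps) >= target_fps


# Hypothetical configurations (frames per second for B1..B4, then upload rate).
accelerated = pipeline_fps([100.0, 80.0, 40.0, 90.0], upload_fps=32.0)
slow_block = pipeline_fps([100.0, 80.0, 5.0, 90.0], upload_fps=32.0)
slow_link = pipeline_fps([100.0, 80.0, 40.0, 90.0], upload_fps=4.0)
```

The three cases show why a configuration fails the real-time test if either a single block or the network link falls below the target, regardless of how fast the other stages are.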
Computation-communication tradeoffs. Figure 10 shows
the runtime results of different pipeline configurations, uploaded
on a networked connection to a viewing device supporting at
least 30 FPS. We seek to uncover scenarios in which both
computation and communication surpass our minimum frame
rate of 30 FPS; if either falls below the threshold,
the system cannot support real-time operation.
For the first three scenarios, the little computation done
before offloading is cheap, even on the ARM core, but
the communication cost for the raw captured data falls short
of our 30 FPS threshold. Computing the disparity refinement
in B3 is more costly, and the CPU and GPU implementations
are not fast enough to support real-time operation. Moreover,
the cost of offloading the computed depth maps before image
stitching is significantly lower.
The computation cost of image stitching in B4 is marginal
compared to BSSA, as well, and the resulting FPS is virtually
the same. The data size to communicate after B4, however,
is much smaller, as illustrated in Figure 9, and is the only
data size small enough to support real-time uploading. We find
that the configuration with all the blocks processed in-camera
and B3 mapped to the FPGA is the only configuration where
both computation and communication pass the threshold and
support real-time processing.
Our analysis indicates that this camera system is primarily
constrained by network bandwidth. For our evaluation, we
assumed transfer speeds of 25 Gigabit Ethernet. As network
connections grow faster, our results will trend towards
offloading computation directly off the sensor. For instance, over a
hypothetical ultra-high-throughput 400-Gb Ethernet link,
the 16-camera output can be uploaded at 395 FPS,
reducing the efficiency incentive for in-camera processing in
this scenario.
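The bandwidth argument can be checked with back-of-the-envelope arithmetic. The sketch below takes the 400-Gb Ethernet and 395 FPS figures from the text, derives the implied data volume per 16-camera frame set, and re-evaluates it at the 25-Gb Ethernet rate assumed in our evaluation. It ignores protocol overhead and assumes the link is fully utilized, so the result is an estimate only; the function names are introduced here for illustration.

```python
def gbits_per_frame(link_gbps, fps):
    """Data per frame set implied by a link rate and an achieved frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return link_gbps / fps


def upload_fps(link_gbps, frame_gbits):
    """Frames per second a link can carry, ignoring protocol overhead."""
    if frame_gbits <= 0:
        raise ValueError("frame size must be positive")
    return link_gbps / frame_gbits


# From the text: 400-Gb Ethernet carries the 16-camera output at 395 FPS.
raw_frame_gbits = gbits_per_frame(400.0, 395.0)

# Same payload over the 25-Gb Ethernet link assumed in the evaluation.
fps_25g = upload_fps(25.0, raw_frame_gbits)

print(round(raw_frame_gbits, 3), round(fps_25g, 1))
```

Under these assumptions, the raw output amounts to roughly 1 Gb per frame set and uploads at about 24.7 FPS over 25-Gb Ethernet, which is consistent with raw offloading falling short of the 30 FPS threshold.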
V. CONCLUSIONS
Cameras have become the dominant sensor in mobile
systems, and complex image processing pipelines are now
standard. In this paper, we use the notion of “in-camera
processing pipelines” to thoroughly characterize the design
of two camera systems at the extreme ends of the energy
and performance scaling limits of current hardware. Our face
authentication camera system, for instance, runs entirely on
harvested energy, pushing the limits of ultra-low power compu-
tation. Our virtual reality camera system requires significantly
more in-camera processing and data communication resources
than traditional imaging platforms. Our results highlight how
design parameters for individual accelerators can influence
the full-system execution behavior, as well as shape decisions
about whether to process a computation block in the camera
or offload the computation.
We characterize in detail how even the most power-efficient
neural network design performs significantly better when
computation is added earlier in the pipeline to effectively filter
the image data. Our VR pipeline highlights how computational
stages that expand the data size are inefficient in isolation,
and can be better optimized in concert with their downstream
components.
Power and performance constraints require increasingly
efficient computational platforms, and architects will continue
to look to hardware acceleration to enable challenging applications.
As we demonstrate in this paper, even tightly optimized
accelerators can fail to improve performance if their design
does not account for full-system communication challenges. Given the
growth of image data production and requirements of modern
vision and graphics algorithms, future applications require a
full-system approach to maintain the power and performance efficiency
of camera designs.
VI. ACKNOWLEDGEMENTS
We thank members of the Sampa lab and the anonymous
reviewers for their feedback on earlier versions of this work.
This work was supported in part by the National Science
Foundation under Grant CCF-1518703, by C-FAR, one of the
six SRC STARnet Centers, sponsored by MARCO and DARPA,
and by a generous gift from Google.
REFERENCES
[1] “Vision chips,” http://www.centeye.com/technology/
vision-chips/, accessed: 2017-06-06.
[2] A. Alaghi, C. Li, and J. Hayes, “Stochastic circuits for
real-time image-processing applications,” in Design Au-
tomation Conference (DAC), 2013 50th ACM/EDAC/IEEE,
May 2013.
[3] R. Anderson, D. Gallup, J. T. Barron, J. Kontkanen,
N. Snavely, C. Herna´ndez, S. Agarwal, and S. M. Seitz,
“Jump: Virtual reality video,” SIGGRAPH Asia, 2016.
[4] J. T. Barron, A. Adams, Y. Shih, and C. Hernandez, “Fast
bilateral-space stereo for synthetic defocus,” in The IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), June 2015.
[5] B. Barry, C. Brick, F. Connor, D. Donohoe, D. Moloney,
R. Richmond, M. O’Riordan, and V. Toma, “Always-on
vision processing unit for mobile applications.” IEEE
Micro, vol. 35, no. 2, pp. 56–66, 2015.
[6] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H.-J.
Yoo, “A 0.62 mw ultra-low-power convolutional-neural-
network face-recognition processor and a cis integrated
with always-on haar-like face detector,” in Solid-State
Circuits Conference (ISSCC), 2017 IEEE International.
IEEE, 2017, pp. 248–249.
[7] N. Chandramoorthy, G. Tagliavini, K. Irick, A. Pullini,
S. Advani, S. A. Habsi, M. Cotter, J. Sampson,
V. Narayanan, and L. Benini, “Exploring architectural
heterogeneity in intelligent vision systems,” in 2015
IEEE 21st International Symposium on High Performance
Computer Architecture (HPCA), Feb 2015, pp. 1–12.
[8] H. G. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivara-
makrishnan, A. Veeraraghavan, and A. Molnar, “Asp
vision: Optically computing the first layer of convolutional
neural networks using angle sensitive pixels,” in The IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), June 2016.
[9] J. Cho, B. Benson, S. Mirzaei, and R. Kastner, “Par-
allelized Architecture of Multiple Classifiers for Face
Detection,” in 20th IEEE International Conference on
Application-specific Systems, Architectures and Proces-
sors, 2009. ASAP 2009, Jul. 2009, pp. 75–82.
[10] J. Clemons, C.-C. Cheng, I. Frosio, D. Johnson, and
S. W. Keckler, “A patch memory system for image
processing and computer vision,” in 2016 49th IEEE/ACM
International Symposium on Microarchitecture, 2016.
MICRO-49, Oct. 2016.
[11] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo,
X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting
Vision Processing Closer to the Sensor,” in Proceedings of
the 42Nd Annual International Symposium on Computer
Architecture, ser. ISCA ’15. New York, NY, USA: ACM,
2015, pp. 92–104.
[12] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun,
and E. Culurciello, “Hardware accelerated convolutional
neural networks for synthetic vision systems,” in Cir-
cuits and Systems (ISCAS), Proceedings of 2010 IEEE
International Symposium on, May 2010, pp. 257–260.
[13] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culur-
ciello, and Y. LeCun, “NeuFlow: A runtime reconfigurable
dataflow processor for vision,” in 2011 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW), Jun. 2011.
[14] C. Garcia and M. Delakis, “Convolutional face finder:
A neural architecture for fast and robust face detection,”
Pattern Analysis and Machine Intelligence, IEEE Trans-
actions on, vol. 26, no. 11, pp. 1408–1423, 2004.
[15] Google, “Cardboard - google vr,” https://vr.google.com/
cardboard/, accessed: 2017-06-06.
[16] S. Han and M. Philipose, “The case for onloading
continuous high-datarate perception to the phone,” in
Proceedings of the 14th USENIX Conference on Hot
Topics in Operating Systems, ser. HotOS’13. Berkeley,
CA, USA: USENIX Association, 2013.
[17] J. Hauswald, T. Manville, Q. Zheng, R. Dreslinski,
C. Chakrabarti, and T. Mudge, “A hybrid approach to
offloading mobile image classification,” in Acoustics,
Speech and Signal Processing (ICASSP), 2014 IEEE
International Conference on. IEEE, 2014, pp. 8375–
8379.
[18] D. Hefenbrock, J. Oberg, N. T. N. Thanh, R. Kastner,
and S. B. Baden, “Accelerating viola-jones face detection
to fpga-level using gpus,” in 2010 18th IEEE Annual
International Symposium on Field-Programmable Custom
Computing Machines. IEEE, 2010.
[19] M. Hiromoto, K. Nakahara, H. Sugano, Y. Nakamura,
and R. Miyamoto, “A Specialized Processor Suitable
for AdaBoost-Based Detection with Haar-like Features,”
in IEEE Conference on Computer Vision and Pattern
Recognition, 2007. CVPR ’07, Jun. 2007, pp. 1–8.
[20] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller,
“Labeled faces in the wild: A database for studying face
recognition in unconstrained environments,” University
of Massachusetts, Amherst, Tech. Rep. 07-49, October
2007.
[21] R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong,
“Redeye: Analog convnet image sensor architecture for
continuous mobile vision,” in Proceedings of the 43nd An-
nual International Symposium on Computer Architecture
(ISCA). New York, NY, USA: ACM, 2016.
[22] R. LiKamWa, Z. Wang, A. Carroll, F. X. Lin, and
L. Zhong, “Draining our glass: An energy and heat
characterization of google glass,” in Proceedings of 5th
Asia-Pacific Workshop on Systems, ser. APSys ’14. New
York, NY, USA: ACM, 2014.
[23] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool,
“Face detection without bells and whistles,” in ECCV 2014,
2014.
[24] T. Moreau, M. Wyse, J. Nelson, A. Sampson, H. Es-
maeilzadeh, L. Ceze, and M. Oskin, “SNNAP: Approx-
imate Computing on Programmable SoCs via Neural
Acceleration,” in International Symposium on High-
Performance Computer Architecture (HPCA), 2 2015.
[25] S. Naderiparizi, Y. Zhao, J. Youngquist, A. P. Sample,
and J. R. Smith, “Self-localizing battery-free cameras,”
in Proceedings of the 2015 ACM International Joint
Conference on Pervasive and Ubiquitous Computing, ser.
UbiComp ’15. ACM, 2015.
[26] S. Nissen, “Implementation of a fast artificial neural
network library (fann),” Department of Computer Science
University of Copenhagen (DIKU), Tech. Rep., 2003,
http://fann.sf.net.
[27] Qualcomm, “Snapdragon,” https://www.qualcomm.com/
products/snapdragon/processors, accessed: 2017-06-06.
[28] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand,
and S. Amarasinghe, “Halide: a language and compiler
for optimizing parallelism, locality, and recomputation
in image processing pipelines,” ACM SIGPLAN Notices,
vol. 48, no. 6, pp. 519–530, 2013.
[29] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-
based face detection,” IEEE TRANSACTIONS ON PAT-
TERN ANALYSIS AND MACHINE INTELLIGENCE,
vol. 20, no. 1, pp. 23–38, 1998.
[30] M. Satyanarayanan, “Pervasive computing: vision and
challenges,” IEEE Personal Communications, vol. 8, no. 4,
pp. 10–17, Aug 2001.
[31] D. Scharstein and R. Szeliski, “A taxonomy and evaluation
of dense two-frame stereo correspondence algorithms,”
International journal of computer vision, vol. 47, no. 1-3,
pp. 7–42, 2002.
[32] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deep-
face: Closing the gap to human-level performance in face
verification,” in Computer Vision and Pattern Recognition
(CVPR), 2014 IEEE Conference on, June 2014.
[33] Teradek, “Sphere - real-time 360 monitoring and live
streaming,” http://teradek.com/collections/sphere-family/,
accessed: 2017-06-06.
[34] C. Ttofis and T. Theocharides, “High-quality real-time
hardware stereo matching based on guided image filtering,”
in Proceedings of the Conference on Design, Automation
& Test in Europe, ser. DATE ’14. 3001 Leuven, Belgium,
Belgium: European Design and Automation Association,
2014, pp. 356:1–356:6.
[35] A. Vasilyev, N. Bhagdikar, A. Pedram, S. Richard-
son, S. Kvatinsky, and M. Horowitz, “Evaluating pro-
grammable architectures for imaging and vision applica-
tions,” in 2016 49th IEEE/ACM International Symposium
on Microarchitecture, 2016. MICRO-49, Oct. 2016.
[36] Videostitch, “Vahana vr,” http://www.video-stitch.com/
live-vr/, accessed: 2017-06-06.
[37] P. Viola and M. J. Jones, “Robust real-time face detection,”
International journal of computer vision, vol. 57, no. 2,
2004.
[38] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale
structural similarity for image quality assessment,” in Sig-
nals, Systems and Computers, 2004. Conference Record
of the Thirty-Seventh Asilomar Conference on, vol. 2.
Ieee, 2003.
[39] Xilinx, “Socs & mpsocs,” http://www.origin.xilinx.com/
products/silicon-devices/soc.html, accessed: 2017-06-06.
[40] Q. Yang, “Hardware-efficient bilateral filtering for stereo
matching,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 36, no. 5, May 2014.
