GPU-accelerated real-time stixel computation by Hernandez-Juarez, Daniel et al.
GPU-accelerated real-time stixel computation
D. Hernandez-Juarez1, A. Espinosa1, D. Va´zquez2, A. M. Lo´pez2, and
J. C. Moure1
1Computer Architecture & Operating Systems Department (CAOS) at Universitat Autonoma de Barcelona
2Computer Vision Center
October 14, 2016
Abstract
The Stixel World is a medium-level, compact represen-
tation of road scenes that abstracts millions of disparity
pixels into hundreds or thousands of stixels. The goal
of this work is to implement and evaluate a complete
multi-stixel estimation pipeline on an embedded, energy-
efficient, GPU-accelerated device. This work presents
a full GPU-accelerated implementation of stixel estima-
tion that produces reliable results at 26 frames per sec-
ond (real-time) on the Tegra X1 for disparity images
of 1024×440 pixels and stixel widths of 5 pixels, and
achieves more than 400 frames per second on a high-end
Titan X GPU card.
1 Introduction
Advanced driver assistance systems (ADAS), au-
tonomous vehicles, robots and other intelligent devices
can estimate the distance of objects and the free space
in a given scene by computing depth information from
stereo camera systems or LIDARs. The large amount of
low-level per-pixel depth data is very costly to process
and commonly a medium-level representation known as
the stixel world [1] is used for road scenes. It relies on
the fact that man-made environments mostly present hor-
izontal and vertical planar surfaces, like roads, sidewalks
or soil (horizontal), and buildings, pedestrians or cars
(vertical).
Stixels are segments of image columns that represent
obstacles. They provide a compact representation that
converts millions of disparity pixels to hundreds or thou-
sands of stixels. Pfeiffer and Franke [12] proposed an ex-
tended representation that allows multiple stixels per col-
umn, providing a richer representation of the scene (See
Fig. 1). Stixels are the basis for multiple extensions such
as tracking [11], grouping [5] or semantics [15], and also
serves as the input for further processing, like pedestrian
detection [2].
Calculating stixels is a complex task, comparable to
that of generating dense stereo information. As a conse-
quence, the algorithm implemented on a multi-core CPU
by [10] does not fulfill the real-time nor the energy-
efficiency requirements of autonomous driving applica-
tions. Dedicated hardware designs (e.g. FPGA or ASIC)
may achieve these goals, but are very inflexible and costly
regarding changes in the algorithms, like combining stix-
els and semantic segmentation [15]. We explore GPU ac-
celeration as an alternative.
The appearance of embedded GPU-accelerated sys-
tems, like the NVIDIA Jetson TX1 and DrivePX plat-
forms, opens the door for low-cost and low-energy con-
sumption, real-time stixel computation. GPUs are very
well suited for algorithms exhibiting massive and embar-
rassing parallelism, but may suffer high performance in-
efficiencies with algorithms that contain inherent depen-
dencies, as those using dynamic programming techniques
like the stixel algorithm presented by [12]. Careful work
distribution and task cooperation, coupled with an ap-
propriate data layout design, may overcome those diffi-
culties and achieve competitive performance. Recently,
1
ar
X
iv
:1
61
0.
04
12
4v
1 
 [c
s.C
V]
  1
3 O
ct 
20
16
Figure 1: Stixel world: taking the dense disparity map as the input, and estimating a certain ground slope and horizon
line, columns are segmented into stixels and classified into ground, object and sky categories.
[7] proved that GPU-acceleration of dense stereo compu-
tation using Semi Global Matching -mostly based on a
1D dynamic programming algorithm- can be successfully
achieved.
The objective of this work is to implement and evaluate
a GPU-accelerated software implementation1of a com-
plete multi-stixel estimation pipeline, a missing work up
to the best of our knowledge. We discuss the optimized
massively parallel schemes and data layouts of each of
the algorithms involved. Our proposal runs on a single
Tegra X1 chip almost two times faster (26 fps versus 13.3
fps) and achieves 25 times better performance per watt
ratio than the multicore implementation in [10], for the
same disparity image size (1024×440 px) and stixel width
(5 px), providing the same high-quality results. The pro-
posed design achieves 413 fps on a high-end Titan X GPU
card, more than 30 times faster than [10] with a similar
energy envelope.
The remainder of this paper is structured as follows.
Section 2 reviews the state of the art on stixel computa-
tion. Section 3 reviews stixel formulation. Section 4 ex-
plains general basic concepts of GPU optimization while
section 5 explains our proposed GPU-based optimizations
for real-time stixel computation. Finally, in section 6
the accuracy and performance of our proposed method is
evaluated.
1https://github.com/dhernandez0/stixels
2 Related work
The stixel world was introduced by Badino et al. [1] as an
intermediate representation suitable for high-level tasks
such as object detection or tracking [2, 11]. The ground
surface of a scene, also called free-space (space without
obstacles), is estimated using [17] from a depth map com-
puted with an stereo algorithm such as SGM [8]. Then,
dynamic programming is employed in order to find the
bottom of the stixels and their heights.
In order to achieve real-time stixel estimation on CPU,
Benenson et al. [3] developed a less accurate method to
calculate the ground surface by accumulating matching
costs into vertical-disparity space, and then computing the
bottom and height of stixels in disparity-cost space. Dis-
parities are computed only for object stixels and not for
ground.
The work from [1, 3] considers only a single stixel per
column, which represents an incomplete world model that
e.g. cannot represent a pedestrian and a building on the
same column. Pfeiffer and Franke [12] developed a uni-
fied probabilistic approach, also solved with dynamic pro-
gramming, that considers the occurrence of multiple stix-
els per column.
The stixel world is the fundamental building block for
more informative representations. Some extensions are
stixel tracking, which provides a vector of velocity for
each stixel [11]; stixel grouping, where stixels that seem
2
to belong to the same object are grouped together [5]; and,
finally, semantic stixels represent a combination of the
stixel world and semantic segmentation [15]. All these ex-
tensions would greatly benefit from real-time stixel com-
putation.
Muffert et al. [9] claim to run their FPGA implemen-
tation at 25 fps with stixel widths of 5 px, but the authors
do not indicate the image resolution.
3 The Stixel World
Stixels provide a compact representation of man-made
scenes, mainly roads, modeled as horizontal and verti-
cal surfaces, which identify the ground and the objects
on the scene, such as buildings, pedestrians, cars, or traf-
fic lights. We follow the stixel world model defined in
[10, 12], where the reader can find more details.
Stixels are segments of image columns, with a certain
height and distance, that are classified as ground, object or
sky. Fig. 1 shows an example with a detailed column that
is segmented into 5 different stixels. A comprehensive
example can be found at the end of the paper, in Fig. 6.
A hard assumption of the model is that stixels in differ-
ent columns are independent. The ground slope and hori-
zon line are assumed to be known for each image. The
stixel computation problem is addressed using a unified
probabilistic approach that incorporates real-world con-
straints such as perspective ordering, and is formulated as
a MAP estimation problem that delivers an optimal seg-
mentation of the columns with respect to the free-space
and obstacle information [12]. A Dynamic Programming
(DP) solving scheme incorporates the prior knowledge in
order to minimize the global cost of the solution.
The following constraints are modeled as prior proba-
bilities: (1) bayesian information criterion (BIC): there
is a small number of objects in the scene; (2) gravity
constraint: flying objects are unusual; (3) ordering con-
straint: the upper of two staggered stixels classified as
objects is expected to be further away; (4) staggering con-
straint: some configurations, like having a ground stixel
above a sky stixel, are impossible; and (5) diving con-
straint: the base point depth of an object stixel should be
equal or greater than the corresponding ground depth.
Next we will formalize the problem, define the DP re-
cursive equation, and discuss some algorithmic optimiza-
tions.
3.1 Formal description of MAP problem
The original disparity image D of w×h pixels is pre-
processed to convert each column of width s into a col-
umn of a single pixel, resulting in a reduced disparity im-
age D
′
of ws×h pixels. Since all columns are processed
independently, we restrict our description to finding the
optimal stixel segmentation of a single column Di.
A stixel is a segment sn = {vbn, vtn, cn}, where vbn and
vtn are, respectively, the base (beginning) and top (ending)
positions in the column (0≤vbn≤vtn≤h), and cn is a label
from class C = {object, ground, sky}.
Each stixel class defines a different theoretical model
for the disparity of each pixel v along the stixel, vbn ≤
v ≤ vtn. This model is defined as a function f :
• The function for ground stixels is defined to
match the disparity gradient of the ground surface:
fground(v) = α(vhorizon − v)
• The function for sky stixels is zero (modeling very
far away pixels): fsky(v) = 0
• The function for object stixels is supposed to have
constant disparity, but this value depends on the par-
ticular object corresponding to the stixel. We model
the function as fobjectn (v) = fn, and this constant
value is computed as the mean of the measured dis-
parities of the considered stixel.
L∈L is an ordered list of N consecutive, adjacent, and
non-overlapped stixels {sn}, 1≤n≤N≤h, where L is the
set of possible segmentations. Then, the stixel computa-
tion problem is formulated as a Maximum A Posteriori
(MAP) estimation problem:
L∗ = argmax
L∈L
P (L|Di) (1)
3.2 Data & Smoothness terms
Applying the Bayes’ theorem, the posterior probability
P (L|Di) in Eq. 1 can be rewritten as P (L|Di) ≈
P (Di|L)P (L), where the first term, or data term, is the
conditional probability of having the input column Di
3
given a labeling L, and the second term, or smoothness
term, is the prior probability of the configuration L.
It is assumed that the data term can be computed inde-
pendently for each pixel in the stixel and for all the stixels
in the segmentation. This allows to express the data term
with the following formula:
P (Di|L) =
N∏
n=1
vtn∏
v=vbn
PDi(dv|sn, v) (2)
The data term PDi(dv|sn, v) models the probability for
a single disparity measurement dv at row v to belong to
a given stixel sn. A sensor model is used to estimate the
likelihood (or cost) that the measured disparity data (dv)
matches the theoretical disparity model corresponding to
a given stixel (f(v)). Following the proposal in [10], we
define this model as a combination of a Gaussian and a
Uniform distribution.
PDi(dv|sn, v) = pout
drange
+
1− pout
Anorm
e−
1
2 (
dv−f(v)
σcn (f,v)
)2 (3)
The uniform distribution models the probability of the
occurrence of an outlier (pout is the outlier rate), some-
times due to non-valid disparity measurements or by in-
correctly matched pixels on the scene. drange is the to-
tal number of disparities considered in the input disparity
map. The Gaussian distribution assesses the affinity of
the measured disparity with the theoretical disparity func-
tion of the stixel. Anorm is a normalization term, and
σcn(f, v) is a sigmoid function that incorporates the noise
model for the disparity measurement.
In order to avoid numerical problems with small mag-
nitudes of the individual probabilities and to simplify Eq.
3, the MAP estimation problem is expressed using the
logarithm of the likelihoods instead of the actual likeli-
hood. The optimization problem is then converted to a
cost minimization problem. Following is the final equa-
tion to model the data term cost of a given pixel v be-
longing to a given stixel sn. The data term cost of a
stixel is the aggregation of the costs of all of its pixels,
Cdata(sn) =
∑vtn
v=vbn
Cdata(dv|sn, v)
Cdata(dv|sn, v) = min(log(drange)− log(pout),
log(Anorm) + log(σcn(f, v)×
√
2pi)− log(1− pout)
+
1
2σ2cn(f, v)
(dv − f(v))2
(4)
The prior probability, or smoothness term, models
the real-world constraints described at the beginning of
this section, and is defined as a set of cost tables (log-
likelihoods instead of actual probabilities). These con-
straints only consider the likelihood of the first stixel and
the pairwise mutual dependencies of a pair of adjacent
stixels Cprior(sn, sn−1). In our proposal we use the same
model and parameters described in [10, 12].
3.3 Solving Stixels with Dynamic Program-
ming
Dynamic Programming solves problems by breaking
them down to simpler subproblems and storing the partial
solutions on a memory structure. This way, when a given
subproblem appears again, computation time is saved by
retrieving the partial solution from the memory structure
and not by solving the same subproblem repeatedly.
The dynamic programming scheme is used for comput-
ing the stixel segmentationL= {sn}with minimum global
cost for a column Di. The global cost is composed of a
data term Cdata(L) =
∑
sn∈L Cdata(sn) and a smooth-
ness term Cprior(L) =
∑N
n=1 Cprior(sn, sn−1). For that
purpose, we need to express the optimization problem as
a composition of smaller sub-problems.
In order to simplify our description we use a special
notation to refer to the three different types of stixels
considered in this work: obtb = {vb, vt, object}, grtb =
{vb, vt, ground}, and sktb = {vb, vt, sky}. Similarly,
OBk, GRk, and SKk are defined as the minimum aggre-
gated cost of the best segmentation of columnDi from po-
sition 0 to position k, both included, for three cases: each
case corresponds to a segmentation ending on a stixel of
the corresponding class. The stixel at the end of the seg-
mentation associated with each minimum cost is denoted
as obk, grk, and skk, respectively. We next show a re-
4
cursive definition of the problem that can be solved by
dynamic programming:
OB0 = Cdata(ob
0
0) + Cprior(ob
0
0)
GR0 = Cdata(gr
0
0) + Cprior(gr
0
0)
SK0 = Cdata(sk
0
0) + Cprior(sk
0
0)
(5)
OBk = min

Cdata(ob
k
0) + Cprior(ob
k
0)
Cdata(ob
k
1) + Cprior(ob
k
1 , ob
0) +OB0
Cdata(ob
k
1) + Cprior(ob
k
1 , gr
0) +GR0
Cdata(ob
k
1) + Cprior(ob
k
1 , sk
0) + SK0
...
Cdata(ob
k
k) + Cprior(ob
k
k, ob
k−1) +OBk−1
Cdata(ob
k
k) + Cprior(ob
k
k, gr
k−1) +GRk−1
Cdata(ob
k
k) + Cprior(ob
k
k, sk
k−1) + SKk−1
(6)
Eq. 5 represents the base case problem: segmenting
a column of the single pixel at the bottom. Eq. 6 indi-
cates how to solve a problem of size k using the solutions
for smaller problems, computed so far. We only show the
case for object stixels, but the other cases are solved sim-
ilarly. All the possible object stixels ending at position
k (and starting at positions from 0 to k) are connected
with the last stixel of the minimal cost segmentations of
the corresponding size, which were previously computed.
Connections are evaluated for the three stixel classes us-
ing the smoothness term (prior model).
All the partial solutions OBk, GRk, and SKk, are
stored in a cost table C during the solving steps of the
recurrent algorithm. As shown by Eq. 6, solving a
sub-problem of size k using the previous solutions for
j:0≤j≤k, requires considering the k possible positions of
a cut between stixels and the 3 possible classifications of
the stixels. Assuming that the number of different classes
is constant (3 classes), the complexity of the stixel esti-
mation problem for a single column is O(h× h), and the
complexity for the stixel segmentation of the whole dis-
parity image is O(h2 × ws ).
Once the cost table C is completely calculated, and in
order to find the optimal configuration of stixels, a back-
tracking procedure has to be performed starting from the
top row of C and computing the successive minimum
value Ch−1min = min(OB
h−1, GRh−1, SKh−1). This
task is sped up by using an index table that links each
stixel to the next stixel with minimum cost, which is up-
dated during the solving process.
3.4 Using LUTs to optimize performance
The usage of Look-Up Tables (LUTs) reduces the amount
of computation and memory accesses required to solve
the problem, and assures that the algorithmic complexity
is the one calculated so far. First, we review some opti-
mizations presented in [10], and then present a new one
that provides good results both on CPU and GPU.
Computing the cost for each of the stixels considered
when solving the problem using Eq. 4 is very expensive.
But most of the terms in the equation do not depend on the
input data and can be pre-computed. Accordingly, only
the last term has to be actually computed for each dis-
parity measurement (dv). We must consider two different
cases.
The cost for pixels classified as ground or sky only
depends on the current disparity of the pixel. Then, the
response of the sensor model provided by Eq. 4 can be
pre-computed for each of the disparities in the input col-
umn, and stored in the LUT. Therefore, the total compu-
tation complexity for computing and storing the cost for
ground and sky is O(h× ws ).
The cost of a pixel classified as object, though, de-
pends not only on the current disparity of the pixel but
also on the mean disparity of the segment. In order to
limit the total amount of possible input combinations and
to reduce the complexity of pre-computing all the corre-
sponding cost values, we round the mean disparities into
integer values. This approach provides satisfactory qual-
ity results while reducing the number of pre-computed
values to all the possible combinations of the h dispar-
ity values on each input column and the drange possible
average disparities, which account for a total of h×drange
values.
The algorithmic improvement with more impact on per-
formance comes from the use of prefix sums [4] to reduce
the total amount of operations needed to calculate the total
cost of the pixels of a stixel. The prefix sum of a vector of
numbers is a new vector that holds the accumulated cost
corresponding to the first k pixels. Prefix sums extended
5
to 2D or 3D matrices are known in the field of image pro-
cessing as integral images or summed-area tables [16].
The LUTs described so far contain the prefix sum of
the costs corresponding to each pixel in the input column.
Then, calculating the cost of a stixel sn = {vbn, vtn, cn}
is done in constant time just by subtracting two numbers
in the table, independently of the size of the stixel. The
LUTs of the pre-computed cost for ground or sky stixels
are indexed just by the positions of the first and last pixels
of the stixel: Cdata(sn) = LUTcn [v
t
n] − LUTcn [vb−1n ].
The LUT of the pre-computed cost for object stixels
is indexed by the positions of the stixel but also by
the average disparity of the stixel (fn): Cdata(sn) =
LUTobject[fn][v
t
n] − LUTobject[fn][vbn − 1]. Again, the
average disparity of a given stixel, fn, is computed in con-
stant time using a pre-computed prefix sum of the dispar-
ities of the pixels in the processed column.
In our design we propose using a new LUT containing
the pre-computed costs for all possible pairs of pixel dis-
parity and mean disparity (a 2D matrix of drange×drange
elements). Since the contents of this table are indepen-
dent on the input data, the table can be computed off-line
and avoids any computation from Eq. 4 to be executed
during the normal process of stixel estimation. We have
experimentally verified that this new LUT improves the
performance both on CPU and on GPU.
To summarize, some LUTs are computed off-line,
while the LUTs containing prefix sums must be computed
for each new disparity image. The higher computational
work of generating the LUTs corresponds to the creation
of the two-dimensional object LUTs, with a complexity
of O(h × drange × ws ) operations per input image. This
complexity is comparable to the complexity of the dy-
namic programming step as long as h is higher or equal
to drange, which is often the case.
4 GPU architecture and perfor-
mance
GPUs are massively-parallel devices containing tens of
throughput-oriented processing units called streaming
multiprocessors (SMs). Memory and compute operations
are executed as vector (SIMD) instructions and are highly
pipelined in order to save energy and transistor budged.
SMs can execute several vector instructions per cycle,
selected from multiple independent execution flows: the
higher the available thread-level parallelism the better the
pipeline utilization.
The CUDA programming model allows defining a mas-
sive number of potentially concurrent execution instances
(called threads) of the same program code. A unique
two-level identifier <ThrId, CTAid> is used to spe-
cialize each thread for a particular data and/or function.
A CTA (Cooperative Thread Array) comprises all
the threads with the same CTAid, which run simulta-
neously and until completion in the same SM, and can
share a fast but limited memory space: the so-called
SharedMemory. Warps are groups of threads with
consecutive ThrIds in the same CTA that are mapped by
the compiler to vector instructions and, therefore, advance
their execution in a lockstep synchronous way. The warps
belonging to the same CTA can synchronize using an ex-
plicit barrier instruction. Each thread has its own private
LocalMemory space (commonly assigned to registers by
the compiler), while a large space of GlobalMemory is
public to all execution instances (mapped into a large-
capacity but long-latency device memory, which is accel-
erated using a two-level hierarchy of cache memories).
The parallelization scheme of an algorithm and the data
layout determine the available parallelism at the instruc-
tion and thread level (required for achieving full resource
usage) and the memory access pattern. GPUs achieve ef-
ficient memory performance when the set of addresses
generated by a warp refer to consecutive positions that
can be coalesced into a single, wider memory transac-
tion. Since the bandwidth of the device memory can be a
performance bottleneck, an efficient CUDA code should
promote data reuse on shared memory and registers.
5 Massive Parallelization
This section describes and discusses the parallelization
schemes and data layouts used for the algorithms involved
in the stixel computation pipeline.
5.1 Column Reduction and Transpose
Stixels are computed on columns of width s and height
h. The first step in the pipeline is to reduce the width
6
Figure 2: Column Reduction and Transpose of input Dis-
parity image: parallel scheme and computational analysis
of each column into a single pixel by replacing the dis-
parities of s consecutive pixels in the same row by their
average. The input disparity image arranges pixels into
consecutive rows of memory (row-wise), but this is not
the appropriate data layout for the tasks that are required
later, where information is processed in columns. There-
fore, we fuse a transpose operation with the column re-
duction operation described so far into the same algorith-
mic step. The scheme of the operation is shown in Fig.
2.
The data access pattern of the algorithm exhibits no
data reuse and, therefore, thread cooperation is not re-
quired. As shown in Fig. 2, work is distributed by as-
signing to each thread the task of reducing the dispari-
ties of s consecutive pixels into a single output dispar-
ity value, which must be stored in a transposed position.
Each thread reads s consecutive input disparities, com-
putes their mean, and writes one result. We further assign
the consecutive threads of a warp to consecutive output
positions, so that writes are coalesced and write perfor-
mance is maximized. Reads, though, are not coalesced
(a typical situation on transpose operations) and will per-
form sub-optimally. We improve read performance by
reading consecutive pixels in groups of 4 or 2 when pos-
sible. A collaborative read strategy using Shared Memory
would slightly improve performance but, since this pro-
cessing step represents a very small amount of time on
the pipeline (≤ 0.5%), we preferred a simple solution.
5.2 Pre-computation of LUTobject
As explained in subsection 3.4, a specific look-up table
for the object data-term (LUTobject) has to be generated
for each input columnDi, for a total ofO(h×drange×ws )
output values. Since LUTobject is too large to fit into
Shared Memory, it must be written to Global Memory,
and then we can isolate this task from the whole process-
ing pipeline without losing performance.
As shown in Fig. 3, work is distributed by assigning
to a single warp the task of generating one row of the
LUTobject matrix corresponding to a single input column
Di. Warps in the same CTA cooperate by reading the in-
put column into Shared Memory. Then, each warp com-
putes the prefix sum of the cost vector corresponding to
one row in the LUT. The warp iterates on the h elements
of Di, processing warpsize = 32 elements on each it-
eration step. Data read from LUTcost is used straight-
forwardly to compute the prefix sum directly into regis-
ters (Local Memory), using register-to-register shuffle
instructions, and affording memory reads and writes. No
explicit synchronization is required when operating warp-
wise, as shown by [6].
The performance bottleneck of this stage, which repre-
sents less than 5% of the time of the whole pipeline, is the
write bandwidth to the external device memory.
5.3 Dynamic Programming (DP) stage
This is the most time-consuming (≥ 95%) processing
step, and also the most elusive for massive parallelization.
Each input column Di can be processed independently to
generate a list of estimated stixels, but this amount of par-
allelism is not enough to efficiently exploit current GPUs
for the image sizes considered in most real-world appli-
cations. The challenge is to extract fine-grain parallelism
inside the DP task corresponding to each input column.
Processing columnDi involves multiple repeated reads
to all the corresponding LUTs, and reading and updating
the contents of the corresponding cost table Ci, as shown
(yellow) in Fig. 4. In order to improve data access per-
formance we promote data reuse by assigning a separate
Cooperative Thread Array (CTA) of h threads to each
DP task. Before performing the actual DP solving pro-
cess, threads cooperate to copy some general LUTs from
Global Memory to Shared Memory and to compute the
7
Figure 3: Pre-computation of LUTobject: parallel scheme
and computational analysis
prefix sums of LUTground, LUTsky and Di into Shared
Memory using [6]. LUTobjecti is the only data struc-
ture that does not fit into Shared Memory and must be
accessed from Global Memory.
The DP recurrence shown in Eq. 6 formulates how to
calculate the minimum cost of a sub-problem with k pix-
els using the results computed for smaller sub-problems.
We distribute the DP solving task by making each CTA
thread responsible of computing the minimum cost for
each problem size k (0≤k<h). Two issues are derived
from Eq. 6 that complicate parallel execution: (1) the
work assigned to each thread is not well balanced, since
it is proportional to k (see the sequence of steps depicted
in Fig. 4); and (2) there are data dependencies that must
be preserved using synchronization operations. The cost
table Ci (in Shared Memory) is used to communicate par-
tial results between threads, and barrier synchronization
is used to enforce dependencies among the consecutive
DP steps. Also, every step of the DP solving process de-
creases the number of active parallel threads.
Synchronization barriers between recurrent steps, re-
duced warp parallelism in the CTA as the recurrence loop
Figure 4: Dynamic Programming (DP) stage: parallel
scheme and computational analysis
advances (an average of half the warps are active on each
CTA), and moderate warp divergence (an average of half
the threads are active on the last warp), prevent using the
available computation resources efficiently, making per-
formance latency-bounded.
5.4 Backtracking
The backtracking step is an inherently sequential process
(for each column). As described in subsection 3.3, it nav-
igates back on an index table created during the DP solv-
ing stage (not shown in the figures) and produces a list of
stixels with the optimal configuration for the column (see
Fig. 5).
The lack of parallelism in this final step seems to dis-
courage a GPU implementation, but we have found that
the time to transfer the resulting index tables to the CPU,
or even from Shared Memory to Global Memory, is higher
than the time to perform the task on the GPU (less than
0.5% of the overall execution time). Therefore, we fuse
this computing stage with the DP solving stage described
in the last subsection, and the last active thread in a CTA
is responsible of generating the final output. In order to
8
Figure 5: Backtracking: Parallel scheme
overcome the problem of handling variable-size lists of
stixels, we pre-allocate a fixed amount of Global Memory
for each list.
6 Results
This section assesses the quality results and the perfor-
mance of our proposal. A previous concern is to verify
that our proposal conforms to the algorithm defined so
far, and adopted from [12]. For that purpose, we used
both synthetic and real data. Stereo images generated us-
ing SYNTHIA [14] (like the one shown in Fig. 1) provide
examples with exact disparity maps and free space iden-
tification2. All the experiments using images including
cars, pedestrians, trees, and traffic signals, provided the
expected results.
We have used the manually-labeled data-set provided
by [13] for a preliminary evaluation of our implemen-
tation. We have selected ≈ 1500 stereo images from
the data-subset corresponding to good weather conditions,
since for those images we can generate acceptably good
stereo disparity map estimations at real-time with [7]. We
use the following metrics for comparing the quality of our
stixel estimation with the provided Ground Truth (GT)
stixel results:
• Detection Rate: we consider that a GT stixel has
been detected if the ratio of pixels that intersect with
an estimated object stixel with respect to the size of
the GT stixel is higher than 0.5. We show the propor-
tion of detected versus total number of GT stixels.
• False Positives: a stixel classified as an object is con-
sidered a false positive when more than 30 pixels are
2We thank the authors of SYNTHIA for providing us with the data
Detection Rate 88.7 %
% Image Pairs with False Positives 2.14 %
Total number of False Positives 155
Table 1: Quality results obtained using a publicly avail-
able data-set composed of 1495 image pairs
Figure 6: Example of Stixel World estimation. Sky stixels
are represented on blue, object stixels are represented on
green-to-red (read= close, green= far), and ground stixels
are transparent.
inside the free space determined by the GT stixels.
Table 1 shows the quality results obtained, which in-
dicate that our proposal provides similar results as [12].
A visual example of the stixel configuration computed by
our proposal can be seen in Fig. 6. We were not able to
compare our proposal with the original CPU implementa-
tion, since we could not obtain the stereo disparity maps
used on that work and the metrics were not properly de-
scribed.
We have also used multiple images of different sizes,
both real and synthetic, for the performance analysis. Our
main goal was to evaluate performance on a NVIDIA
Tegra X1 processor integrating 8 ARM cores and 2
Maxwell Streaming Multiprocessors (SMs), and with a
Thermal Design Power (TDP) of 10 Watts. For compar-
ison purpose, we have also measured performance on a
high-end NVIDIA Titan X (Maxwell), with 24 Maxwell
SMs and a TDP of 250W. We discard the time for CPU-
GPU data transfers based on the fact that stixel estimation
is just one stage in a general computation pipeline, which
receives a disparity map from an stereo computation stage
and provides a list of stixels for further post-processing.
Fig. 7 and 8 show, respectively, the performance
9
Figure 7: Frames per second obtained for different image
resolutions (w×h) on NVIDIA Tegra X1 and Titan X, for
s = 5.
throughput (frames per second, or fps) and the perfor-
mance per watt (fps/W) on both GPU systems and also
for different image resolutions. The high-end GPU al-
ways provides more than 11 times the performance of the
embedded GPU (as expected by the difference in number
of SMs), but the latter offers between 1.5 and 2 times more
performance per Watt. It is remarkable that real-time rates
(22.3 fps) are achieved by the Tegra X1 with 1280×480
resolution. Also, a high-end Titan X achieves very high
performance, e.g. around 373 fps with 1280×480 resolu-
tion.
The algorithm implemented by [10] reaches 13.3
frames per second on a multi-core CPU (Core i7 980X,
6×3.4 Ghz, 6 GB of RAM) for a input disparity image
of 1024×440 px and a stixel width of 5 px. Our imple-
mentation reaches 26 fps on a Tegra X1 for that resolution
(and 413 fps on a Titan X). Therefore, performance is im-
proved almost 2 times with respect to [10], while the per-
formance per watt ratio is 25 times better. As expected,
the algorithmic complexity makes execution time to grow
linearly with the image width but quadratically with the
image height.
Figure 8: Energy efficiency (FPS/Watt) for different im-
age resolutions (w×h) on NVIDIA Tegra X1 and Titan X,
for s = 5.
7 Conclusions
This paper has described and assessed the performance of
the -to the best of our knowledge- first GPU-accelerated
implementation of stixel estimation. Results have shown
that our proposal achieves real-time performance for real-
istic problem sizes, proving that the low-power envelope
and remarkable performance of embedded CPU-GPU hy-
brid systems make them good target platforms for most
real-time video processing tasks, paving the way for more
complex and larger applications.
The core of stixel estimation involves a Maximum A
Posteriori (MAP) probabilistic formulation that is solved
using a Dynamic Programming (DP) scheme. We have
proposed a parallel scheme and data layout for this com-
putational pattern that follows general optimization rules
based on a simple GPU performance model, and is then
expected to scale gracefully on the forthcoming GPU ar-
chitectures, like Pascal. Our proposal could be applied to
similar DP computational patterns.
Since performance and low consumption are always
welcome, for example to handle more and larger input
images, we will explore alternative algorithmic strategies
to further improve performance while maintaining good
quality. Also, we will incorporate tracking and cluster-
ing on the GPU-accelerated pipeline, which will open new
opportunities for improving stixel estimation quality.
10
References
[1] H. Badino, U. Franke, and D. Pfeiffer. The stixel world - A
compact medium level representation of the 3D-world. In
Pattern Recognition, 31st DAGM Symposium, Jena, Ger-
many, September 9-11, 2009. Proceedings, pages 51–60,
2009.
[2] R. Benenson, M. Mathias, R. Timofte, and L. J. V. Gool.
Fast stixel computation for fast pedestrian detection. In
Computer Vision - ECCV 2012. Workshops and Demon-
strations - Florence, Italy, October 7-13, 2012, Proceed-
ings, Part III, pages 11–20, 2012.
[3] R. Benenson, R. Timofte, and L. J. V. Gool. Stixels esti-
mation without depth map computation. In IEEE Interna-
tional Conference on Computer Vision Workshops, ICCV
2011 Workshops, Barcelona, Spain, November 6-13, 2011,
pages 2010–2017, 2011.
[4] G. E. Blelloch. Scans as primitive parallel operations.
IEEE Trans. Computers, 38(11):1526–1538, 1989.
[5] M. Enzweiler, M. Hummel, D. Pfeiffer, and U. Franke. Ef-
ficient stixel-based object recognition. In 2012 IEEE In-
telligent Vehicles Symposium, IV 2012, Alcala´ de Henares,
Madrid, Spain, June 3-7, 2012, pages 1066–1071, 2012.
[6] M. Harris, S. Sengupta, and J. D. Owens. Parallel prefix
sum (scan) with cuda. GPU gems, 3(39):851–876, 2007.
[7] D. Hernandez-Juarez, A. Chaco´n, A. Espinosa,
D. Va´zquez, J. C. Moure, and A. M. Lo´pez. Em-
bedded real-time stereo estimation via semi-global
matching on the GPU. In International Conference on
Computational Science 2016, ICCS 2016, 6-8 June 2016,
San Diego, California, USA, pages 143–153, 2016.
[8] H. Hirschmu¨ller. Stereo processing by semiglobal match-
ing and mutual information. IEEE Trans. Pattern Anal.
Mach. Intell., 30(2):328–341, 2008.
[9] M. Muffert, N. Schneider, and U. Franke. Stix-fusion:
A probabilistic stixel integration technique. In Canadian
Conference on Computer and Robot Vision, CRV 2014,
Montreal, QC, Canada, May 6-9, 2014, pages 16–23,
2014.
[10] D. Pfeiffer. The Stixel World - A Compact Medium-
level Representation for Efficiently Modeling Three-
dimensional Environments. PhD thesis, Hu Berlin, 2014.
[11] D. Pfeiffer and U. Franke. Efficient representation of traffic
scenes by means of dynamic stixels. In IEEE Intelligent
Vehicles Symposium (IV), 2010, La Jolla, CA, USA, June
21-24, 2010, pages 217–224, 2010.
[12] D. Pfeiffer and U. Franke. Towards a global optimal multi-
layer stixel representation of dense 3D data. In British
Machine Vision Conference, BMVC 2011, Dundee, UK,
August 29 - September 2, 2011. Proceedings, pages 1–12,
2011.
[13] D. Pfeiffer, S. Gehrig, and N. Schneider. Exploiting the
power of stereo confidences. In 2013 IEEE Conference on
Computer Vision and Pattern Recognition, Portland, OR,
USA, June 23-28, 2013, pages 297–304, 2013.
[14] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and
A. Lopez. The SYNTHIA Dataset: A large collection
of synthetic images for semantic segmentation of urban
scenes. 2016.
[15] L. Schneider, M. Cordts, T. Rehfeld, D. Pfeiffer, M. En-
zweiler, U. Franke, M. Pollefeys, and S. Roth. Semantic
stixels: Depth is not enough. In 2016 IEEE Intelligent
Vehicles Symposium, IV 2016, Gotenburg, Sweden, June
19-22, 2016, pages 110–117, 2016.
[16] P. A. Viola and M. J. Jones. Rapid object detection using
a boosted cascade of simple features. In 2001 IEEE Com-
puter Society Conference on Computer Vision and Pattern
Recognition (CVPR 2001), with CD-ROM, 8-14 December
2001, Kauai, HI, USA, pages 511–518, 2001.
[17] A. Wedel, H. Badino, C. Rabe, H. Loose, U. Franke, and
D. Cremers. B-spline modeling of road surfaces with an
application to free-space estimation. IEEE Trans. Intelli-
gent Transportation Systems, 10(4):572–583, 2009.
11
