The Stixel World is a medium-level, compact representation of road scenes that abstracts millions of disparity pixels into hundreds or thousands of stixels. The goal of this work is to implement and evaluate a complete multi-stixel estimation pipeline on an embedded, energy-efficient, GPU-accelerated device. We present a fully GPU-accelerated implementation of stixel estimation that produces reliable results at 26 frames per second (real time) on the Tegra X1 for disparity images of 1024×440 pixels and a stixel width of 5 pixels, and achieves more than 400 frames per second on a high-end Titan X GPU card.
Introduction
Advanced driver assistance systems (ADAS), autonomous vehicles, robots and other intelligent devices can estimate the distance of objects and the free space in a given scene by computing depth information from stereo camera systems or LIDARs. The large amount of low-level per-pixel depth data is very costly to process, so a medium-level representation known as the stixel world [1] is commonly used for road scenes. It relies on the fact that man-made environments mostly present horizontal and vertical planar surfaces, like roads, sidewalks or soil (horizontal), and buildings, pedestrians or cars (vertical).
Stixels are segments of image columns that represent obstacles. They provide a compact representation that converts millions of disparity pixels into hundreds or thousands of stixels. Pfeiffer and Franke [12] proposed an extended representation that allows multiple stixels per column, providing a richer representation of the scene (see Fig. 1). Stixels are the basis for multiple extensions such as tracking [11], grouping [5] or semantics [15], and also serve as the input for further processing, like pedestrian detection [2].
Calculating stixels is a complex task, comparable to that of generating dense stereo information, and the algorithm implemented on a multi-core CPU by [10] does not fulfill the real-time nor the energy-efficiency requirements of autonomous driving applications. Dedicated hardware designs (e.g. FPGA or ASIC) may achieve these goals, but are very inflexible and costly regarding changes in the algorithms, like combining stixels and semantic segmentation [15]. We explore GPU acceleration as an alternative. The appearance of embedded GPU-accelerated systems, like the NVIDIA Jetson TX1 and DrivePX platforms, opens the door for low-cost, low-energy-consumption, real-time stixel computation. GPUs are very well suited for algorithms exhibiting massive, embarrassingly parallel computation, but may suffer severe performance inefficiencies with algorithms that contain inherent dependencies, such as those using dynamic programming techniques like the stixel algorithm presented by [12]. Careful work distribution and task cooperation, coupled with an appropriate data layout design, may overcome those difficulties and achieve competitive performance. Recently, [7] proved that GPU acceleration of dense stereo computation using Semi-Global Matching, which is mostly based on a 1D dynamic programming algorithm, can be successfully achieved.
The objective of this work is to implement and evaluate a GPU-accelerated software implementation of a complete multi-stixel estimation pipeline, which, to the best of our knowledge, has not been done before. We discuss the optimized massively parallel schemes and data layouts of each of the algorithms involved. Our proposal runs on a single Tegra X1 chip almost two times faster (26 fps versus 13.3 fps) and achieves a 25 times better performance-per-watt ratio than the multi-core implementation in [10], for the same disparity image size (1024×440 px) and stixel width (5 px), providing the same high-quality results. The proposed design achieves 413 fps on a high-end Titan X GPU card, more than 30 times faster than [10] with a similar energy envelope.
Figure 1. Stixel world: taking the dense disparity map as the input, and estimating a certain ground slope and horizon line, columns are segmented into stixels and classified into ground, object and sky categories. Stereo and depth images are generated using SYNTHIA [14].
The remainder of this paper is structured as follows. Section 2 reviews the state of the art on stixel computation. Section 3 reviews the stixel formulation. Section 4 explains general concepts of GPU optimization, while Section 5 explains our proposed GPU-based optimizations for real-time stixel computation. Finally, in Section 6 the accuracy and performance of our proposed method are evaluated.
Related work
The stixel world was introduced by Badino et al. [1] as an intermediate representation suitable for high-level tasks such as object detection or tracking [2, 11]. The ground surface of a scene, also called free space (space without obstacles), is estimated using [17] from a depth map computed with a stereo algorithm such as SGM [8]. Then, dynamic programming is employed to find the bottom of the stixels and their heights.
In order to achieve real-time stixel estimation on CPU, Benenson et al. [3] developed a less accurate method to calculate the ground surface by accumulating matching costs into vertical-disparity space, and then computing the bottom and height of stixels in disparity-cost space. Disparities are computed only for object stixels and not for ground.
The work from [1, 3] considers only a single stixel per column, which represents an incomplete world model that e.g. cannot represent a pedestrian and a building on the same column. Pfeiffer and Franke [12] developed a unified probabilistic approach, also solved with dynamic programming, that considers the occurrence of multiple stixels per column.
The stixel world is the fundamental building block for more informative representations. Some extensions are stixel tracking, which provides a velocity vector for each stixel [11]; stixel grouping, where stixels that seem to belong to the same object are grouped together [5]; and, finally, semantic stixels, which combine the stixel world with semantic segmentation [15]. All these extensions would greatly benefit from real-time stixel computation.
Muffert et al. [9] claim to run their FPGA implementation at 25 fps with stixel widths of 5 px, but the authors do not indicate the image resolution.
The Stixel World
Stixels provide a compact representation of man-made scenes, mainly roads, modeled as horizontal and vertical surfaces, which identify the ground and the objects in the scene, such as buildings, pedestrians, cars, or traffic lights. We follow the stixel world model defined in [10, 12], where the reader can find more details.
Stixels are segments of image columns, with a certain height and distance, that are classified as ground, object or sky. Fig. 1 shows an example with a detailed column that is segmented into 5 different stixels. A comprehensive example can be found at the end of the paper, in Fig. 6 .
A strong assumption of the model is that stixels in different columns are independent. The ground slope and horizon line are assumed to be known for each image. The stixel computation problem is addressed using a unified probabilistic approach that incorporates real-world constraints such as perspective ordering, and is formulated as a MAP estimation problem that delivers an optimal segmentation of the columns with respect to the free-space and obstacle information [12]. A Dynamic Programming (DP) solving scheme incorporates the prior knowledge in order to minimize the global cost of the solution.
The following constraints are modeled as prior probabilities: (1) Bayesian information criterion (BIC): there is a small number of objects in the scene; (2) gravity constraint: flying objects are unusual; (3) ordering constraint: the upper of two staggered stixels classified as objects is expected to be further away; (4) staggering constraint: some configurations, like having a ground stixel above a sky stixel, are impossible; and (5) diving constraint: the base point depth of an object stixel should be equal to or greater than the corresponding ground depth.
Next we will formalize the problem, define the DP recursive equation, and discuss some algorithmic optimizations.
Formal description of MAP problem
The original disparity image D of w×h pixels is preprocessed to convert each column of width s into a column of a single pixel, resulting in a reduced disparity image of (w/s)×h pixels. Since all columns are processed independently, we restrict our description to finding the optimal stixel segmentation of a single column D_i.
A stixel is a segment s_n = {v_n^b, v_n^t, c_n}, where v_n^b and v_n^t are, respectively, the base (beginning) and top (ending) positions in the column (0 ≤ v_n^b ≤ v_n^t ≤ h), and c_n is a label from the class set C = {object, ground, sky}.
Each stixel class defines a different theoretical model for the disparity of each pixel v along the stixel, v_n^b ≤ v ≤ v_n^t. This model is defined as a function f:
• The function for ground stixels is defined to match the expected disparity gradient of the ground surface, whose disparity decreases from the bottom of the image towards the horizon line.
• The function for sky stixels is zero (modeling pixels that are very far away): f_sky(v) = 0.
• The function for object stixels is assumed to have constant disparity, but this value depends on the particular object corresponding to the stixel. We model the function as f_object^n(v) = f_n, where the constant value f_n is computed as the mean of the measured disparities of the considered stixel.
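For reference, the three per-class disparity models can be summarized as follows; the linear ground model with slope α and horizon row v_hor is our notation for the "disparity gradient of the ground surface" mentioned above, not a formula taken verbatim from the source:

$$ f_{ground}(v) = \alpha\,(v_{hor} - v), \qquad f_{sky}(v) = 0, \qquad f^{n}_{object}(v) = f_n $$

with f_n the mean measured disparity inside stixel s_n.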
L ∈ 𝕃 denotes an ordered list of N consecutive, adjacent, and non-overlapping stixels {s_n}, 1 ≤ n ≤ N ≤ h, where 𝕃 is the set of all possible segmentations of the column. The stixel computation problem is then formulated as a Maximum A Posteriori (MAP) estimation problem:
L* = argmax_{L ∈ 𝕃} P(L | D_i)    (Eq. 1)
Data & Smoothness terms
Applying Bayes' theorem, the posterior probability P(L|D_i) in Eq. 1 can be rewritten as P(L|D_i) ∝ P(D_i|L) P(L), where the first term, or data term, is the conditional probability of observing the input column D_i given a labeling L, and the second term, or smoothness term, is the prior probability of the configuration L.
It is assumed that the data term can be computed independently for each pixel in the stixel and for all the stixels in the segmentation. This allows the data term to be expressed as a product over stixels and pixels:
P(D_i | L) = ∏_{s_n ∈ L} ∏_{v = v_n^b}^{v_n^t} P_{D_i}(d_v | s_n, v)
The data term P_{D_i}(d_v | s_n, v) models the probability of a single disparity measurement d_v at row v belonging to a given stixel s_n. A sensor model estimates the likelihood (or cost) that the measured disparity d_v matches the theoretical disparity model f(v) of the given stixel. Following the proposal in [10], we define this model as a combination of a Gaussian and a Uniform distribution.
The Uniform distribution models the probability of an outlier occurrence (p_out is the outlier rate), caused by invalid disparity measurements or by incorrectly matched pixels in the scene; d_range is the total number of disparities considered in the input disparity map. The Gaussian distribution assesses the affinity of the measured disparity with the theoretical disparity function of the stixel; A_norm is a normalization term, and σ_{c_n}(f, v) is a sigmoid function that incorporates the noise model of the disparity measurement.
To avoid numerical problems with the small magnitudes of the individual probabilities and to simplify Eq. 3, the MAP estimation problem is expressed using the logarithm of the likelihoods instead of the actual likelihoods. The maximization problem is thereby converted into a cost minimization problem. Eq. 4 models the data term cost of a given pixel v belonging to a given stixel s_n; the data term cost of a stixel is the aggregation of the costs of all of its pixels.
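As a sketch of the sensor model and cost described above (our reconstruction; the exact normalization and the dependence of σ on f and v may differ from the original equations):

$$ P_{D_i}(d_v \mid s_n, v) = \frac{p_{out}}{d_{range}} + \frac{1 - p_{out}}{A_{norm}\,\sigma_{c_n}(f, v)} \exp\!\left( -\frac{\big(d_v - f_{c_n}(v)\big)^2}{2\,\sigma_{c_n}(f, v)^2} \right) $$

$$ C_{data}(d_v, s_n, v) = -\log P_{D_i}(d_v \mid s_n, v), \qquad C_{data}(s_n) = \sum_{v = v_n^b}^{v_n^t} C_{data}(d_v, s_n, v) $$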
The prior probability, or smoothness term, models the real-world constraints described at the beginning of this section, and is defined as a set of cost tables (log-likelihoods instead of actual probabilities). These constraints only consider the likelihood of the first stixel and the pairwise dependencies of adjacent stixels, C_prior(s_n, s_{n−1}). In our proposal we use the same model and parameters described in [10, 12].
Solving Stixels with Dynamic Programming
Dynamic Programming solves problems by breaking them down into simpler subproblems and storing the partial solutions in a memory structure. This way, when a given subproblem appears again, computation time is saved by retrieving the partial solution from the memory structure instead of solving the same subproblem repeatedly.
The dynamic programming scheme is used to compute the stixel segmentation L = {s_n} with minimum global cost for a column D_i. The global cost is composed of a data term C_data(L) = Σ_{s_n ∈ L} C_data(s_n) and a smoothness term C_prior(L) = Σ_{n=1}^{N} C_prior(s_n, s_{n−1}). For that purpose, we need to express the optimization problem as a composition of smaller subproblems.
In order to simplify our description, we use a special notation for the three different types of stixels considered in this work. OB_k, GR_k, and SK_k denote the minimum aggregated cost of the best segmentation of column D_i from position 0 to position k, both included, for three cases: each case corresponds to a segmentation ending in a stixel of the corresponding class (object, ground, and sky, respectively). The stixel at the end of the segmentation associated with each minimum cost is denoted ob_k, gr_k, and sk_k, respectively. We next show a recursive definition of the problem that can be solved by dynamic programming:
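Eqs. 5 and 6 are not reproduced here; the following is our reconstruction of their shape for the object case, based on the description below (the ground and sky cases are analogous):

$$ OB_0 = C_{data}(\{0, 0, object\}) + C_{prior}(\{0, 0, object\}) \qquad \text{(cf. Eq. 5)} $$

$$ OB_k = \min_{0 \le v^b \le k} \Big[ C_{data}(s) + \min\big( OB_{v^b-1} + C_{prior}(s, ob_{v^b-1}),\; GR_{v^b-1} + C_{prior}(s, gr_{v^b-1}),\; SK_{v^b-1} + C_{prior}(s, sk_{v^b-1}) \big) \Big] \qquad \text{(cf. Eq. 6)} $$

where s = {v^b, k, object} is the candidate stixel and, for v^b = 0, only the first-stixel prior is applied.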
Eq. 5 represents the base case: segmenting a column consisting of the single pixel at the bottom. Eq. 6 indicates how to solve a problem of size k using the solutions of the smaller problems computed so far. We only show the case for object stixels; the other cases are solved similarly. All the possible object stixels ending at position k (and starting at positions from 0 to k) are connected with the last stixel of the minimal-cost segmentations of the corresponding size, which were previously computed. Connections are evaluated for the three stixel classes using the smoothness term (prior model).
All the partial solutions OB_k, GR_k, and SK_k are stored in a cost table C during the solving steps of the recurrent algorithm. As shown by Eq. 6, solving a subproblem of size k using the previous solutions for j, 0≤j≤k, requires considering the k possible positions of a cut between stixels and the 3 possible classes of the stixels. Since the number of classes is constant (3), the complexity of the stixel estimation problem for a single column is O(h×h), and the complexity of the stixel segmentation of the whole disparity image is O(h² × w/s). Once the cost table C is completely calculated, and in order to find the optimal configuration of stixels, a backtracking pass is performed starting from the top row of C, at the minimum value C_min^{h−1} = min(OB_{h−1}, GR_{h−1}, SK_{h−1}). This task is sped up by using an index table, updated during the solving process, that links each stixel to the next stixel with minimum cost.
Using LUTs to optimize performance
The usage of Look-Up Tables (LUTs) reduces the amount of computation and memory accesses required to solve the problem, and ensures that the algorithmic complexity is the one derived above. First, we review some optimizations presented in [10], and then present a new one that provides good results both on CPU and GPU.
Computing the cost of each of the stixels considered when solving the problem using Eq. 4 is very expensive. However, most of the terms in the equation do not depend on the input data and can be pre-computed. Accordingly, only the last term has to be actually computed for each disparity measurement (d_v). We must consider two different cases.
The cost of pixels classified as ground or sky depends only on the current disparity of the pixel. Therefore, the response of the sensor model given by Eq. 4 can be pre-computed for each of the disparities in the input column and stored in a LUT, so the total complexity of computing and storing the cost for ground and sky is O(h × w/s). The cost of a pixel classified as object, however, depends not only on the current disparity of the pixel but also on the mean disparity of the segment. In order to limit the total number of possible input combinations and to reduce the complexity of pre-computing all the corresponding cost values, we round the mean disparities to integer values. This approach provides satisfactory quality while reducing the number of pre-computed values to all the combinations of the h disparity values in each input column and the d_range possible average disparities, which account for a total of h × d_range values per column.
The algorithmic improvement with the highest impact on performance comes from the use of prefix sums [4] to reduce the total number of operations needed to calculate the total cost of the pixels of a stixel. The prefix sum of a vector of numbers is a new vector whose k-th element holds the sum of the first k elements of the input. Prefix sums extended to 2D or 3D matrices are known in the field of image processing as integral images or summed-area tables [16].
The LUTs described so far contain the prefix sum of the costs corresponding to each pixel in the input column. Then, calculating the cost of a stixel s_n = {v_n^b, v_n^t, c_n} is done in constant time just by subtracting two entries of the table, independently of the size of the stixel. The LUTs of the pre-computed costs for ground or sky stixels are indexed just by the positions of the first and last pixels of the stixel: C_data(s_n) = LUT_{c_n}(v_n^t) − LUT_{c_n}(v_n^b − 1). The LUT of the pre-computed costs for object stixels is indexed by the positions of the stixel and also by the average disparity of the stixel (f_n): C_data(s_n) = LUT_object(f_n, v_n^t) − LUT_object(f_n, v_n^b − 1). Again, the average disparity of a given stixel, f_n, is computed in constant time using a pre-computed prefix sum of the disparities of the pixels in the processed column.
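As an illustration of this constant-time lookup, the following sketch shows how a per-column cost evaluation could be written once the prefix-sum LUTs are available (the function and array names are hypothetical, not taken from the released implementation):

```cuda
// Hypothetical device-side helpers; lut_ground/lut_sky hold prefix sums of
// per-pixel costs for one column, lut_object holds one prefix-sum row per
// rounded mean disparity, and disp_prefix holds the prefix sum of disparities.

__device__ float segment_sum(const float* prefix, int v_b, int v_t) {
    // Sum of elements in [v_b, v_t] from an inclusive prefix-sum array.
    return prefix[v_t] - (v_b > 0 ? prefix[v_b - 1] : 0.0f);
}

__device__ float stixel_cost_ground(const float* lut_ground, int v_b, int v_t) {
    return segment_sum(lut_ground, v_b, v_t);           // O(1), any stixel size
}

__device__ float stixel_cost_object(const float* lut_object, const float* disp_prefix,
                                    int v_b, int v_t, int h) {
    // Mean disparity of the segment, rounded to an integer index.
    float mean = segment_sum(disp_prefix, v_b, v_t) / (v_t - v_b + 1);
    int f_n = __float2int_rn(mean);
    // One prefix-sum row of length h per rounded mean disparity.
    return segment_sum(&lut_object[f_n * h], v_b, v_t);  // O(1), any stixel size
}
```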
In our design we propose a new LUT containing the pre-computed costs for all possible pairs of pixel disparity and mean disparity (a 2D matrix of d_range × d_range elements). Since the contents of this table are independent of the input data, the table can be computed off-line, which avoids executing any computation from Eq. 4 during the normal process of stixel estimation. We have experimentally verified that this new LUT improves the performance both on CPU and on GPU.
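A minimal sketch of how such an off-line table could be filled follows; the function name and parameters are ours, and a constant sigma is used for simplicity, whereas the model in the text uses the sigmoid σ_{c_n}(f, v):

```cuda
#include <cmath>
#include <vector>

// Hypothetical host-side precomputation of the (d_range x d_range) cost table:
// negative log-likelihood of observing disparity d when the rounded mean
// disparity of an object stixel is f. Computed once and uploaded to the GPU.
std::vector<float> build_object_cost_lut(int d_range, float p_out, float sigma) {
    const float SQRT_2PI = 2.5066283f;
    std::vector<float> lut(d_range * d_range);
    const float uniform = p_out / d_range;
    const float gauss_scale = (1.0f - p_out) / (SQRT_2PI * sigma);
    for (int f = 0; f < d_range; ++f)
        for (int d = 0; d < d_range; ++d) {
            float diff = (float)(d - f);
            float gauss = gauss_scale * std::exp(-0.5f * diff * diff / (sigma * sigma));
            lut[f * d_range + d] = -std::log(uniform + gauss);   // cost of pair (f, d)
        }
    return lut;
}
```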
To summarize, some LUTs are computed off-line, while the LUTs containing prefix sums must be computed for each new disparity image. Most of the computational work of generating the LUTs corresponds to the creation of the two-dimensional object LUTs, with a complexity of O(h × d_range × w/s) operations per input image. This complexity is comparable to that of the dynamic programming step as long as h is greater than or equal to d_range, which is often the case.
GPU architecture and performance
GPUs are massively parallel devices including tens of throughput-oriented processing units called streaming multiprocessors (SMs). In order to save energy and transistor budget, memory and compute operations are executed as highly pipelined vector (SIMD) instructions. SMs can execute several SIMD instructions per cycle, selected from several independent execution flows: the higher the available thread-level parallelism, the better the pipeline utilization. The CUDA programming model allows defining a massive number of potentially concurrent execution instances (called threads) of the same program code. A unique two-level identifier <ThrId, CTAid> is used to specialize each thread for particular data and/or functions. A CTA (Cooperative Thread Array) comprises all the threads with the same CTAid, which run simultaneously and until completion in the same SM, and can share a fast but limited memory space: the so-called Shared Memory. Warps are groups of threads with consecutive ThrIds in the same CTA that are mapped by the compiler to vector instructions and, therefore, advance their execution in a lockstep, synchronous way. The warps belonging to the same CTA can synchronize using an explicit barrier instruction. Each thread has its own private Local Memory space (commonly assigned to registers by the compiler), while a large space of Global Memory is public to all execution instances (mapped into a large-capacity but long-latency device memory, which is accelerated using a two-level hierarchy of cache memories).
Figure 2. Column Reduction and Transpose of the input disparity image: parallel scheme and computational analysis.
The parallelization scheme of an algorithm and the data layout determine the available parallelism at the instruction and thread level (required for achieving full resource usage) and the memory access pattern. GPUs achieve efficient memory performance when the set of addresses generated by a warp refer to consecutive positions that can be coalesced into a single, wider memory transaction. Since the bandwidth of the device memory can be a performance bottleneck, an efficient CUDA code should promote data reuse on shared memory and registers.
Massive Parallelization
This section describes and discusses the parallelization schemes and data layouts used for the algorithms involved in the stixel computation pipeline.
Column Reduction and Transpose
Stixels are computed on columns of width s and height h. The first step in the pipeline reduces the width of each column to a single pixel by replacing the disparities of s consecutive pixels in the same row with their average. The input disparity image arranges pixels into consecutive rows of memory (row-wise), but this is not the appropriate data layout for the later tasks, where information is processed by columns. Therefore, we fuse a transpose operation with the column reduction operation into a single algorithmic step. The scheme of the operation is shown in Fig. 2.
The data access pattern of the algorithm exhibits no data reuse and, therefore, thread cooperation is not required. As shown in Fig. 2, work is distributed by assigning to each thread the task of reducing the disparities of s consecutive pixels into a single output disparity value, which must be stored in a transposed position. Each thread reads s consecutive input disparities, computes their mean, and writes one result. We further assign consecutive threads of a warp to consecutive output positions, so that writes are coalesced and write performance is maximized. Reads, though, are not coalesced (a typical situation in transpose operations) and perform sub-optimally. We improve read performance by reading consecutive pixels in groups of 4 or 2 when possible. A collaborative read strategy using Shared Memory would slightly improve performance but, since this processing step represents a very small fraction of the pipeline time (≤ 0.5%), we preferred a simple solution. A sketch of this kernel is shown below.
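A minimal sketch of the fused reduce-and-transpose kernel, assuming a simple arithmetic mean and ignoring the vectorized 4/2-wide reads mentioned above (the kernel and parameter names are ours, not from the released code):

```cuda
// Fused column reduction and transpose. Each thread averages s consecutive
// disparities of one row and stores the result in the transposed layout, where
// each reduced image column is contiguous in memory. Consecutive threads of a
// warp handle consecutive rows, so stores are coalesced while loads are not.
__global__ void reduce_transpose(const float* __restrict__ disp,  // w x h, row-major
                                 float* __restrict__ out,         // (w/s) columns of length h, contiguous
                                 int w, int h, int s) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // image row, 0..h-1
    int col = blockIdx.y;                             // reduced column, 0..w/s-1
    if (row >= h) return;

    float sum = 0.0f;
    for (int i = 0; i < s; ++i)                       // strided, non-coalesced loads
        sum += disp[row * w + col * s + i];

    out[col * h + row] = sum / s;                     // coalesced store (transposed)
}
```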
Pre-computation of LUT_object
As explained in subsection 3.4, a specific look-up table for the object data term (LUT_object) has to be generated for each input column D_i, for a total of O(h × d_range × w/s) output values. Since LUT_object is too large to fit into Shared Memory, it must be written to Global Memory, and thus we can isolate this task from the rest of the processing pipeline without losing performance.
As shown in Fig. 3, work is distributed by assigning to a single warp the task of generating one row of the LUT_object matrix corresponding to a single input column D_i. Warps in the same CTA cooperate by reading the input column into Shared Memory. Then, each warp computes the prefix sum of the cost vector corresponding to one row in the LUT. The warp iterates over the h elements of D_i, processing warp_size = 32 elements in each iteration step. Data read from LUT_cost is used straightforwardly to compute the prefix sum directly in registers (Local Memory), using register-to-register shuffle instructions, thereby avoiding Shared Memory reads and writes. No explicit synchronization is required when operating warp-wise, as shown by [6].
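The warp-wise prefix sum can be expressed with register shuffles along the following lines (a generic sketch using the warp-synchronous primitives of current CUDA, not the exact code of the implementation):

```cuda
// Inclusive prefix sum of one value per lane, entirely in registers.
// Each warp calls this repeatedly, adding the running total of the previous
// 32-element chunk, to build the prefix sum of a whole LUT row.
__device__ float warp_inclusive_scan(float x) {
    unsigned mask = 0xffffffffu;
    for (int offset = 1; offset < 32; offset <<= 1) {
        float y = __shfl_up_sync(mask, x, offset);    // value from lane (lane - offset)
        if ((threadIdx.x & 31) >= offset) x += y;     // lanes below 'offset' keep their value
    }
    return x;
}
```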
The performance bottleneck of this stage, which represents less than 5% of the time of the whole pipeline, is the write bandwidth to the external device memory.
Dynamic Programming (DP) stage
This is the most time-consuming (≥ 95%) processing step, and also the most elusive for massive parallelization. Each input column D i can be processed independently to generate a list of estimated stixels, but this amount of parallelism is not enough to efficiently exploit current GPUs for the image sizes considered in most real-world applications. The challenge is to extract fine-grain parallelism inside the DP task corresponding to each input column.
Processing column D_i involves multiple repeated reads of all the corresponding LUTs, and reading and updating the contents of the corresponding cost table C_i, as shown (yellow) in Fig. 4. In order to improve data access performance we promote data reuse by assigning a separate Cooperative Thread Array (CTA) of h threads to each DP task. Before performing the actual DP solving process, threads cooperate to copy some general LUTs from Global Memory to Shared Memory and to compute the prefix sums of LUT_ground, LUT_sky and D_i into Shared Memory using [6]. LUT_object_i is the only data structure that does not fit into Shared Memory and must be accessed from Global Memory.
The DP recurrence shown in Eq. 6 defines the minimum cost of a problem with k pixels as a function of the cost of smaller problems. We distribute the DP task by assigning to each CTA thread the calculation of the minimum cost for one problem size k (0≤k<h). Two issues derived from Eq. 6 hinder parallel execution: (1) the work assigned to each thread is not well balanced, since it is proportional to k (see the sequence of steps depicted in Fig. 4); and (2) there are data dependencies that must be preserved. The cost table C_i (in Shared Memory) is used to communicate partial results between threads, and barrier synchronization is used to enforce dependencies among consecutive DP steps. Also, every step of the DP solving task decreases the number of active parallel threads. A simplified sketch of this scheme is shown below.
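A highly simplified skeleton of the per-column DP kernel, only to illustrate the work distribution and synchronization pattern (the cost evaluation and the prior model are abstracted away in comments; this is our sketch, not the released implementation, and it assumes blockDim.x == h):

```cuda
// One CTA of h threads processes one reduced column. Thread k owns problem
// size k: at step j it considers a candidate stixel [j, k] for the three
// classes. Partial minima live in Shared Memory and a barrier separates
// consecutive DP steps; the number of active threads shrinks as j grows.
__global__ void stixel_dp_column(/* LUT pointers omitted */ int h) {
    extern __shared__ float cost[];          // 3*h entries: OB_k, GR_k, SK_k
    int k = threadIdx.x;                     // problem size owned by this thread
    float best_ob = 1e30f, best_gr = 1e30f, best_sk = 1e30f;

    for (int j = 0; j < h; ++j) {            // DP step: candidate base position j
        if (k >= j) {
            // Combine the data cost of candidate stixels [j, k] (object, ground,
            // sky), read in O(1) from the prefix-sum LUTs, with the prior cost
            // and the minima stored for size j-1, updating best_* (omitted).
        }
        if (k == j) {                        // result for problem size k is now final
            cost[3 * k + 0] = best_ob;
            cost[3 * k + 1] = best_gr;
            cost[3 * k + 2] = best_sk;
        }
        __syncthreads();                     // make size-j results visible to all threads
    }
    // Backtracking (fused in the same kernel) starts from row h-1 of 'cost'.
}
```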
Synchronization barriers between recurrent steps, reduced warp parallelism in the CTA as the recurrence loop advances (on average, half the warps are active in each CTA), and moderate warp divergence (on average, half the threads are active in the last warp) prevent using the available computation resources efficiently, making performance latency-bound.
Backtracking
The backtracking step is an inherently sequential process (for each column). As described in subsection 3.3, it navigates back through an index table created during the DP solving stage (not shown in the figures) and produces the list of stixels with the optimal configuration for the column (see Fig. 5).
The lack of parallelism in this final step seems to discourage a GPU implementation, but we have found that the time to transfer the resulting index tables to the CPU, or even from Shared Memory to Global Memory, is higher than the time to perform the task on the GPU (less than 0.5% of the overall execution time). Therefore, we fuse this computing stage with the DP solving stage described in the previous subsection, and the last active thread in a CTA is responsible for generating the final output. In order to overcome the problem of handling variable-size lists of stixels, we pre-allocate a fixed amount of Global Memory for each list.
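For illustration, the single-thread backtracking pass could look roughly like this (performed by one thread at the end of the DP kernel; the index-table layout and the Stixel struct are our assumptions):

```cuda
struct Stixel { int vB, vT, cls; };          // base row, top row, class label

// index[c * h + k] (assumed layout) stores, for the best segmentation of size k
// ending in class c, the base row of its last stixel (x) and the class of the
// previous segment (y), as filled in during the DP stage.
__device__ int backtrack(const int2* index, const float* topCost,
                         int h, Stixel* out, int maxStixels) {
    // Start from the class with minimum aggregated cost at the top row.
    int cls = 0;
    if (topCost[1] < topCost[cls]) cls = 1;
    if (topCost[2] < topCost[cls]) cls = 2;

    int top = h - 1, n = 0;
    while (top >= 0 && n < maxStixels) {
        int2 e = index[cls * h + top];       // e.x = base row, e.y = previous class
        out[n++] = Stixel{e.x, top, cls};
        cls = e.y;
        top = e.x - 1;                       // continue below the current stixel
    }
    return n;                                // number of stixels produced
}
```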
Results
This section assesses the quality and the performance of our proposal. A first concern is to verify that our implementation conforms to the algorithm defined so far and adopted from [12]. For that purpose, we used both synthetic and real data. Stereo images generated using SYNTHIA [14] (like the one shown in Fig. 1) provide examples with exact disparity maps and free-space identification. All the experiments using images including cars, pedestrians, trees, and traffic signals provided the expected results.
We have used the manually labeled data set provided by [13] for a preliminary evaluation of our implementation. We selected ≈1500 stereo images from the subset corresponding to good weather conditions, since for those images we can generate acceptably good stereo disparity map estimations in real time with [7]. We use the following metrics to compare the quality of our stixel estimation with the provided ground-truth (GT) stixel results:
• Detection Rate: we consider that a GT stixel has been detected if the ratio of its pixels that intersect with an estimated object stixel, with respect to the size of the GT stixel, is higher than 0.5. We report the proportion of detected GT stixels over the total number of GT stixels.
• False Positives: a stixel classified as an object is considered a false positive when more than 30 of its pixels fall inside the free space determined by the GT stixels.
Table 1 shows the quality results obtained, which indicate that our proposal provides results similar to [12]. A visual example of the stixel configuration computed by our proposal can be seen in Fig. 6. We were not able to compare our proposal directly with the original CPU implementation, since we could not obtain the stereo disparity maps used in that work and the metrics were not precisely described.
We have also used multiple images of different sizes, both real and synthetic, for the performance analysis. Our main goal was to evaluate performance on an NVIDIA Tegra X1 processor, which integrates 8 ARM cores and 2 Maxwell Streaming Multiprocessors (SMs) and has a Thermal Design Power (TDP) of 10 Watts. For comparison purposes, we have also measured performance on a high-end NVIDIA Titan X (Maxwell), with 24 Maxwell SMs and a TDP of 250 W. Since the embedded Tegra X1 GPU uses the same physical memory for both the CPU (host) and the GPU (device), it is not necessary to explicitly move data between host and device. In contrast, the host-device transfer time on the system with the high-end GPU represents approximately 9% of the total computation time, which can be asynchronously overlapped with the computation of the previous/next image frame. In many scenarios, stixel estimation is just one stage of a larger computation pipeline that receives a disparity image already resident in device memory, generated by a previous GPU-accelerated stereo computation stage.
Figs. 7 and 8 show the performance throughput (frames per second, fps) and the performance per watt (fps/W) on both GPU systems for different image resolutions. The high-end GPU always provides more than 11 times the performance of the embedded GPU (as expected from the difference in the number of SMs), but the latter offers between 1.5 and 2 times more performance per watt. We use the TDP as an estimation of the actual energy consumption of each GPU. It is remarkable that real-time rates (22.3 fps) are achieved by the Tegra X1 at 1280×480 resolution. The high-end Titan X achieves very high performance, e.g. around 373 fps at 1280×480 resolution.
The algorithm implemented by [10] reaches 13.3 frames per second on a multi-core CPU (Core i7 980X, 6×3.4 GHz, 6 GB of RAM, TDP of 130 W) for an input disparity image of 1024×440 px and a stixel width of 5 px. Our implementation reaches 26 fps on a Tegra X1 for that resolution (and 413 fps on a Titan X). Therefore, performance is improved almost 2 times with respect to [10], while the performance-per-watt ratio is 25 times better. As expected from the algorithmic complexity, execution time grows linearly with the image width and quadratically with the image height.
Conclusions
This paper has described and assessed the performance of, to the best of our knowledge, the first GPU-accelerated implementation of stixel estimation. Results have shown that our proposal achieves real-time performance for realistic problem sizes, proving that the low-power envelope and remarkable performance of embedded CPU-GPU hybrid systems make them good target platforms for most real-time video processing tasks, paving the way for more complex and larger applications.
The core of stixel estimation involves a Maximum A Posteriori (MAP) probabilistic formulation that is solved using Dynamic Programming (DP). We have proposed a parallel scheme and data layout for this computational pattern that follow general optimization rules based on a simple GPU performance model, and are therefore expected to scale gracefully on forthcoming GPU architectures. Our proposal can be applied to similar recurrent patterns with diminishing fine-grained parallelism.
Since higher performance and lower consumption are always welcome, for example to handle more and larger input images, we will explore alternative algorithmic strategies to further improve performance while maintaining good quality. We will also incorporate tracking and clustering into the GPU-accelerated pipeline, which will open new opportunities for improving stixel estimation quality.
