Efficient code generation for image processing applications continues to pose a challenge in a domain where high performance is often necessary to meet real-time constraints. The inherently complex structure found in most image-processing pipelines, the plethora of transformations that can be applied to optimize the performance of an implementation, as well as the interaction of these optimizations with locality, redundant computation and parallelism, can be indentified as the key reasons behind this issue. Recent domain-specific languages (DSL) such as the Halide DSL and compiler attempt to encourage high-level design-space exploration to facilitate the optimization process. We propose a novel optimization strategy that aims to maximize producer-consumer locality by exploiting reuse in image-processing pipelines. We implement our analysis as a tool that can be used alongside the Halide DSL to automatically generate schedules for pipelines implemented in Halide and test it on a variety of benchmarks. Experimental results on three different multi-core architectures show an average performance improvement of 40% over the Halide Auto-Scheduler and 75% over a state-of-the art approach that targets the PolyMage DSL.
10:2 S. Sioutas et al.
final implementation needs to be highly optimized for the target platform. Traditional optimizations usually include a series of loop transformations, such as tiling and loop fusion, as well as vectorization and parallelization and aim to exploit locality (spatial and temporal), data-level parallelism, and task-level parallelism, respectively. However, manually applying these transformations significantly reduces code readability and portability and discourages high-level design-space exploration.
Recently, domain-specific languages (DSLs) such as Halide [21] and PolyMage [16] were introduced to facilitate the optimization process in high-performance image-processing applications. These DSLs allow developers to express applications in a more abstract format while maintaining the ability to apply low-level optimizations and transformations on the final code. The benefits of these approaches can be invaluable in the case of image-processing pipelines where a combination of optimizations including stage interleaving or stage fusion, tiling, vectorization, and parallelization are necessary to achieve high performance.
An image-processing pipeline can be defined as a series of functional stages, where each stage contains an arbitrary number of nested loops and depends on data produced during an earlier stage. As a result, interleaving the computation of these stages can offer significant performance improvement by exploiting producer/consumer locality and ensuring that intermediate buffers are kept inside the local caches or registers. Both Halide and PolyMage employ techniques that allow for the automatic optimization of such imaging pipelines. The Halide Auto-Scheduler [15] attempts to group stages together and evaluates an effective tiling in each group. PolyMage can use both auto-tuning to search parts of the design space as well as a recently introduced model-driven approach [10] . This new approach quickly attempts to fuse stages and extends the search space to cover more solutions than the previous auto-tuning method. However, all three techniques [10, 15, 16] focus on the interleaving of the computation of each stage using overlapping tiles and therefore lead to solutions with limited reuse possibilities and often miss sliding window opportunities.
In this article, we present a novel optimization strategy for image-processing pipelines that considers stage fusion for maximum producer/consumer locality in conjunction with tile size selection while evaluating reuse possibilities not considered in previous state-of-the-art approaches. Our technique is driven by an analytical model that takes relevant application and architecturespecific parameters (such as the number of cores/threads, cache size, interaction with hardware prefetching) into account and is capable of producing optimized schedules within seconds, even for complex pipelines with a large number of stages. We implement it as a tool to be used with the Halide DSL as an alternative cost model and analysis to the Halide Auto-Scheduler and evaluate it across a variety of benchmarks and target platforms. We compare our solutions to the ones produced by the Halide Auto-Scheduler, the manual solutions given for the Halide DSL on the same benchmarks when applicable, as well as the ones produced by PolyMage (using both the original auto-tuned method, as well as the DP-fusion technique implemented in Reference [10] ) on the same target architectures. We observe a substantial performance improvement across all platforms and architectures.
It is important to remark that our technique is not restricted to Halide. It can be used with other DSLs and general-purpose compilers that target image processing or tensor or linear algebra applications and offer control over the production and consumption of pipeline stages as well as the allocation of intermediate buffers.
The rest of this work is organized as follows: Section 2 discusses related work. Section 3 gives a motivational example, while Section 4 presents our proposed optimization technique in detail. Section 5 showcases the experimental results that were obtained. Conclusive remarks are finally discussed in Section 6.
RELATED WORK
In this section, we discuss prior related work. We identify the limitations of traditional loop fusion and tiling techniques used in general-purpose languages when optimizing image-processing pipelines and investigate some of the benefits of recent image-processing domain-specific languages.
General-purpose Languages
Loop fusion in conjunction with tiling has been extensively studied in the past, especially in the case of general-purpose compilers. Most of these approaches focus on exploiting data locality while maintaining parallelism in applications dominated by linear algebra or stencil computations [11, 17, 25, 27, 28] . More specifically, the authors in Reference [28] propose a hierarchical tiling technique for iterative solver applications to reduce communication overhead without introducing severe redundant computation. In Reference [17] , the effects of various inter-loop optimization strategies on PDE solvers are investigated.
In Reference [4] , the authors propose an optimization strategy for compute-intensive multidimensional summations that involve products of several arrays. They investigate the effects of loop fusion and tiling in such applications while also reducing the memory footprint of intermediate temporary buffer requirements.
Other approaches have focused on enabling loop fusion in applications with complex data dependencies between loop iterations [14, 26] . The authors of Reference [26] propose a technique that eliminates fusion-preventing dependencies by means of loop tiling and array copying. After iteratively applying the aforementioned method to multiple loop nests, a single equivalent nested loop can be formed that can be tiled for cache locality. In a similar fashion, Reference [14] proposes a way of mitigating the presence of fusion-preventing dependencies, while maintaining parallelism and eliminating cache conflicts in the subsequent fused loops.
However, all aforementioned methods involve traditional loop fusion techniques that target time-iterated stencils, the scope of which differs from the complex multi-dimensional problems defined in the context of image-processing pipelines, a term that covers all applications within the scope of this work. Stages in these pipelines perform various data-parallel computations before having their output consumed by the next stage, which in turn executes a different computation or stencil.
Domain-specific Languages
Recently, DSLs have emerged that enable quick design-space exploration in the image-processing domain. These DSLs provide high-level abstractions in the definitions of the functional steps inside the pipeline, as well as the ability to apply optimizations on the generated code to ensure high performance on the final implementation.
Tensor Comprehensions [24] is an example of a recent DSL that targets deep learning applications such as convolutional and recurrent neural networks. It consists of a high-level language with syntax that resembles the mathematics of deep learning and a Just-In-Time polyhedral compiler for CUDA-based GPU architectures. It employs an auto-tuner to automatically generate efficient polyhedral schedules.
PolyMage [16] is another DSL for image-processing applications that uses a dataflowlike language to describe pipelines. It employs polyhedral transformations [9, 13, 19] to optimize the computations performed by the functional stages of the pipeline with a grouping-then-tiling approach. More specifically, it relies on auto-tuning over various tile dimensions, which are all powers of two, to decide which stages of the pipeline will be grouped together. It then applies polyhedral optimizations on each group to generate the final nested loops. An alternative optimization strategy for pipelines implemented in PolyMage was introduced in Reference [10] . This method introduced a dynamic fusion and tiling model that extends the search space to tile sizes that are not powers of two and resolves the need for auto-tuning. However, due to the nature of the analysis that is used in the PolyMage compiler, its application scope is limited to stencil computations and up/down sampling.
Halide [21] is perhaps the most prominent of the DSL attempts. Halide separates the algorithmic description of a pipeline from its optimization schedule. Image-processing pipelines in Halide are defined as directed acyclic graphs, where each node of the graph represents a functional stage. Each stage is equivalent to a Halide function, which specifies all producer/consumer relations at the specific stage. Furthermore, the functional description of the pipeline is independent of its optimization schedule. In other words, the Halide functions define the relations and dependencies between the stage of the pipeline but do not influence the way the stages will get executed. As a result, the optimization schedule can control both the order of execution within a single stage, as well as the way the computation of stages gets interleaved during the execution of the pipeline. Figure 1 (b) shows a simple two-stage blur filter implemented in Halide, along with its optimization schedule. Given this schedule, the compiler will tile the loop of the blury filter using a tile size of 256 × 32, vectorize the innermost intra-tile loop (xi) using vectors of size 8, and parallelize its outermost inter-tile loop (yo). Furthermore, the computation of the blurx stage will be interleaved on a per-tile basis and its innermost loop will also be vectorized using vectors of size 8. In other words, before each intra-tile loop iteration, Halide will first allocate buffer space and compute all pixels of blurx that will get consumed during this iteration. The equivalent loop-nest in pseudo-C can be seen in Figure 1 (c).
Halide initially employed an auto-tuning framework to automatically generate optimized schedules for pipelines [21] that required an extensive amount of time to derive an adequate schedule. A more generic auto-tuning approach that is driven by genetic algorithms was proposed in the auto-tuning framework Opentuner [2] . This framework was able to generate efficient schedules in less time for small pipelines (e.g., bilateral grid), but fails to converge to a good solution for larger, more-complex problems.
Currently, Halide uses a heuristic-based Auto-Scheduler that was initially proposed in [15] but then received an updated cost model by the Halide community [7] . This method uses a greedy grouping algorithm to group stages of the pipeline together to maximize producer/consumer locality and applies tiling to the output stage of each group independently. However, the grouping strategy excludes parts of the design space and considers only a limited number of tile sizes, and its analysis does not cover buffer allocation and storage scheduling. As a result, while it can quickly produce schedules within seconds, it misses interesting solutions of the design space that may benefit from sliding window opportunities. Those missed solutions may, however, allow for better SIMD vector unit utilization and better exploitation of the hardware prefetchers.
Recently, another analytical model was introduced in Reference [23] that automatically schedules kernels in Halide. This method takes hardware prefetching into account and involves a hierarchical tiling approach, but it is limited to single-stage pipeline and -stage fusion falls outside its analysis.
Our method considers both the compute as well as the storage levels of a stage while determining its final optimization schedule. We show that by taking both compute and store level into account, we can reduce the amount of intermediate temporary buffer space required, which in return allows for different grouping and tiling options as well as increased producer/consumer locality. Furthermore, our analytical model takes hardware prefetching inherently into account and investigates tile sizes in a larger scope.
MOTIVATIONAL EXAMPLE AND PROBLEM FORMULATION
In this section, we use the blur pipeline seen in Figure 1 as a motivational example to demonstrate the limitations of current state-of-the-art approaches, as well as the idea behind our work.
As already mentioned, optimizing an image-processing pipeline usually involves dealing with a complex tradeoff among parallelism, locality, and recomputation. The transformations that are often considered include a combination of loop interchange, splitting, fusion, parallelization, and vectorization. Choosing a proper fusion strategy for each stage in a pipeline has a significant effect on the performance of the final implementation. Figure 2 shows three example schedules for the blur pipeline in pseudo-C syntax. The amount of reuse or recomputation, as well as the size of the intermediate buffer that is required can be controlled through the combination of various loop permutations, tile sizes, and levels at which we compute and store each stage of the pipeline. For example, the solution shown in Figure 2 (a) computes all necessary pixels in blurx before consuming them to compute blury. Such a schedule avoids all recomputation but suffers from poor locality and a large intermediate buffer (depending on the problem size). However, fully inlining the producer (blurx ) into its consumer (blury) increases locality but at the cost of the highest recomputation.
Current state-of-the-art approaches (e.g., the current Halide Auto-Scheduler) only consider scheduling options where compute and store are set to the same level of a loop nest. As an example, consider the schedule seen in Figure 2 (b). In this case, tiling the iteration space of blury and fusing its producer into the innermost inter-tile loop (xo) allows for an intermediate solution that offers increased locality compared to the fully stored implementation and less recomputation than the fully inlined one. Furthermore, it does not hinder parallelization of the outermost 10:6 S. Sioutas et al.
inter-tile loop, since the computation of blurx is interleaved at a lower level than the parallel loop. We can quantify the amount of intermediate storage (B blur x ) needed as well as the amount of recomputation (R blur x ) in such a schedule for arbitrary tile dimensions:
where T x ,T y are the tile sizes in the x and y dimensions and v y is the amount of overlap between blurx and its consumer blury in the y dimension (in this example v y = 2),
where Cp
is the total computation cost of blurx in this fusion scenario and C root blur x
is the cost when all of its pixels are computed and stored before being consumed (as in Schedule (a)).
More specifically, Cp
is the cost of blurx when fused into blury with its computation (subscript xo) and allocation (superscript xo) set to the xo level/index of the loop-nest of blury.
is the cost of computing and storing blurx outside the loop nest of the consuming stage blury,
where B x , B y are the problem sizes (loop bounds) in the x and y dimensions, respectively. 1 Similar equations can be used to calculate the load cost (C l ) for blurx, which is equivalent to the load cost for the input data (C blur x input ). In detail:
In the presence of a streaming hardware prefetcher, 2 the previous equation becomes
where we eliminate the sequential accesses across cache lines. Finally, the amount of data that needs to stay in the cache to benefit from input data reuse is
Such a schedule benefits from increased locality compared to the one with root storage. Furthermore, as seen in the above equations, the tradeoff between redundant computation and locality can be controlled by tuning the applied tile dimensions. It should also be noted that a different inter or intra-tile loop permutation leads to different buffer requirements and cost-functions. Figure 3 shows an implementation where computation and storage are set to different levels. Such schedules benefit from sliding window opportunities that usually enable the folding of intermediate buffers without reducing the amount of data reuse. As an example, consider the buffer requirements for this schedule (all of the other costs will be the same as in schedule 2(b)). Starting from Equation (1) (since the store level remains the same), we can calculate the memory footprint of the producer blurx stage as follows:
Since pixels of blurx are now computed per line of the output tile (yi), we do not need to keep all of them in the intermediate buffer but only those that can be reused across intra-tile iterations (or across one xo iteration). Therefore, B blur x can be folded down to a circular buffer of size:
The same holds for the input data buffer, which will now be
and will be folded down to:
Figure 3(b) shows a visual representation for a small 8 × 8 output image with an applied tile size of 4 × 4 (B x = 8, B y = 8, T x = 4, and T y = 4). Unlike Figure 3 (a), pixels of the producing stage blurx are produced per line (yi) of the consuming stage blury as needed. For example, three lines of width equal to Tx will be computed during the first yi iteration to produce one tile row, but only one line of blurx will need to be computed for yi > 0, since two lines may be reused. Pixels that are no longer needed (cannot be reused across one iteration of xo) are discarded.
Note that the above schedule does not ensure maximum folding of the intermediate buffer allocated for blurx. For example, consider the schedule seen in Figure 4 (a), interchanging the loop such that the ordering (from innermost to outermost) is (yi, xi, yo, xo), setting the compute level of blurx to yi, and its storage to y would allow the buffer to get folded down to just the amount of overlap across y without any extra recomputation compared to the previous schedule:
However, as it can also be seen from Figure 4 (b), computing blurx at the innermost level of its consumer causes the loading of the input buffer to be much less efficient. In detail, since input is accessed in a column major order (three horizontal pixels at a time are needed to produce one pixel of blurx), prefetched (consecutive) cache lines will only be used after Ty iterations or will not even be used at all if Ty is too large, and they get evicted from the cache. As a result, the input load cost is now equal to:
where the T x factor can no longer be simplified, since accesses to input are not consecutive and the schedule does not benefit from hardware prefetching (as much as the previous one). For reference, the previous schedule (Figure 2(c) ) performs twice as fast compared to this one, even though it does not maximize folding. Based on the above (Equations (9) and (11) and Figure 3 ), we can conclude that folding the intermediate buffers leads to much smaller local memory requirements without sacrificing data reuse or increasing the amount of redundant computations. As a result, solutions that were previously not considered, e.g., tile sizes that led to large memory footprints can now easily be captured by separating the computation and storage of a stage. However, as seen by comparing Equations (13) and (6), maximum folding does not always ensure exploitation of the spatial locality or the hardware prefetching mechanisms of the platform. Due to this, a tradeoff analysis among reuse, recomputation, input loading cost, and memory requirements has to be conducted. Figure 5 shows an abstract representation of the design space when considering stage fusion. As already mentioned, previous state-of-the-art techniques only consider solutions that reside within a small area of this space. The Halide Auto-Scheduler only produces solutions where the compute granularity of a stage is the same as its storage granularity. As a result, the generated schedules are limited to fully inlined, fully stored, and tiled implementations with redundant computations where the computation and storage are set to the innermost inter-tile level (overlapping tiles). Figure 6 shows the distinct solutions for the above loop permutation of the blur example. The root and inlined solutions have been excluded due to limited reuse, parallelism, or locality as explained in the above example. We can notice that most solutions of the design space are currently not considered and all sliding window opportunities are missed.
Our method enables fast exploration of this new design space. We will show that through the use of heuristics, we can quickly prune the space down to a single solution (e.g., of the 10 valid schedules in Figure 6 , we only need to evaluate 1). This is achieved by automatically eliminating most uninteresting schedules that are pareto dominated by other more efficient solutions. Dominant schedules are considered the following:
(1) Schedules that offer more reuse with the same buffer requirements, e.g., consider the schedules (yi, yi) and (yi, xo). The second schedule provides more reuse while the sliding window optimization allows for the same memory requirements. (2) Schedules that offer the same amount of reuse with the smaller buffer requirements.
e.g., schedules (xo, xo) and (yi, xo) as explained in the previous example.
The final solution is then evaluated through a cost function to pick the tile sizes. The analysis has to be repeated for different loop permutations, since that leads to a different design space with different solutions.
PROPOSED METHOD
In this section, we present each major step in our optimization flow. We follow a grouping-thentiling technique that only attempts to split the pipeline into smaller segments if the initial solution does not fit within the memory constraints. More specifically, Section 4.1 discusses the algorithms responsible for choosing the compute and store level of a stage inside a pipeline (or a segment of a pipeline). Section 4.2 presents the tiling analysis that determines the proper tile sizes for a pipeline/segment. Section 4.3 demonstrates the procedure that is followed to split a pipeline into smaller segments. Some final optimizations are discussed in Section 4.4, and an overview of the optimization flow is given in Section 4.5
Fusion Strategy
This subsection introduces the analysis and heuristics that are used to determine which single point of the fusion space should be chosen for further evaluation. More specifically, this section addresses the problem of choosing the computation and allocation level of each stage inside a pipeline (or a segment of it). As already mentioned, our goal is to eliminate inefficient schedules without evaluating their costs. Algorithms 1 and 2 show the procedure that is followed to accomplish that. In detail, Algorithm 1 takes a pipeline (P) as an input that can be either the whole DAG of the initial pipeline or a sub-graph of it and identifies the compute and store level for each stage (K) in P. However, Algorithm 2 attempts to inline stages with trivial computational costs.
The pipeline can be described as a DAG of m connected nodes such that P = {K 0 , K 1 , . . . , K m }, where K m is the output/final stage of P. Furthermore, to be able to describe all necessary dependencies between the nodes of the DAG, as well as the schedule of each stage, we perform the following definitions for all i ≤ m:
. . xo n2 } that represents the tiled loop nest of K i where the xi and xo are the intra and inter-tile loop indices, respectively.
stage and I l = {E 0 , E 1 , . . . , E z }, z the number of unique indices in the loop nest of
, vϵN, while xϵD K l is the dimension where the dependency exists and v the amount of overlap.
• A tuple S i = (x comput e , x stor e ), xϵD m , which will partially define the final schedule and where x comput e and x stor e are indices of the output domain.
Algorithms 1 and 2 determine the compute and store level of a stage inside a pipeline. More specifically, Algorithm 1 first checks whether a stage has overlap with any of its consumers (whether any of its values can be reused across iterations). If that is true, then the algorithm searches for the dependency index with the highest intra-tile order. If that index is also present in the loop nest of the output stage and is not the innermost one, then it is set as the compute level of the stage. Its store level is set to one level higher to benefit from sliding-window opportunities and ensure that all possible reuse is captured as explained in Section 3. While there might be cases where maximum folding and therefore even smaller buffer requirements can only be obtained by either moving the store level higher or the compute level lower than what Algorithm 1 considers, the above heuristics allow us to quickly choose a single point in the design space while ensuring maximum reuse. Furthermore, if the chosen index corresponds to the innermost intra-tile loop of the loop nest or is a reduction dimension that offers full reuse, then the compute level is also set to one level higher. This decision is made to better exploit spatial locality and hardware prefetching in cases where the compute level is set to the column (as also explained in the motivational example) or vector index (which often corresponds to the innermost intra-tile loop) of a loop-nest and avoid redundant computation in reductions that have full overlap with their consumers. If, however, the chosen index is not found in the output loop nest, then it means that there is no direct overlap between the output domain and the stage that is being scheduled, but dependencies exist across intermediate stages. Its compute and store levels are therefore set to the innermost intra-tile loop of the output stage of that segment. The above method allows us to quickly choose a compute/store level for each stage of the group/segment. Furthermore, as explained in Section 3, moving the compute level even higher (to be the same as the store level) would lead to dominated solutions that require larger buffers only to achieve the same reuse. Finally, if a stage has zero overlap with its consumers, then its computation is inlined. We should note that all stages are scheduled with respect to the output stage of the pipeline. This eliminates any possibilities of nested loop fusion, which would add extra recomputation between the loops. We also introduce notation for two helper functions (select and next) where:
• select returns the first subset (or element) denoted by the first argument that belongs to the tuple (or pair) denoted by the second argument.
• next also takes two arguments and returns the element that belongs to the ordered set specified by the first argument and the position equal to the second argument plus one.
Algorithm 2 uses the partially defined output schedules of Algorithm 1 and attempts to inline the trivial stages of the pipeline. A stage is considered trivial only if its computational cost is equivalent to its load cost (similarly to the analysis followed by the Halide Auto-Scheduler) and only if all of its producers are non-inlined. After finding that a stage is trivial, the algorithm checks whether any of its direct producers that were previously inlined (due to zero overlap) may now have to be scheduled. In such a case, the compute and store levels of the newly found non-trivial stage is set to be the same as the ones of the now inlined stage.
Group Tiling
This section presents the analysis that chooses a proper tiling for a given pipeline/group. Algorithm 3 shows the procedure in detail.
The algorithm requires a pipeline (or segment) P as well as the linearly ordered set D m as inputs. The latter represents the ordering of the tiled loop nest of the output stage and can initially be any (arbitrary) permutation of the loop nest as long as the intra-tile loops (xi) do not mix with the inter-tile ones (xo). The cost of evaluating each stage without any recomputation (C root Ki , S Ki = (root, root )) is computed to be able to calculate the amount of recomputation for a given schedule. While this factor will be constant and will not alter the analysis within a group, it can affect the total cost of the pipeline when a different grouping is considered (and different stages have zero recomputation). As explained in the previous sections, the various discrete points in the fusion space depend on the loop permutation of the pipeline. As a result, the algorithm needs to try all possible intra and inter-tile loop permutations. Since the number of possible schedules explodes for large pipelines with multiple nested loops (such as convolutional neural networks), we do not attempt to interchange the kernels or other loops that only perform a few iterations. This decision We iterate over all possible tile sizes that fit into specific constraints:
• The tile size of the innermost intra-tile dimension (which is not part of the kernel) has to be a multiple of the cache-line size, as well as a multiple of the native vector width.
• The tile size of the outermost inter-tile dimension has to fulfill:
• The tile size in a dimension where a dependency exists has to be at least as large as the amount of maximum overlap in that dimension such that, if x is the dimension of interest then:
In detail the first constraint is imposed to maintain vectorization in conjunction with spatial locality across cache lines. The second constraint ensures that the final schedule will have enough parallelism to utilize the multi-threaded aspects of the target architecture. The third constraint avoids invalid tile sizes that would lead to redundant computations without extra buffer benefits (since due to the inter-stage dependencies the memory allocation would be at least equal to the amount of overlap anyway). Finally, we do not consider tiles where the total footprint of stages without folded storage (compute and store level are the same) does not fit into the L2 cache. This constraint ensures that values that will be immediately consumed and cannot be reused in future iterations stay local. We calculate the costs defined in Section 3 for each stage individually using a weighted cost function and sum them together to compute the total cost of the pipeline. Our cost function uses the load cost of a stage (Cl . . ,Tm n1 ) that minimizes the total cost (C total ) of the pipeline is chosen as the final schedule. 
ALGORITHM 4: Group Stages
Input: P, D m Output: H 0 , H 1 , . . . , H w n P ← 0 for all i in 0 ≤ i < m do if size (W i ) > 1 then S i = {root, root } H n P ← {K 0i , . . . , K ti , K i }, n P + + P ← erase ({K 0i , . . . , K ti , K i }) end end H n ← P for j in 0 ≤ j ≤ n P do SplitSeдment (H j ) end ALGORITHM 5: Split Segments Input: H Output: H 0 , H 1 , . . . , H w TilinдAnalysis (H ) if wset H > csize then n H ← max_split repeat S n H = {root, root } H n ← {K 0 , . . . , K n }, n H + + H ← erase ({K 0 , . . . , K n }) TilinдAnalysis (H n ) if wset H n < csize then SplitSeдment (H ) end until wset H n > csize end
Stage Grouping
If the memory footprint of the final schedule is larger than the size of the last-level-cache, then the pipeline is split into segments, and each segment is scheduled independently of the others. Given the fact that current multi-core architectures contain caches of many MBs in size that will likely fit many stages, our strategy attempts to reduce design time by only attempting to split the pipeline if the initial solution (where all stages are either fused or inlined into the output stage) does not fit into the cache. The memory footprint of the pipeline is equal to the amount of memory required/allocated for all intermediate stages of the pipeline (or segment). Data from intermediate stages will either be stored to be reused in future intra-tile iterations (in circular/folded buffers) or will immediately be consumed in the current intra-tile iteration and are not needed afterward. Buffers in the former category are folded down to the maximum amount of overlap (only in the dimension specified by the compute level of the stage), and their total size needs to fit into the last-level cache for future use, while buffers that will not get folded need to fit inside the L2 cache (such that their data stay local between production/consumption). All buffers are calculated based on the areas required (allocated) by the compiler for a given schedule. Algorithms 4 and 5 show the steps that are followed to split the pipeline P into non-overlapping segments (H 0 , H 1 , . . . , H w ). Algorithm 4 takes the initial pipeline as an input and first checks whether any stages have more than one consumers. In that case, these stages form a new pipeline, along with their producers, and are erased from the initial DAG. This is done to limit the design space and enable faster optimization runtime. While as a result we may end up end up with multiple smaller segments in some pipelines, we did not notice any significant performance degradation due to this fact. Further investigation of performance benefits that may be captured by merging those smaller segments is left as future work. During the next step, we attempt to schedule all new pipelines, along with the remainder of the previous step (remaining stages of the original pipeline). Algorithm 5 checks whether the footprint of the new segments is still larger than the available cache size and then recursively splits those into smaller segments using the following process. Starting from the nth stage of the pipeline, where n is set to max_split (an integer value that controls the minimum size of a segment), we schedule all stages up to the nth. If the segment fits, then we attempt to schedule the remaining stages by recursively repeating the same algorithm. After having evaluated all possible configurations for a specific n, we increase it by 1, and the process is repeated until the working set of the segment no longer fits. This ensures that we skip configurations with invalid segment sizes. Each (unique) valid solution generated by Algorithm 5 where all stages of the original pipeline (P) have been successfully scheduled is cached, and the sum of all independent sub-pipelines' cost is evaluated. The configuration that results in the minimum (summed) cost is chosen as the final solution. We should also note that if the initial value of max_split is set to 1, then the algorithm will evaluate all valid grouping configurations for the pipeline.
Evaluating all possible configurations may require an extensive amount of time for larger pipelines (such as deep neural networks). Our method reduces the runtime of the grouping process by eliminating non-interesting segmentations. This is achieved in two ways:
• Upon identifying that the memory footprint of a segment is larger than the available cache size, we do not attempt to fuse more stages into the same segment. This choice can be explained as follows: A segment with a memory requirement larger than the available cache size will only grow larger if more stages are included into it, especially if the newly included stage has extra dependencies.
• We do not attempt to split the final segment of a pipeline into smaller ones, since that would only add external load costs from the previous root stages to the subsequent consuming ones. This choice allows us to significantly reduce the time needed to find the final configuration, especially in the context of modern multiprocessor architectures with large cache sizes.
The steps followed for an example pipeline can be seen in Figure 7 .
Final Optimizations
Upon finding the final configuration of a pipeline, we have groups of stages with a specified tiling and loop permutation per group. We vectorize the innermost intra-tile loop of a group that is not part of a reduction (or a kernel) and parallelize its outermost inter-tile loop among the platform's threads/cores as explained in Section 4.2 (Equation (14)). However, we have not yet considered any changes in the permutation of individual stages within a stage. We optimize the loop nest of each producing stage within a segment through loop interchange that improves reuse distance by reordering loop indices with minimum strides to be innermost. Moreover, the loop that corresponds to the compute level of a stage is always set as outermost, since that loop will always iterate once (or once plus the equivalent overlap with its consumer) and would add extra loop overhead in any other position. Figure 9 shows the optimization flow for an input pipeline along with all iteration steps involved, while Figure 8 shows the steps followed to schedule each segment (or the initial whole pipeline if it fits in the cache). In detail, for all valid permutations of the tiled loop nest, Algorithm 3 calls Algorithms 1 and 2 to determine the compute/store levels of each stage. It then evaluates the total cost of the pipeline for all valid tile sizes and the combination of D m ,T m (loop permutation and tile sizes respectively) that minimizes C total is chosen. If the memory footprint of the final schedule is larger than the constraints imposed by the last-level cache of the target system, then Algorithms 4 and 5 are used to split the pipeline into smaller segments, with Algorithm 3 (and subsequently Algorithms 1 and 2) used again to schedule each new segment. Every valid configuration (where all stages of the original pipeline have been successfully scheduled) is cached to be evaluated at the end of the process. The segmentation/configuration that minimizes the total cost of the original pipeline is chosen as the final, now scheduled pipeline.
Optimization Flow Overview

EXPERIMENTAL RESULTS
This section demonstrates the results obtained across a wide variety of image-processing applications on three different architectures. 
Experimental Setup
The architectural details of each platform used in the experiments are listed in Table 2 . Table 3 provides a description of each benchmark along with the problem size considered as well as the optimization runtime. Most of the descriptions were found in Reference [15] . The chosen benchmarks include image-processing pipelines used in Reference [15] , the pyramid blending algorithm used in Reference [10] , as well as a popular recent Deep-Neural-Network used for single image super-resolution (VDSR) that was introduced in Reference [12] . All problem sizes are chosen to be the same as the ones found either on the official Halide repository on GitHub [7] or as the ones used in Reference [10] . The optimization flow of most benchmarks is automated, with the exception of three pipelines (that are marked with an asterisk). These are only partially automated due to complex reductions and inter-stage dependencies such as extensive helper-function usage (argmin, maximum, lerp, etc.) in these applications. As a result, some steps of the algorithms defined in Section 4 had to be manually implemented for these pipelines. However, such dependencies can easily be derived by the Halide compiler and therefore will easily be captured when our framework has access to all information available to the compiler. In the following graphs, the manual implementations refer to the manual schedules found in the Halide repository (the only exceptions being the pyramid benchmark, the manual schedule of which was found in Reference [10] , as well as the VDSR network that we implemented in Halide). The PolyMage-A and PolyMage-DP implementations refer to the results replicated using the artifacts and instructions provided by the authors of References [10] and [20] . However, implementations were provided for only six benchmarks, which are also the ones considered in Reference [10] .
We compare our results to the equivalent ones produced by the other methods: Since all of our applications are implemented in Halide, we can use the Halide Auto-Scheduler to produce schedules for all benchmarks. Each benchmark is executed 100 times, and the average execution time per run is measured. This process is repeated multiple times per benchmark and the minimum average among those is used as the final average execution time. Furthermore, we properly adjust the optimizations parameters of both the Auto-Scheduler and PolyMage before our experiments for the solutions to be tuned to the target platforms. Since PolyMage cannot explicitly vectorize loops (unlike Halide), the performance of the PolyMage implementations is highly influenced by the efficiency of the auto-vectorizer of the back-end compiler [10] . Finally, the problem size used in Reference [10] for the harris and unsharp benchmarks differs from the one in the Halide repository. We therefore repeat the experiments for this problem size as well and the results of this comparison can be seen in Table 4 . Halide was built using llvm 4.0.0, while the PolyMage implementations were compiled using icpc on the i7-6700 platform and gcc on the i7-5930K and ARM platforms.
At this point, it is important to emphasize that, as seen in Figures 8 and 9 , all of the proposed algorithms are tightly coupled. As explained in the motivational example ofs Section 3, sliding windows and circular buffers allow for tile sizes that would otherwise be impossible to consider (e.g., large tile strips that otherwise would never fit into the local buffer constraints imposed by the cache size). As a result, evaluating each algorithm independently is not possible; they should all be considered together. Fast bilateral filter using the bilateral grid [3] . Constructs the grid using a histogram reduction, followed by stencil and sampling operations. 4ms unsharp 6 stages 2,560 × 1,536 × 3 Enhances local contrast by smoothing an image with a small support Gaussian and subtracting it from the original to isolate the high-frequency content, which is then combined with the original image. The Frankencamera pipeline for processing raw data from an image sensor into a color image [1] . The pipeline performs hot-pixel suppression, demosaicing, color correction, gamma correction, and contrast. Given a rectified stereo pair of images, produces a synthetic shallow-depth-of-field image. It first solves for depth by constructing and filtering a cost volume [22] using a convolution pyramid [6] , then renders the synthetically defocused image by randomly sampling the source image over a virtual aperture.
(617ms)* nlmeans 13 stages 614 × 1,024 × 3 Fast non-local means image denoising using the method of [5] . Computes a 7×7 image blur with weights determined by 7 × 7 patch similarity (4ms)* Figure 10 shows the average execution time (in ms) for each benchmark on the two Intel platforms listed in Table 2 . The results for the harris and unsharp benchmarks on the problem size of the PolyMage implementations can be seen in Table 4 . Our schedules outperform the Auto-Scheduler solutions in almost all cases, with the Laplacian benchmark being the only exception on the Intel i7-5930K, where the difference in execution time is still within ≈2%. We noticed that while the initial Auto-Scheduler paper optimizes for L2 cache size, the currently used and updated one is targeting the shared last-level cache. We conducted multiple experiments for both choices and noticed that while some benchmarks experience a slight performance improvement when the memory footprint constraint is set to the size of L2, other ones suffer a dramatic performance degradation. As a result, we choose to use the currently advised method of optimizing for last-level cache in the results. Finally, our schedules are also comparable or even better than the manual ones in many cases.
Performance Results
PolyMage-DP performance is similar to the Auto-Scheduler in almost all cases. The constant updates and focus on the Auto-Scheduler by the Halide community may explain the difference in the results presented here and the ones in Reference [10] between the two methods. The efficiency of auto-tuned PolyMage-A solutions vary per benchmark and platform: The raw camera and bilateral grid implementations of the auto-tuned solutions on the intel i7-6700 are close to (or slightly better than) the manual Halide schedules. However, they are much less efficient compared to the other implementations of the harris filter on both Intel platforms.
The results for the ARM Cortex platform can be seen in Figure 11 and Table 4 , where a similar pattern can be discerned. Our schedules outperform both the manual and auto-scheduled ones with the largest differences observed in the interp, laplacian, and VDSR benchmarks. The performance of the PolyMage solutions varies per benchmark. For example, it performs significantly worse than the Halide solutions on the camera pipeline, slightly better than both the Auto-Scheduler and the manual Halide schedule on the interp benchmark, and much faster than all Halide solutions in the bilateral pipeline. The main reasons behind this result are the differences in the functional description of the pipeline between the Halide and PolyMage implementations: Halide uses a built-in linear interpolation function that performs more complex computations than the PolyMage implementation of it. Upon forcing Halide to use a simpler approach, performance was improved by up to 40% in all three cases (Auto-Scheduler, Manual, and Proposed). Furthermore, Halide requires all expressions used as indices in functions/stages to be bounded and therefore performs extra clamping in two stages for the compiler to be able and derive the bounds of the equivalent producers. These extra computations are the main bottlenecks in the performance of the Halide bilateral pipeline on the ARM platform. However, finding an efficient description is outside the scope of this work.
Finally, to test the efficiency of our grouping strategy (Algorithms 4 and 5), we repeat the experiments for the VDSR network and investigate the performance for various problem sizes. We choose VDSR, since it consists of sequential stages where each subsequent stage consumes the output of the previous one (except for the input image that is consumed twice). This benchmark is therefore a good candidate for such an experiment, since various problem sizes will lead to different tiling choices and therefore different memory footprints, which, due to a constant memory constraint will require new segmentations. The results of this experiment on the Intel i7-6700 platform for five different dimensions of the output image are presented in Figure 12 . Our schedules perform more than 2× better than the equivalent Auto-Scheduled solutions for large problem sizes.
To demonstrate the robustness of our method, the same experiment was conducted on another platform (with an Intel i7-6560U processor) once with the hardware prefetcher enabled and once with the hardware prefetcher disabled. The experiments followed a similar trend as in Figure 12 when comparing the two implementations. Furthermore, the performance degradation when the hardware prefetcher was disabled in our solutions was close to 20% while for the Auto-Scheduler solutions it was more than 2×. Upon further investigation, we noticed that the loop permutation chosen by the Auto-Scheduler (which attempts to reorder loops based on their stride, i.e., placing the loop with the smallest stride innermost) interleaves the column, row, and kernel dimensions, limiting the amount of spatial reuse that can be captured in the process. This incurs a high penalty when the hardware prefetchers are disabled. However, our proposed method does not reorder loops with low iterations (similar to the 3 × 3 convolution kernels) and only attempts to exploit prefetching when determining the tile size dimensions (Algorithm 3). Setting the kernel inner to the column and row dimensions allows data to stay in the local caches (or even registers) before they are reused. As a result, self-spatial reuse can still be exploited across kernel iterations, and this explains why our schedules do not suffer as much when hardware prefetchers are disabled.
Finally, based on the above results (Figures 10-12 ), we can observe that larger pipelines with multiple stages such as the interp, laplacian, and VDSR benchmarks benefit the most from our schedules, where sliding window opportunities are easily captured, buffers are folded down to smaller memory footprints, and new tiling opportunities are considered. Moreover, even in cases where the pipeline does not offer such opportunities (e.g., bilateral, nlmeans), our solutions remain similar to (or in many cases even better than) both the Auto-Scheduler, and the manual solutions.
CONCLUSIONS
In this work, we present a novel platform-aware algorithm for the optimization of imageprocessing pipelines running on multi-core CPU-based architectures. We show that our method captures solutions of the design space that were not covered in previous state-of-the-art techniques by effectively considering combinations of loop tiling, interchange, and stage fusion with independent computation and allocation per stage. Our model takes into account multiple architecturespecific parameters such as multithreading, vectorization, and hardware prefetching. We evaluate our proposed method across a variety of image-processing applications implemented in the Halide DSL and compiler and compare it to both previous state-of-the-art techniques that target the Halide and PolyMage DSLs, as well as manually optimized Halide solutions. Experimental results show significant average performance improvements compared to previous related work as well as the manually optimized implementations of Halide pipelines.
