INTRODUCTION
Today's hardware diversity heightens the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading, or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to more efficient but more specialized hardware. This requires static analysis to identify the kernel input (data read) and output (data produced), and code generation for the kernel itself, the associated transfers, and the synchronization with the rest of the code (on the host CPU). In general, such tasks are done by the developer, who is required to make the communications explicit, allocate and size the intermediate buffers, and segment the kernel into fitting chunks of computation. When a single kernel is offloaded in a three-phase process (i.e., upload, compute, store back), such programming remains feasible: for GPUs, developers can use OpenCL or CUDA, or rely on higher-level abstractions, such as the directives of OpenACC (http://www.openacc-standard.org/) or the garbage collector mechanisms of SPOC (http://mathiasbourgoin.github.io/SPOC/). However, in some cases, it is necessary to decompose a kernel into a sequence of smaller kernels (to get blocking algorithms, thanks to loop tiling) that are optimized with pipelined communications and data reuse among blocks (tiles).
The choice of tile sizes is driven by hardware capabilities such as memory bandwidth, memory size and organization, and computational power; such codes are extremely hard to obtain without automation and some cost model. The contribution supported by this abstract and the associated poster is an analysis technique, parametric with respect to tile size, to perform these steps, including inter-tile data reuse and pipelining, using polyhedral optimizations. It has been presented at the IMPACT'14 workshop [2].
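To make the notions of per-tile transfers and inter-tile reuse concrete, here is a minimal sketch in Python. It is not the parametric polyhedral analysis described in this abstract: it enumerates the sets by brute force for one fixed problem size and tile size, on a hypothetical 1D three-point stencil chosen purely for illustration. The function names (tile_reads, tile_writes, transfer_sets, run_offloaded) are assumptions made for this example, and pipelining of the transfers is not modeled.

```python
def tile_reads(origin, b, n):
    """Cells of `a` read by the tile covering i in [origin, origin + b) of the
    loop `for i in range(1, n - 1): out[i] = a[i-1] + a[i] + a[i+1]`."""
    cells = set()
    for i in range(origin, min(origin + b, n - 1)):
        cells.update((i - 1, i, i + 1))
    return cells


def tile_writes(origin, b, n):
    """Cells of `out` produced by the same tile."""
    return set(range(origin, min(origin + b, n - 1)))


def transfer_sets(n, b):
    """Return one (copy_in, copy_out) pair per tile, in execution order.

    copy_in only contains cells that no earlier tile has already brought to
    the accelerator: this is the inter-tile reuse.
    """
    already_local = set()
    plan = []
    for origin in range(1, n - 1, b):
        reads = tile_reads(origin, b, n)
        copy_in = reads - already_local
        already_local |= reads
        plan.append((copy_in, tile_writes(origin, b, n)))
    return plan


def run_offloaded(a, b):
    """Simulate upload / compute / store back, tile by tile.

    `local` stands for the accelerator memory. It is never shrunk here, so
    this sketch does not model array contraction.
    """
    n = len(a)
    out = [0] * n
    local = {}
    for copy_in, copy_out in transfer_sets(n, b):
        for k in copy_in:                       # upload
            local[k] = a[k]
        results = {i: local[i - 1] + local[i] + local[i + 1]
                   for i in copy_out}           # compute
        for i, value in results.items():        # store back
            out[i] = value
    return out
```

With 10 cells and tiles of size 4, the first tile uploads 6 cells and the second only 4, so each input cell is transferred once. The point of a parametric analysis is to obtain such sets as closed-form expressions in the tile size rather than by enumeration.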
Dependence analysis and code generation for loop tiling (for constant tile sizes) are well established in the polyhedral model, i.e., for a set of nested for loops, writing and reading multi-dimensional arrays and scalar variables, where loop bounds, if conditions, and array access functions are affine expressions of surrounding loop counters and structure parameters. Under similar assumptions, our work generalizes to parametric tile sizes the approach of [1], which showed the feasibility and efficiency of such kernel offloading for FPGA, as a source-to-source process on top of the Altera C2H HLS tool. Similar results, with data reuse between two successive tiles only, were demonstrated for the Xilinx AutoESL tool [4]. "Guessing" the right tile sizes can be laborious, especially when dealing with multi-level tiling and multi-level caches. The search space becomes so wide that even iterative compilation might not be sufficient.
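As an illustration of the setting, the following Python sketch shows a loop nest with affine bounds (a triangular iteration space) and a tiled version of it for a constant tile size b. The nest and the function names are assumptions made for this example and are not taken from [1] or [4]. The sketch only checks that tiling visits the same iterations; whether the reordering is legal for a given kernel depends on its dependences, which is what the dependence analysis establishes.

```python
def original(n):
    """Iterations of: for i in 0..n-1, for j in i..n-1 (affine bounds)."""
    return [(i, j) for i in range(n) for j in range(i, n)]


def tiled(n, b):
    """Same iterations, grouped into b x b tiles aligned on multiples of b."""
    order = []
    for ti in range(0, n, b):                # tile origin along i
        for tj in range(ti, n, b):           # tile origin along j (j >= i)
            for i in range(ti, min(ti + b, n)):
                for j in range(max(tj, i), min(tj + b, n)):
                    order.append((i, j))
    return order
```

For any n and b >= 1, sorted(tiled(n, b)) equals original(n): every iteration is executed exactly once, only the order changes. Here b is a fixed integer; once b becomes a symbolic parameter, bounds such as ti + b with ti a multiple of b are no longer affine in the tile index.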
PARAMETRIZATION
Our parametric technique provides a direct expression of the copy-in/copy-out sets for each tile. It can then be used to perform array contraction on the accelerator (exploiting the liveness of array cells for memory reuse), still in a parametric fashion. Such results are quite surprising, as parametric tiling is often considered to necessarily involve quadratic constraints, i.e., not to be analyzable within the polyhedral model (http://polyhedral.info). We used a different way of reasoning, which consists in also considering tiles that are not actually executed: those that are not aligned with the grid. This makes it possible to eliminate all the quadratic constraints, and we proved that, if done well, this actually gives the exact expressions of the copy-in/copy-out sets. Parametric code generation can then be handled with techniques such as [3]. Furthermore, we showed that this reasoning can also be extended to the case of approximations, which are needed when dealing with kernels that are not fully affine, or when approximations of communications are desired for code simplicity or architectural constraints (e.g., vector communication). The main difficulty with approximation is that, when data are updated in some of the kernels, blindly loading an over-approximated set from main memory is not safe, because the main memory is not up to date. We proved that, assuming sane constraints on the approximation scheme, we could perform the analysis using the same idea without losing precision. We also provide a strategy to transform an unsafe approximation scheme into a sane one, without losing too much precision.
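The role of unaligned tiles can be illustrated in one dimension. The symbols b (tile size), T (tile index), I (tile origin), and the three-point stencil reading a[i-1], a[i], a[i+1] are assumptions introduced here for illustration; boundary tiles are ignored. This is a sketch of the idea, not the general construction.

```latex
% Aligned tile of index T and size b: the product bT is quadratic
% as soon as both b and T are unknowns.
\[ \mathcal{T}_{\mathrm{aligned}}(T) = \{\, i \mid bT \le i \le bT + b - 1 \,\} \]

% Tile of arbitrary origin I: all constraints are affine in (i, I, b).
\[ \mathcal{T}(I) = \{\, i \mid I \le i \le I + b - 1 \,\} \]

% The tiles actually executed are those with I = bT, but the analysis
% can be carried out for every I.

% Cells read by the tile of origin I for the three-point stencil:
\[ \mathrm{Read}(I) = \{\, m \mid I - 1 \le m \le I + b \,\} \]

% Cells to load when the previous tile, of origin I - b, is reused (b >= 1):
\[ \mathrm{Read}(I) \setminus \mathrm{Read}(I - b) = \{\, m \mid I + 1 \le m \le I + b \,\} \]
```

Every set above is described by affine constraints in which b appears only as an additive parameter, so set operations such as the difference in the last line stay within the polyhedral model.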
CONCLUSION
While this work is so far mostly at the level of code analysis, partial experiments on FPGA and GPU can be shown. Several steps remain to be completed, in particular with respect to approximation schemes, but the top stages (copy-in/copy-out sets and array contraction) have been tested with iscc, the isl calculator. The perspectives of this work are numerous: (1) build cost models for memory transfers with data reuse to do the actual tile size selection, (2) design general schemes for approximations of data sets, (3) explore the link with array region analysis, (4) integrate the technique on top of languages such as OpenACC, or compilers such as ppcg [5]. This paves the way to the automatic derivation of blocking algorithms for accelerators without the difficulty of having to choose optimized tile sizes a priori.
ACKNOWLEDGMENTS
I wish to thank my Ph.D. advisor, Dr. Alain Darte, as well as Sven Verdoolaege for his help in using isl (Integer Set Library: freecode.com/projects/isl) and iscc, and for his suggestion that set differences and relations could solve the non-parametric problem as efficiently as in [1].
