A Quick Introduction to Functional Verification of Array-Intensive
  Programs by Banerjee, Kunal & Karfa, Chandan
arXiv:1905.09137v1 [cs.PL] 22 May 2019
A Quick Introduction to Functional Verification of
Array-Intensive Programs
Kunal Banerjee
Parallel Computing Lab
Intel Corporation
Bangalore, India
kunal.banerjee@intel.com
Chandan Karfa
Dept. of Computer Sc. & Engg.
IIT Guwahati
Guwahati, India
ckarfa@iitg.ac.in
Abstract—Array-intensive programs are often amenable to
parallelization across many cores on a single machine as well
as scaling across multiple machines and hence are well explored,
especially in the domain of high-performance computing. These
programs typically undergo loop transformations and arithmetic
transformations in addition to parallelizing transformations.
Although a lot of effort has been invested in improving parallelizing
compilers, experienced programmers still resort to hand-optimized
transformations, which are typically followed by careful
tuning of the transformed program to finally obtain the optimized
program. Therefore, it is critical to verify that the functional
correctness of an original sequential program is not sacrificed
during the process of optimization. In this paper, we cover
important literature on functional verification of array-intensive
programs which we believe can be a good starting point for one
interested in this field.
Index Terms—formal verification, functional correctness,
array-intensive programs, optimization
I. INTRODUCTION
Recent years have seen a boost in high-performance computing
due to the introduction of high-performance GPUs and specially
designed coprocessors, such as Xeon Phi. The primary
workloads that have benefited from this new hardware
are the array-intensive programs (also known as dataflow
programs). These programs are amenable to parallelization and
various application programming interfaces (APIs) exist which
help in exploiting the many cores available nowadays on a
machine, such as OpenMP [1] and OpenACC [2], and also
to parallelize across multiple machines, such as MPI [3] and
OpenCL [4]. However, although a lot of research has gone into
improving parallelizing compilers [5], [6], these compilers are
still not popular among experienced programmers who prefer
applying hand-crafted transformations to a sequential program,
followed by a tuning phase of the transformed program which
finally results in an optimally performing program. In such
a case, it is even more crucial to verify that the parallelizing
transformations, along with other optimizations such as loop
transformations and arithmetic transformations, preserve
semantic equivalence with the sequential code. In this paper, we
present a short survey of the significant work done in the field
of functional verification of array-intensive programs.
Additionally, it is important to note that customized hard-
ware has received a lot of attention recently due to the rapid
proliferation of deep learning – be it coarse-grained recon-
figurable architecture (CGRA) [7], [8], FPGA [9] or systolic
array [10]. A lot of compiler frameworks have also come up
to provide support for the heterogeneous architecture that may
be used to train the neural networks, for example, Google’s
Tensorflow XLA [11], Amazon endorsed NNVM [12] and
Facebook’s Glow [13]. New programming methodologies are
also being proposed to further boost programmer productivity
by exploiting underlying architecture with minimal program-
ming overhead; the work reported in [14], for instance, intro-
duces an abstraction layer in the program called T2S (temporal
to spatial) which takes a temporal definition (basically, the
function to be computed) and generates its spatial mapping
(i.e., decompose the specified function and map the decom-
posed pieces onto a spatial architecture, e.g., CGRA or FPGA).
There is always a possibility that the compiler, e.g. [11]–[13],
may have an implementation bug or the specification provided,
say in T2S [14], has some inherent logical mistake because
of which the prescribed functionality cannot be efficiently
mapped to a given FPGA. We found that literature on verifica-
tion of specifically tailored transformations for mapping neural
networks onto heterogeneous architecture (which primarily
consists of the transformations discussed in this paper) is still
at a nascent stage. In particular, the work reported in [15]
offloads the onus of compiler transformation verification by
translating its internal representation (IR) to the well-known
LLVM [16] IR and then leveraging existing verification techniques
for the LLVM compiler. Therefore, we believe that our
paper should also appeal to those who are interested in
designing and/or verifying compiler transformations directed
towards deep learning applications or artificial intelligence, in
general.
The rest of the paper is organized as follows. Section II
illustrates the benefits of applying loop transformations and
arithmetic transformations to array-intensive programs. Section III
covers significant work on verifying
such programs with the help of an example. Section IV
concludes the paper.
II. OPTIMIZING TRANSFORMATIONS
A. Applications of loop transformations
Multimedia and signal processing applications have wit-
nessed extensive application of loop transformations and arith-
metic transformations. In the following, we study several
applications of loop transformation techniques during (embedded)
software design.
The effects of loop transformations on system power have
been studied extensively. The work reported in [17] underlines
the effect of loop fusion, loop fission, loop unrolling, loop
tiling, and scalar expansion on energy consumption. The
futility of applying conventional data locality oriented code
transformations for minimizing disk power consumption has
been showcased in [18]. As a remedy, the authors of [18] sug-
gest how both code restructuring and data locality optimization
should be taken into consideration for designing a disk layout
aware application optimization strategy. Specifically, the au-
thors focus on three optimizations – loop fusion/fission, loop
tiling and linear optimizations – for code restructuring and ad-
vocate a unified optimizer that targets disk power management
by applying these transformations. In [19], the authors focus
on an MPSoC architecture with a banked memory system.
For this architecture, they demonstrate how code and data
optimizations assist in reducing memory energy consumption
for embedded applications with regular data access patterns.
The work in [20] achieves minimization of the data memory
requirements of processors by using a memory-conscious loop
parallelization strategy. A data space-oriented tiling (DST) ap-
proach is proposed in [21] whereby the data space is logically
divided into chunks called data tiles. DST has the potential
to achieve better results than conventional loop tiling because it
exploits inter-nest data locality: the data space is common
across all loop nests that access it. A global approach to tackle
the data locality problem is prescribed in [22], which evaluates all
the loop nests in an application to be run in an embedded
MPSoC simultaneously and schedules the different constituent
modules accordingly for parallel execution. In the context
of an embedded chip multiprocessor, the method described
in [23] underlines how reliability against transient errors can
be enhanced without sacrificing execution time by replicating
some of the operations being executed on active processors
onto (otherwise) idle processors.
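As a small illustration of one transformation named above, the following sketch contrasts two separate loops with their fused form. It is not taken from any of the cited works; the function names and the computation are hypothetical, and unsigned arithmetic is used only so that overflow is well defined. Fusion is legal here because iteration i of the second loop depends only on iteration i of the first.

```c
#include <stddef.h>

/* Two separate loops: tmp is written in full, then read back. */
void square_plus_one_split(const unsigned *x, unsigned *tmp,
                           unsigned *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        tmp[i] = x[i] * x[i];
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] + 1;
}

/* Fused loop: the temporary array shrinks to a scalar, so each
   intermediate value is consumed right after it is produced. */
void square_plus_one_fused(const unsigned *x, unsigned *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned t = x[i] * x[i];
        out[i] = t + 1;
    }
}
```

Reading the two functions in the opposite direction gives an instance of loop fission.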
Loop transformations have found application in the design
of system memory as well. For example, the authors of [24]
explain a method in the context of multimedia applications
running on MPSoCs that can reduce cache misses and also
cache size. Specifically, loop fusion and loop tiling are em-
ployed to minimize cache misses, whereas a novel buffer
allocation strategy is used to reduce cache size. This work is
extended in [25] to handle dependence-free arrays additionally.
Here an input-conscious tiling scheme for off-chip memory
access optimization is proposed. The authors show that
the input arrays play as important a role as the arrays with
data dependencies when the objective is memory access optimization
instead of parallelism extraction. Data reuse is a key
process that may potentially reduce external memory access
by exploiting the memory hierarchy. Loop transformations for
data locality and memory hierarchy allocation are important
procedures in the optimization flow for data reuse. A global
approach that provides optimal results on external memory
bandwidth and on-chip data reuse buffer size by combining
loop transformations and memory hierarchy allocation can
be found in [26]. An extension of this work is presented
in [27] that optimizes on-chip memory allocation by loop
transformations in the imperfectly nested loops. A dynamic
loop tiling strategy is proposed in [28], [29] to enhance cache
locality and obtain coarse-grained parallelism.
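To make loop tiling concrete, here is a minimal sketch of a tiled matrix transpose. The sizes DIM and TILE are hypothetical values picked for illustration (with DIM a multiple of TILE), and the code is not drawn from [24]–[29]. It shows only the reordering of iterations into blocks; any cache benefit depends on the target machine.

```c
#include <stddef.h>

/* Hypothetical sizes chosen for illustration; DIM is a multiple of TILE. */
enum { DIM = 64, TILE = 16 };

/* Untiled transpose: dst is written with a stride of DIM elements. */
void transpose_plain(int src[DIM][DIM], int dst[DIM][DIM]) {
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            dst[j][i] = src[i][j];
}

/* Tiled transpose: the same iterations, grouped into TILE x TILE
   blocks so that each block of src and dst is finished before moving on. */
void transpose_tiled(int src[DIM][DIM], int dst[DIM][DIM]) {
    for (size_t ii = 0; ii < DIM; ii += TILE)
        for (size_t jj = 0; jj < DIM; jj += TILE)
            for (size_t i = ii; i < ii + TILE; i++)
                for (size_t j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

Both functions execute exactly the same set of assignments, only in a different order, which is why a verifier can treat tiling as a pure reordering of the iteration space.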
In [30], the authors undertake the challenge of reducing the
total energy while maintaining the performance requirements
for applications with multi-dimensional nested loops. They
demonstrate that an adaptive loop parallelization strategy,
along with idle processor shutdown and pre-activation, can be
critical in minimizing energy consumption without increasing
execution time. The work in [31] has the same objective
as [30], but applies loop fusion and
multi-functional unit scheduling techniques to achieve it.
Loops containing nested conditional blocks can pose a serious
challenge to compilers producing optimized code. This
problem is tackled in [32]. This work statically analyzes the
Boolean conditions appearing at branching states in the control
flow of a program using a novel interval analysis technique.
The outputs of interval analysis integrated with those of loop
dependency are used to segregate the iteration space of the
nested loops.
A survey on the application of loop transformations to data
and memory optimization in embedded systems can be found
in [33], whereas the benefits of these transformations outlined
in [34] are targeted for a more general software design. For
details on some of the pioneering work on program transfor-
mations targeting reduction of energy consumption in dataflow
programs, one may refer to [35], [36]. Another work [37] tries
to reduce the use of temporary arrays, which may eventually
result in better register usage, by using loop fusion technique in
multimedia applications before hardware/software partitioning
is carried out. Loop transformations have also been applied to
improve performance in coarse-grained reconfigurable archi-
tecture [38]. Applications of loop transformations to parallelize
sequential code targeting embedded multi-core systems are
given in [39], [40]. Interested readers are encouraged to look
into [41]–[45] for several other loop transformation techniques
and their effects on system design.
B. Applications of arithmetic transformations
Compiler optimizations often involve several arithmetic
transformations based on algebraic properties of the oper-
ator such as associativity, commutativity and distributivity,
arithmetic expression simplification, constant folding, common
sub-expression elimination, renaming, dead code elimination,
copy propagation and operator strength reduction, etc. For
example, the work [46] demonstrates how applying retiming,
algebraic and redundancy manipulation transformations can
drastically improve the performance of embedded systems.
The authors of [47] investigate source-to-source algebraic
transformations which aid in decreasing the execution time
of expression evaluation; its benefit on performance has
been recorded on many computationally intensive applications.
They, basically, propose two algorithms based on factorization
and multiply-add extraction heuristics to replace traditional
associative commutative pattern-matching techniques. Another
method that minimizes operation cost based on loop-invariant
code motion and operator strength reduction is reported
in [48]. An in-depth experimental analysis on the effectiveness
of such source-level transformations at the level of number of
execution cycles, before and after applying the optimizations,
is given in [48] for two real-life multimedia application
kernels. Application of algebraic transformations to minimize
critical path length in the domain of computationally inten-
sive applications is proposed in [49]. Apart from standard
algebraic transformations such as commutativity, associativity
and distributivity, they also introduce two hardware related
transformations based on operator strength reduction and con-
stant unfolding. A set of transformations such as common
sub-expression elimination, renaming, dead code elimination
and copy propagation are applied along with code motion
transformations in the pre-synthesis and scheduling phase
of high-level synthesis in the SPARK tool [50], [51]. The
potential of arithmetic transformations on FPGAs is studied
in [52]. It has been shown that operator strength reduction
and storage reuse reduce the area of the circuit and hence the
power consumption in FPGA. The transformations like height
reduction and variable renaming reduce the total number of
clock cycles required to execute the programs in FPGAs,
whereas expression splitting and resource sharing reduce the
clock period of the circuits.
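The following sketch combines three of the arithmetic transformations mentioned above on a toy kernel: factorization via distributivity, loop-invariant code motion, and operator strength reduction on an index computation. The kernel and all names are hypothetical and are not taken from [46]–[52]; unsigned arithmetic is assumed so that both versions agree even when values wrap around.

```c
#include <stddef.h>

/* Original form: two multiplications per element, plus an index
   multiplication for the strided store. */
void scale_orig(const unsigned *x, unsigned *y, size_t n,
                unsigned c, unsigned d, size_t stride) {
    for (size_t i = 0; i < n; i++)
        y[i * stride] = x[i] * c + x[i] * d;
}

/* Transformed form of the same computation. */
void scale_opt(const unsigned *x, unsigned *y, size_t n,
               unsigned c, unsigned d, size_t stride) {
    const unsigned k = c + d;  /* loop-invariant code motion */
    size_t off = 0;            /* strength reduction: i * stride becomes a running sum */
    for (size_t i = 0; i < n; i++) {
        y[off] = x[i] * k;     /* factorization: x*c + x*d == x*(c+d) */
        off += stride;
    }
}
```

The output array is assumed to hold at least (n - 1) * stride + 1 elements in both versions.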
III. VERIFICATION OF OPTIMIZING TRANSFORMATIONS
Let us consider the two functions shown in Figure 1. The
function shown in Figure 1(b) has been obtained from that of
Figure 1(a) by applying the following transformations.
Application of loop transformation: The for loop in
program1 has been split into two (an instance of loop
fission).
Application of arithmetic transformations: The statements S0
and S1 in program1 have been merged into one statement
T0 in program2 – it is regarded as a linear arithmetic
transformation. Moreover, distributive property of multiplica-
tion over addition has been applied to reduce the number of
temporary array variables from two (I1 and I2) to one (I) –
it is regarded as a non-linear arithmetic transformation because
it involves multiplication of two array variables.
Application of parallelizing transformations: OpenMP direc-
tives have been introduced in program2 to make the execu-
tion of its for loops parallel.
Also note that the functions contain recurrence because the
array C is defined in terms of previously defined elements of
the same array. Now let us go through the various methods
reported in literature for verifying these individual transforma-
tions or some combination of these.
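Before turning to formal methods, the loop and arithmetic transformations of Figure 1 can be sanity-checked by running both versions on the same input. The sketch below does this under several assumptions that are not in the figure: N is fixed to a hypothetical value of 8, the arrays use unsigned elements so that overflow wraps instead of being undefined, the OpenMP directives are left out so that both versions run sequentially, and the function names are invented. Agreement on sample inputs is only testing; it does not establish equivalence for all inputs, which is what the methods surveyed below aim at.

```c
#include <stddef.h>

enum { N = 8 };  /* hypothetical size; Figure 1 leaves N unspecified */

/* Sequential original, following Figure 1(a). */
void program1_ref(const unsigned A[], const unsigned B[], unsigned C[]) {
    unsigned I1[N], I2[N];
    C[0] = A[0] + 2;
    C[0] += B[0] + 2;
    for (size_t i = 1; i < N; i++) {
        I1[i] = A[i] * C[i - 1];
        I2[i] = B[i] * C[i - 1];
        C[i] = I1[i] + I2[i];
    }
}

/* Transformed version, following Figure 1(b) without the OpenMP directives. */
void program2_ref(const unsigned A[], const unsigned B[], unsigned C[]) {
    unsigned I[N];
    C[0] = A[0] + B[0] + 4;
    for (size_t i = 1; i < N; i++)
        I[i] = A[i] + B[i];
    for (size_t i = 1; i < N; i++)
        C[i] = I[i] * C[i - 1];
}

/* Returns 1 when both versions agree on every element of C. */
int same_output(const unsigned A[], const unsigned B[]) {
    unsigned C1[N], C2[N];
    program1_ref(A, B, C1);
    program2_ref(A, B, C2);
    for (size_t i = 0; i < N; i++)
        if (C1[i] != C2[i]) return 0;
    return 1;
}
```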
Verification of loop transformations on array-intensive programs
is a well-studied problem. Some approaches rely on
transformation-specific verification rules. The techniques reported
in [53], [54], for example, propose permutation rules for verifying
loop tiling, loop reversal, loop skewing and loop interchange
transformations in their translation validation approach. The
rule set is further enhanced in [55], [56]. The primary issue
with this approach is that the method has to rely on
hints provided by the compiler. The verifier needs the list
of transformations that have been applied and the order in
which they have been applied from the synthesis tool. Also,
completeness of the verifier depends on the completeness of
the rule set and therefore enhancement of the repository of
transformations necessitates enhancement of the rule set.
The concept of fractal symbolic analysis is introduced
in [57]. The idea is to reduce the gap between the source
and the transformed programs by applying simplification rules
repeatedly until the two programs become similar enough to
allow a proof by symbolic analysis. The rules are similar to
the ones proposed by [53], [54]. This method combines some
of the power of symbolic analysis with the tractability of
dependence analysis. The applicability of this technique again
depends on the robustness of the provided rule set.
The design of a fully automatic verifier for loop transfor-
mations can be found in [58]. Preservation of data depen-
dencies between the original and the transformed programs
at a statement-level forms the central idea in this work. This
method, however, does not have provision to handle arithmetic
transformations. Since it is common that the arithmetic trans-
formations and the loop transformations are applied together,
direct correspondence between the statement classes of the
original and the transformed programs does not always hold
as necessitated by [58].
Off-the-shelf SMT solvers, such as CVC4 [59], Yices [60],
or theorem provers, such as ACL2 [61], have also been
demonstrated to verify loop transformations and arithmetic
transformations [62]. It is more or less straightforward to
model the equivalence of two programs with a formula. The
validity of the formula can be checked by an SMT solver or
theorem prover [62]; if the formula is found to be valid, then
the two programs are indeed equivalent. Note
that although SMT solvers and theorem provers can
be effective in handling linear arithmetic, the presence of non-linear
arithmetic often makes these tools falter in proving
the equivalence; in such scenarios, these tools either output
“unknown” indicating that they failed to prove either equiv-
alence or non-equivalence of the programs or they time out
without producing an output. It has been shown in [62] that
state-of-the-art SMT solvers failed to verify most of the loop
transformations and arithmetic transformations.
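As an illustration of the kind of formula involved, the statement-level obligations for Figure 1 could be written as follows. This is a sketch that assumes unbounded mathematical integers; the symbols a_0, b_0, a, b and c are introduced here for the values of A[0], B[0], A[i], B[i] and C[i-1], and this is not the encoding used in [62].

```latex
\begin{align*}
\text{(linear, } S0, S1 \text{ vs. } T0\text{)} \quad
  & \forall a_0, b_0 \in \mathbb{Z}:\; (a_0 + 2) + (b_0 + 2) = a_0 + b_0 + 4 \\
\text{(non-linear, } S2\text{--}S4 \text{ vs. } T1, T2\text{)} \quad
  & \forall a, b, c \in \mathbb{Z}:\; a \cdot c + b \cdot c = (a + b) \cdot c
\end{align*}
```

The first obligation lies within linear integer arithmetic. The second multiplies two variables, which places it in non-linear arithmetic, the fragment on which the tools discussed above tend to struggle.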
void program1(
    int A[], int B[], int C[]) {
  int i, I1[500], I2[500];
  S0: C[0] = A[0] + 2;
  S1: C[0] += B[0] + 2;
  for (int i=1; i<N; i++) {
    S2: I1[i] = A[i] * C[i-1];
    S3: I2[i] = B[i] * C[i-1];
    S4: C[i] = I1[i] + I2[i];
  } }
(a)
void program2(
    int A[], int B[], int C[]){
  int i, I[500];
  T0: C[0] = A[0] + B[0] + 4;
  #pragma omp parallel for
  for (i=1; i<N; i++) {
    T1: I[i] = A[i] + B[i];
  }
  #pragma omp parallel for
  for (i=1; i<N; i++) {
    T2: C[i] = I[i] * C[i-1];
  } }
(b)
Fig. 1. (a) Original sequential program. (b) Transformed parallel program.
The method developed in [63]–[66] assesses a restricted
class of programs with affine indices and bounds, static
control flow, single assignment form and valid schedule.
This method presents a translation validation algorithm for
verifying loop transformations, where the source and the
transformed programs are modeled as Array Data Dependence
Graphs (ADDGs). This method is promising since it is capable
of handling most loop transformations without requiring any
additional information from the compilers (or human expert).
The primary limitations of this ADDG based verification
technique are its inability to handle recurrences and arithmetic
transformations. The ADDG based method is extended in two
directions initially. In one direction, Verdoolaege et al. in [67],
[68] enhanced the method to handle recurrences in programs.
In another direction, Karfa et al. [69], [70] enhanced it to
handle arithmetic transformations.
Specifically, in [67], [68], the authors have modelled the
programs as dependence graphs (DGs). Before delving into the
details, note that a recurrence involves induction case(s), e.g.
statement T2 in Figure 1(b), whereby the members of an array
are defined in terms of previously defined members of itself,
and basis case(s), e.g. statement T0 in Figure 1(b). Typically,
proving the basis cases in the DGs obtained from the original
and the transformed programs is straightforward involving
symbolic substitution of the temporary arrays by input arrays.
However, proving equivalence between the induction cases can
be convoluted and hence this verification procedure takes an
optimistic approach and initially considers that the induction
cases in the two DGs are possibly equivalent and proceeds in
a forward pass (using an operation called widening); the proof
obligations that remain pending after the forward pass are
then resolved, where possible, during a backward pass (using an inverse
operation called narrowing). This two-pass approach adopted
in [67], [68] is found to be effective in handling a wide range
of programs containing recurrences. In fact, this method has
been shown to be successful in verifying realistic multimedia
systems in [71].
In the other direction, a slice-level equivalence of ADDGs is
proposed in [69], [70] (as opposed to the path-level equivalence
of [65], [66]) to handle arithmetic transformations such as
constant unfolding, common sub-expression elimination, dis-
tributive transformations, arithmetic expression simplification,
etc., along with loop transformations. This method additionally
incorporates a normalization technique [72] extended suitably
to represent data transformations. It has also been adopted in
checking correctness of transformations on the Kahn process
networks (KPNs) for multimedia and signal processing appli-
cations [73]. Handling recurrences, however, remains as the
main limitation of this technique. Recurrences create cycles
in the ADDG representation of a program. In the presence of
loops in an ADDG, obtaining the closed-form representations
of the data dependence is hard. The work reported in [74],
for the first time, proposes a unified method to verify loop
and arithmetic transformation as well as recurrence. In [74],
the verifier first identifies cyclic subgraphs in the ADDGs
(arising from recurrences), then for each cyclic subgraph in the
original ADDG, it tries to find an equivalent cyclic subgraph
in the transformed ADDG; if one-to-one correspondence is
found for all the cyclic subgraphs in the two ADDGs, then
the entire ADDGs, with the cyclic subgraphs replaced by
some uninterpreted functions (and thus reduced to cycle-free
ADDGs), are compared in an identical way as mentioned
in [70].
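The first step described above, identifying which arrays take part in a cycle, can be sketched with an ordinary depth-first search. The graph representation below is a hypothetical simplification (one node per array, one edge per "is defined in terms of" relation) and not the ADDG structure of [74], which also carries operators and dependence mappings.

```c
#include <stddef.h>

enum { MAX_NODES = 8 };

/* Hypothetical adjacency-matrix graph: edge[u][v] != 0 means that
   array u is defined in terms of array v. */
typedef struct {
    size_t n;
    int edge[MAX_NODES][MAX_NODES];
} DepGraph;

/* Depth-first search: can `target` be reached from `u` over one or more edges? */
static int reaches(const DepGraph *g, size_t u, size_t target, int seen[]) {
    for (size_t v = 0; v < g->n; v++) {
        if (!g->edge[u][v]) continue;
        if (v == target) return 1;
        if (!seen[v]) {
            seen[v] = 1;
            if (reaches(g, v, target, seen)) return 1;
        }
    }
    return 0;
}

/* A node lies on a cycle when it can reach itself through at least one edge. */
int on_cycle(const DepGraph *g, size_t node) {
    int seen[MAX_NODES] = {0};
    return reaches(g, node, node, seen);
}
```

For Figure 1(b), with nodes C, I, A and B and edges C to I, C to C, I to A and I to B, only C is reported as lying on a cycle, which matches the recurrence in statement T2.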
None of the above methods, however, handles parallelizing
transformations or loop vectorization. An early work [75]
proposes a static checker for analysis of a number of thread
synchronization issues. Another work [76] proposes a sim-
ilar approach for synthesizing synchronization implementa-
tion from a high level specification. However, the methods
described in [75], [76] do not explicitly target loop transfor-
mations and cannot handle arithmetic transformations at all.
The method of [74] was extended in [77] to handle data races
which may arise on introducing parallelizing transformations.
In [77], the authors model the programs as coloured-ADDGs.
Specifically, nodes with different colours are maintained to
capture the different parallelized regions of the program and
if some thread originating in some coloured region of the
program is found to be able to enter a differently coloured re-
gion of the program (signifying that no synchronization barrier
exists between these two regions), then the method declares
the parallelized program unsafe because a data race exists in
that program. In [78], a method is proposed to detect whether a loop
can be parallelized. The authors use separation
logic for this purpose. In addition, the method identifies the
synchronization points required for parallel programs.
Dutta et al. [79], [80] have extended the work of Verdoolaege
et al. in [67], [68] to verify loop vectorization and
loop parallelization transformations. Specifically, they have
shown how to construct a DG from the loop vectorized and the
loop parallelized programs. The proposed work then applies
the DG based method proposed in [67], [68] to check the
equivalence. They have also enhanced the method proposed
in [67], [68] to handle loop collapsing transformation.
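To give a feel for why parallelized loops need such checks, the sketch below implements a much simpler test than any of the methods above: a dependence-distance check for a loop whose body writes X[i] and reads X[i - dist] of the same array. The struct and function names are hypothetical, and the check covers only this single access pattern.

```c
/* Hypothetical description of a loop  for (i = lo; i < hi; i++)
   whose body writes X[i] and reads X[i - dist] of the same array. */
typedef struct {
    long lo, hi;  /* iteration range [lo, hi) */
    long dist;    /* read offset: the body reads X[i - dist] */
} SelfRefLoop;

/* Returns 1 if some iteration reads an element that a different
   iteration of the same loop writes, i.e. the dependence is loop-carried. */
int has_carried_dependence(SelfRefLoop l) {
    long trip = l.hi - l.lo;
    long d = l.dist < 0 ? -l.dist : l.dist;
    return d != 0 && d < trip;
}
```

With lo = 1, hi = 8 and dist = 1, which is the access pattern of statement T2 in Figure 1(b), the function reports a loop-carried dependence; with dist = 0 it reports none.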
IV. CONCLUSION
Table I lists the pros and cons of several key works
mentioned here. It is worth noting that while supporting
as many transformations as possible is a desirable property for a
method, needing hints from the compiler to do so is typically
considered undesirable because it requires explicit instrumentation
for each compiler (synthesis tool); moreover,
a human expert may not always methodically document the intuition
behind applying a transformation. Although [74]
tries to bridge the gap between handling recurrences and
arithmetic transformations, the treatment of recurrence in [67],
[68] is more robust, for example, the method of [68] can
additionally handle cases of mutual recurrence (also known
as, co-induction). The arithmetic transformations are elegantly
handled in [69], [70]. Therefore, it would be an interesting
future work to apply the normalization techniques proposed
in [69], [70] on the DG based method proposed in [67],
[68] to handle both recurrence and arithmetic transformations
elegantly. In addition, the work presented in [79], [80] can
be used prior to this method to handle loop vectorization and
loop parallelization transformations. We further envision that
the verification techniques discussed here can also be effective in
verifying transformations applied by compilers targeting deep
learning applications.
REFERENCES
[1] “OpenMP,” https://www.openmp.org/, [Online accessed: 5-Feb-2019].
[2] “OpenACC,” https://www.openacc.org/, [Online accessed: 5-Feb-2019].
[3] “Message Passing Interface (MPI),”
https://computing.llnl.gov/tutorials/mpi/ , [Online accessed: 5-Feb-
2019].
[4] “OpenCL Overview,” https://www.khronos.org/opencl/, [Online ac-
cessed: 5-Feb-2019].
[5] W. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. A.
Padua, P. Petersen, W. M. Pottenger, L. Rauchwerger, P. Tu, and
S. Weatherford, “Polaris: Improving the effectiveness of parallelizing
compilers,” in LCPC, 1994, pp. 141–154.
[6] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, “A
practical automatic polyhedral parallelizer and locality optimizer,” in
PLDI, 2008, pp. 101–113.
[7] S. M. A. H. Jafri, T. N. Gia, S. Dytckov, M. Daneshtalab, A. Hemani,
J. Plosila, and H. Tenhunen, “Neurocgra: A CGRA with support for
neural networks,” in HPCS, 2014, pp. 506–511.
[8] M. Tanomoto, S. Takamaeda-Yamazaki, J. Yao, and Y. Nakashima, “A
CGRA-based approach for accelerating convolutional neural networks,”
in MCSoC, 2015, pp. 73–80.
[9] Y. Ma, Y. Cao, S. B. K. Vrudhula, and J. Seo, “Optimizing the
convolution operation to accelerate deep neural networks on FPGA,”
IEEE Trans. VLSI Syst., vol. 26, no. 7, pp. 1354–1367, 2018.
[10] A. Samajdar, Y. Zhu, P. N. Whatmough, M. Mattina, and T. Krishna,
“SCALE-Sim: Systolic CNN accelerator,” CoRR, vol. abs/1811.02883,
2018. [Online]. Available: http://arxiv.org/abs/1811.02883
[11] “XLA Overview,” https://www.tensorflow.org/xla/overview, [Online ac-
cessed: 6-Feb-2019].
[12] “NNVM compiler: Open compiler for AI frameworks,”
https://tvm.ai/2017/10/06/nnvm-compiler-announcement.html, [Online
accessed: 6-Feb-2019].
[13] N. Rotem, J. Fix, S. Abdulrasool, S. Deng, R. Dzhabarov, J. Hegeman,
R. Levenstein, B. Maher, N. Satish, J. Olesen, J. Park, A. Rakhov,
and M. Smelyanskiy, “Glow: Graph lowering compiler techniques
for neural networks,” CoRR, vol. abs/1805.00907, 2018. [Online].
Available: http://arxiv.org/abs/1805.00907
[14] H. Rong, “Programmatic control of a compiler for generating high-
performance spatial hardware,” CoRR, vol. abs/1711.07606, 2017.
[Online]. Available: http://arxiv.org/abs/1711.07606
[15] R. Wei, V. S. Adve, and L. Schwartz, “DLVM: A modern compiler
infrastructure for deep learning systems,” CoRR, vol. abs/1711.03016,
2017. [Online]. Available: http://arxiv.org/abs/1711.03016
[16] “The LLVM compiler infrastructure,” http://llvm.org/.
[17] M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and W. Ye, “Influence of
compiler optimizations on system power,” IEEE Trans. Very Large Scale
Integr. Syst., vol. 9, pp. 801–804, 2001.
[18] M. Kandemir, S. W. Son, and G. Chen, “An evaluation of code and data
optimizations in the context of disk power reduction,” in ISLPED, 2005,
pp. 209–214.
[19] M. T. Kandemir, “Reducing energy consumption of multiprocessor SoC
architectures by exploiting memory bank locality,” ACM Trans. Des.
Autom. Electron. Syst., vol. 11, no. 2, pp. 410–441, 2006.
[20] L. Xue, O. Ozturk, and M. Kandemir, “A memory-conscious code
parallelization scheme,” in DAC, 2007, pp. 230–233.
[21] I. Kadayif and M. T. Kandemir, “Data space-oriented tiling for enhanc-
ing locality,” ACM Trans. Embedded Comput. Syst., vol. 4, no. 2, pp.
388–414, 2005.
[22] F. Li and M. T. Kandemir, “Locality-conscious workload assignment for
array-based computations in MPSOC architectures,” in DAC, 2005, pp.
95–100.
[23] G. Chen, M. Kandemir, and F. Li, “Energy-aware computation dupli-
cation for improving reliability in embedded chip multiprocessors,” in
ASP-DAC, 2006, pp. 134–139.
[24] Y. Bouchebaba, B. Girodias, G. Nicolescu, E. M. Aboulhamid, B. Lav-
igueur, and P. G. Paulin, “MPSoC memory optimization using program
transformation,” ACM Trans. Design Autom. Electr. Syst., vol. 12, no. 4,
2007.
[25] C. Zhang and F. Kurdahi, “Reducing off-chip memory access via stream-
conscious tiling on multimedia applications,” Int. J. Parallel Program.,
vol. 35, no. 1, pp. 63–98, 2007.
[26] J. Cong, P. Zhang, and Y. Zou, “Combined loop transformation and
hierarchy allocation for data reuse optimization,” in ICCAD, 2011, pp.
185–192.
[27] ——, “Optimizing memory hierarchy allocation with loop transforma-
tions for high-level synthesis,” in DAC, 2012, pp. 1233–1238.
[28] S. Tavarageri, L. Pouchet, J. Ramanujam, A. Rountev, and P. Sadayap-
pan, “Dynamic selection of tile sizes,” in HiPC, 2011, pp. 1–10.
[29] S. Tavarageri, J. Ramanujam, and P. Sadayappan, “Adaptive parallel tiled
code generation and accelerated auto-tuning,” IJHPCA, vol. 27, no. 4,
pp. 412–425, 2013.
[30] M. Karakoy, “Optimizing array-intensive applications for on-chip mul-
tiprocessors,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 5, pp.
396–411, 2005.
[31] M. Qiu, E. H.-M. Sha, M. Liu, M. Lin, S. Hua, and L. T. Yang, “Energy
minimization with loop fusion and multi-functional-unit scheduling for
multidimensional DSP,” J. Parallel Distrib. Comput., vol. 68, no. 4, pp.
443–455, 2008.
[32] M. Ghodrat, T. Givargis, and A. Nicolau, “Optimizing control flow in
loops using interval and dependence analysis,” Design Automation for
Embedded Systems, vol. 13, pp. 193–221, 2009.
[33] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer,
C. Kulkarni, A. Vandercappelle, and P. G. Kjeldsberg, “Data and
memory optimization techniques for embedded systems,” ACM Trans.
Des. Autom. Electron. Syst., vol. 6, pp. 149–206, April 2001.
[34] D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler transformations
for high-performance computing,” ACM Comput. Surv., vol. 26, pp. 345–
420, 1994.
[35] F. Catthoor, E. D., and S. S. Greff, HICSS. Custom Memory Manage-
ment Methodology: Exploration of Memory Organisation for Embedded
Multimedia System Design. Kluwer Academic Publishers, 1998.
[36] M. Palkovic, F. Catthoor, and H. Corporaal, “Trade-offs in loop trans-
formations,” ACM Trans. Design Autom. Electr. Syst., vol. 14, no. 2,
2009.
[37] A. Fraboulet, K. Kodary, and A. Mignotte, “Loop fusion for memory
space optimization,” in ISSS, 2001, pp. 95–100.
TABLE I
A COMPARISON AMONG DIFFERENT METHODS BASED ON TRANSFORMATIONS SUPPORTED
Lit         Need hint   Loop   Recur   Arith    Vector   Parallel
[56]        ✓           ✓      ×       ×        ×        ×
[57]        ✓           ✓      ×       ×        ×        ×
[66]        ×           ✓      ×       ×        ×        ×
[59]–[61]   ×           ✓      ×       Linear   ×        ×
[81]        ×           ✓      ✓       ×        ×        ×
[70]        ×           ✓      ×       ✓        ×        ×
[74]        ×           ✓      ✓       ✓        ×        ×
[77]        ×           ✓      ✓       ✓        ×        ✓
[80]        ×           ✓      ✓       ×        ✓        ✓
[38] D. Liu, S. Yin, L. Liu, and S. Wei, “Polyhedral model based mapping
optimization of loop nests for CGRAs,” in DAC, 2013, pp. 19:1–19:8.
[39] S. Prema, R. Jehadeesan, B. Panigrahi, and S. Satya Murty, “Dependency
analysis and loop transformation characteristics of auto-parallelizers,” in
Parallel Computing Technologies, 2015, pp. 1–6.
[40] H. Yviquel, A. Sanchez, P. Jääskeläinen, J. Takala, M. Raulet, and
E. Casseau, “Embedded multi-core systems dedicated to dynamic
dataflow programs,” Signal Processing Systems, vol. 80, no. 1, pp. 121–
136, 2015.
[41] T. Šimunić, L. Benini, G. De Micheli, and M. Hans, “Source code
optimization and profiling of energy consumption in embedded systems,”
in International Symposium on Systems Synthesis, 2000, pp. 193–198.
[42] Y. Zhu, G. Magklis, M. L. Scott, C. Ding, and D. H. Albonesi, “The
energy impact of aggressive loop fusion,” in PACT, 2004, pp. 153–164.
[43] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto, “Analysis and
modeling of energy reducing source code transformations,” in DATE,
2004, pp. 306–311.
[44] M. Ghodrat, T. Givargis, and A. Nicolau, “Control flow optimization in
loops using interval analysis,” in Proceedings of the 2008 international
conference on Compilers, architectures and synthesis for embedded
systems, 2008, pp. 157–166.
[45] H. Falk and P. Marwedel, Source code optimization techniques for data
flow dominated embedded software. Kluwer, 2004.
[46] M. Potkonjak, S. Dey, Z. Iqbal, and A. C. Parker, “High performance
embedded system optimization using algebraic and generalized retiming
techniques,” in ICCD, 1993, pp. 498–504.
[47] J. Zory and F. Coelho, “Using algebraic transformations to optimize
expression evaluation in scientific codes,” in PACT, 1998, pp. 376–384.
[48] S. Gupta, R. K. Gupta, M. Miranda, and F. Catthoor, “Analysis of high-
level address code transformations for programmable processors,” in
DATE, 2000, pp. 9–13.
[49] B. Landwehr and P. Marwedel, “A new optimization technique for
improving resource exploitation and critical path minimization,” in ISSS,
1997, pp. 65–72.
[50] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau, “SPARK: A high-level
synthesis framework for applying parallelizing compiler transformations,”
in VLSI Design, 2003, pp. 461–466.
[51] S. Gupta, R. Gupta, N. Dutt, and A. Nicolau, “Coordinated parallelizing
compiler optimizations and high-level synthesis,” ACM Trans. on Design
Autom. of Electr. Syst., vol. 9, no. 4, pp. 1–31, October 2004.
[52] A. P. N. E. Özer and D. Gregg, “Classification of compiler optimizations
for high performance, small area and low power in FPGAs,” Trinity
College, Dublin, Ireland, Department of Computer Science, Tech. Rep.,
2003.
[53] L. Zuck, A. Pnueli, Y. Fang, and B. Goldberg, “VOC: A translation
validator for optimizing compilers,” Journal of Universal Computer
Science, vol. 9, no. 3, pp. 223–247, 2003.
[54] L. D. Zuck, A. Pnueli, B. Goldberg, C. W. Barrett, Y. Fang, and Y. Hu,
“Translation and run-time validation of loop transformations,” Formal
Methods in System Design, vol. 27, no. 3, pp. 335–360, 2005.
[55] Y. Hu, C. W. Barrett, B. Goldberg, and A. Pnueli, “Validating more loop
optimizations,” Electr. Notes Theor. Comput. Sci., vol. 141, no. 2, pp.
69–84, 2005.
[56] C. W. Barrett, Y. Fang, B. Goldberg, Y. Hu, A. Pnueli, and L. D. Zuck,
“TVOC: A translation validator for optimizing compilers,” in CAV, 2005,
pp. 291–295.
[57] V. Menon, K. Pingali, and N. Mateev, “Fractal symbolic analysis,” ACM
Trans. Program. Lang. Syst., vol. 25, no. 6, pp. 776–813, 2003.
[58] H. Samsom, F. Franssen, F. Catthoor, and H. De Man, “System level
verification of video and image processing specifications,” in International
Symposium on Systems Synthesis, 1995, pp. 144–149.
[59] “CVC4 - the SMT solver,” http://cvc4.cs.nyu.edu/web/.
[60] “The Yices SMT Solver,” http://yices.csl.sri.com/.
[61] “ACL2 Version 6.1,” http://www.cs.utexas.edu/~moore/acl2/.
[62] C. Karfa, K. Banerjee, D. Sarkar, and C. Mandal, “Experimentation with
SMT solvers and theorem provers for verification of loop and arithmetic
transformations,” in I-CARE, 2013, pp. 3:1–3:4.
[63] K. C. Shashidhar, M. Bruynooghe, F. Catthoor, and G. Janssens,
“Geometric model checking: An automatic verification technique for loop and
data reuse transformations,” Electronic Notes in Theoretical Computer
Science, vol. 65, no. 2, pp. 71–86, 2002.
[64] ——, “Verification of source code transformations by program
equivalence checking,” in Compiler Construction, 2005, pp. 221–236.
[65] ——, “Functional equivalence checking for verification of algebraic
transformations on array-intensive source code,” in DATE, 2005, pp.
1310–1315.
[66] K. C. Shashidhar, “Efficient automatic verification of loop and data-flow
transformations by functional equivalence checking,” Ph.D. dissertation,
Katholieke Universiteit Leuven, 2008.
[67] S. Verdoolaege, G. Janssens, and M. Bruynooghe, “Equivalence
checking of static affine programs using widening to handle recurrences,” in
CAV, 2009, pp. 599–613.
[68] ——, “Equivalence checking of static affine programs using widening
to handle recurrences,” ACM Trans. Program. Lang. Syst., vol. 34, no. 3,
2012.
[69] C. Karfa, K. Banerjee, D. Sarkar, and C. Mandal, “Equivalence checking
of array-intensive programs,” in ISVLSI, 2011, pp. 156–161.
[70] ——, “Verification of loop and arithmetic transformations of array-
intensive behaviours,” IEEE Trans. on CAD of ICS, vol. 32, no. 11,
pp. 1787–1800, 2013.
[71] S. Verdoolaege, M. Palkovic, M. Bruynooghe, G. Janssens, and
F. Catthoor, “Experience with widening based equivalence checking in
realistic multimedia systems,” J. Electronic Testing, vol. 26, no. 2, pp.
279–292, 2010.
[72] D. Sarkar and S. De Sarkar, “A theorem prover for verifying iterative
programs over integers,” IEEE Trans. Software Engg., vol. 15, no. 12,
pp. 1550–1566, 1989.
[73] C. Karfa, D. Sarkar, and C. A. Mandal, “Verification of KPN level
transformations,” in VLSI Design, 2013, pp. 338–343.
[74] K. Banerjee, C. Mandal, and D. Sarkar, “Translation validation of
loop and arithmetic transformations in the presence of recurrences,” in
LCTES, 2016, pp. 31–40.
[75] C. Flanagan, S. N. Freund, and S. Qadeer, “Thread-Modular Verification
for Shared-Memory Programs,” Programming Languages and Systems,
LNCS, vol. 2305, pp. 262–277, 2002.
[76] X. Deng, M. B. Dwyer, J. Hatcliff, and M. Mizuno, “Invariant-based
Specification, Synthesis, and Verification of Synchronization in
Concurrent Programs,” in ICSE, 2002, pp. 442–452.
[77] K. Banerjee, S. Banerjee, and S. Sarkar, “Data-race detection: the
missing piece for an end-to-end semantic equivalence checker for
parallelizing transformations of array-intensive programs,” in ARRAY@PLDI,
2016, pp. 1–8.
[78] S. Blom, S. Darabi, and M. Huisman, “Verification of loop
parallelisations,” in FASE, 2015, pp. 202–217.
[79] S. Dutta, D. Sarkar, A. Rawat, and K. Singh, “Validation of loop
parallelization and loop vectorization transformations,” in ENASE, 2016,
pp. 195–202.
[80] S. Dutta, “Validation of parallelizing transformations of sequential
programs,” Concurrency and Computation: Practice and Experience,
vol. 29, no. 8, 2017.
[81] S. Verdoolaege, G. Janssens, and M. Bruynooghe, “Equivalence
checking of static affine programs using widening to handle recurrences,”
ACM TOPLAS, vol. 34, no. 3, p. 11, 2012.
