Systematic Design Methods for Efficient Off-Chip DRAM Access by Bayliss, Samuel
Imperial College London
Department of Electrical and Electronic Engineering




Supervised by George A. Constantinides
Submitted in part fulfilment of the requirements for the degree of
Doctor of Philosophy in Electrical and Electronic Engineering of Imperial College
London
and the Diploma of Imperial College London
The copyright of this thesis rests with the author and is made available under
a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers
are free to copy, distribute or transmit the thesis on the condition that
they attribute it, that they do not use it for commercial purposes and that they do
not alter, transform or build upon it. For any reuse or redistribution, researchers
must make clear to others the licence terms of this work.
Abstract
Typical design flows for digital hardware take, as their input, an abstract description
of computation and data transfer between logical memories. No existing commercial
high-level synthesis tool demonstrates the ability to map logical memory inferred from
a high level language to external memory resources. This thesis develops techniques for
doing this, specifically targeting oﬀ-chip dynamic memory (DRAM) devices. These are
a commodity technology in widespread use with standardised interfaces. In use, the
bandwidth of an external memory interface and the latency of memory requests asserted
on it may become the bottleneck limiting the performance of a hardware design. Careful
consideration of this is especially important when designing with DRAMs, whose latency
and bandwidth characteristics depend upon the sequence of memory requests issued by
a controller.
Throughout the work presented here, we pursue exact compile-time methods for designing
application-specific memory systems with a focus on guaranteeing predictable performance
through static analysis. This contrasts with much of the surveyed existing work,
which considers general purpose memory controllers and optimized policies which improve
performance in experiments run using simulation of suites of benchmark codes.
The work targets loop-nests within imperative source code, extracting a mathematical
representation of the loop-nest statements and their associated memory accesses, referred
to as the ‘Polytope Model’. We extend this mathematical representation to represent the
physical DRAM ‘row’ and ‘column’ structures accessed when performing memory transfers.
From this augmented representation, we can automatically derive DRAM controllers
which buffer data in on-chip memory and transfer data in an efficient order. Buffering
data and exploiting ‘reuse’ of data is shown to enable up to a 50× reduction in the quantity
of data transferred to external memory. Reordering memory transactions to exploit
knowledge of the physical layout of the DRAM device allows up to a 4× improvement in
the efficiency of those data transfers.
Acknowledgements
This work was funded through a Doctoral Training Award from the UK Engineering and
Physical Sciences Research Council. I’d like to thank my supervisor George Constan-
tinides for his guidance and encouragement. Throughout my PhD he steadfastly made
time to steer my research with sharp insight and helped marshal my thoughts with kindness
and patience. I’ve greatly enjoyed working together and could not have wished for
a better role model in embarking upon my own academic career.
The loving support of my girlfriend Nicole made this thesis possible. I’m also immea-
surably grateful for the care and advice given by my parents, my two sisters, Naomi and
Hephzibah, and my long-suﬀering flatmate Simon. I dedicate this work to you all.
I’d like to thank all my colleagues in the Circuits and Systems lab, especially Joshua
Levine, James Mardell, David Boland, Ed Stott and Alastair Smith for making the work-





1.1. Motivation
1.2. Contributions
1.3. Overview
1.4. Statement of Originality
2. DRAM Memory Fundamentals
2.1. DRAM Characteristics
2.2. Dynamic RAM Device Operation
2.3. Bounds on Bandwidth/Latency
2.3.1. Simple Controller: Worst Case Bandwidth/Latency
2.3.2. Complex Controller: Best Case Bandwidth/Latency
2.4. Historical Scaling Trends
2.4.1. Scaling Trends in Dynamic Memory Parameters
2.4.2. Scaling Trends in Silicon Package Pin Density
2.5. Existing Approaches
2.5.1. Memory Controllers with Dynamic Command Queues
2.5.2. Memory Controllers Designed using Static Analysis
2.6. Scope of Thesis
3. Modelling Memory Accesses using the Polytope Model
3.1. Introducing Polytopes for Program Analysis
Contents
3.2. Static Control Programs
3.2.1. Analysis of Data Dependencies in Static Control Programs
3.2.2. Transformation of Polytopes in Static Control Programs
3.2.3. Code Generation from Polytopes
3.3. Counting Integer Points in Polytopes
3.3.1. General Approach
3.3.2. Counting Integer Points in Cones
3.3.3. Evaluating Generating Functions
3.4. Counting Integer Points in Parametric Polytopes
3.5. Discussion
4. Design of Parametric DRAM Controllers using the Polytope Model
4.1. Introduction
4.2. Decoupling Memory Access from Execution using On-Chip Memory
4.3. Designing High-Performance Hardware for Polytope Scanning
4.4. Reordering Inner Loop Memory Accesses Using Strictly Monotonic Memory Scheduling
4.4.1. Motivating Example
4.4.2. Parametric Integer Linear Programming Formulation
4.4.3. A Formulation for Finding a Strictly Monotonic Address Sequencing Function
4.4.4. Inner Loop Hardware Implementation
4.4.5. Results and Discussion
4.5. Reordering Inner-Loop Memory Accesses using Variable Elimination Techniques
4.5.1. Motivating Example
4.5.2. Overview of Methodology
4.5.3. Methodology
4.5.4. Results
4.6. Comparison Between Proposed Memory Scheduling Techniques
4.7. Summary
5. Predictable Memory Access Scheduling using Integer Point Counting Techniques
5.1. Introduction
5.2. Motivational Example
5.3. Problem Description
5.3.1. Initial Transformation
5.3.2. Scattering Functions for Sequential Operation Ordering
5.3.3. Representing Row Activation Delays as Additional Statements
5.4. Defining Parametric Subsets of Memory Operations
5.5. Representing Exact Cycle Scheduling using Parametric Sets
5.6. Finding Bounds for Quasi-Polynomial Expressions
5.6.1. Bernstein Decomposition over an Interval
5.6.2. Bernstein Decomposition over a Convex Polytope
5.6.3. Overlapping Data Transfer and Computation
5.7. Compile Time Evaluation of Overlapped Task Execution Time
5.8. Results
5.9. Conclusion
6. Conclusion
6.1. Summary of Key Thesis Achievements
6.2. Suggested Future Research Directions
6.2.1. Exploration of Scalable Bounding Methods for Quasi-Polynomials
6.2.2. Exploration of Deterministic DRAM Refresh Options
6.2.3. Derive SDRAM Bank Partitioning Scheme from Memory Access Schedule
6.2.4. Integrating NAND Flash Memory into Design Flow




A. Code Listings for Benchmark Examples
A.1. Matrix-Matrix Multiply
A.2. Sobel Edge Detection
A.3. Gaussian Back-Substitution
A.4. Blocked Gaussian Back-Substitution
B. Evaluating Generating Functions
List of Figures
2.1. DRAM architecture showing eight DRAM bit cells (1-T).
2.2. Sense amplifier structure.
2.3. Precharge circuit structure.
2.4. A sequence of commands with worst-case bandwidth utilization.
2.5. A sequence of commands which maximizes bandwidth utilization.
2.6. Plot showing the historical trend in the ratio of Trc to Tck for successive DRAM technologies.
3.1. Simple code example showing enumeration of statements within a single basic block.
3.2. Simple Loop.
3.3. Three Level Loop Nest.
3.4. Non-Rectangular Loop Nest.
3.5. Non-Rectangular Loop Nest with Array Access.
3.6. Parallel Schedule for Loop Nest shown in Figure 3.5.
3.7. Two Data Dependence Examples. (a) has no dependencies and all statements may execute in parallel. (b) has a loop carried dependency which forces sequential execution of all statements.
3.8. Number line representing P = [0 : 3] ∩ Z.
3.9. Number line representing P = [0 : 3] ∩ Z and a decomposition into two series.
3.10. Representation showing supporting cones on number line.
3.11. Representation of two-dimensional supporting cones.
3.12. Representation of (a) two-dimensional cone and (b) Fundamental Parallelepiped within that cone.
3.13. Chamber decomposition of Example 3.4.1.
4.1. Two level nested loop example.
4.2. Example showing three alternative parameterizations of Figure 4.1.
4.3. Execution schedule showing communication and execution under three different parameterizations.
4.4. Single Unit.
4.5. Affine Unit.
4.6. Multiple Unit.
4.7. Example showing (a) source code and (b) Output code for solution to example (a).
4.8. Mapping from (i, j) iteration space to memory addresses for code in Figure 4.7(a).
4.9. Inner Loop Design.
4.10. Next Block Design.
4.11. Resource Usage versus Memory Access Time for (a) Matrix Multiply benchmark (continued on next page).
4.11. (continued from previous page) Resource Usage versus Memory Access Time for (b) Sobel Edge Detection and (c) Gaussian Backsubstitution benchmarks.
4.12. Figure showing SDRAM bandwidth allocation by command type within a single reference in the Gaussian Backsubstitution benchmark.
4.13. C source code for 3-level nested loop example.
4.14. Memory address and associated fields.
4.15. Flowchart showing steps in methodology.
4.16. C source code for 2-level nested loop example.
4.17. Finding a unimodular matrix which maximises the number of eliminable columns.
4.18. Transformed source code for memory accesses in example code from Figure 4.13.
4.19. SDRAM Memory Interface Utilization: Breakdown by Command Type.
4.20. Pareto-optimal fronts showing designs parameterised at different levels.
5.1. C source code for 3-level nested loop example.
5.2. C source code for transformed memory access thread.
5.3. C source code for transformed datapath thread.
5.4. Concurrent operations in the memory access and datapath threads.
5.5. C source code for transformed memory access thread before ‘row delay’ statements are inserted.
5.6. C source code for transformed memory access thread after ‘row delay’ statements are inserted.
5.7. Fifth Degree Bernstein Polynomial from [1] showing convex hull property.
B.1. Example polyhedron showing seven integer points.
List of Tables
2.1. Cypress SRAM Characteristics.
2.2. Micron DRAM Characteristics.
2.3. Table showing typical constraints within a DDR2 SDRAM device.
2.4. Scaling Trends in Trc for commodity DRAM technologies.
2.5. Scaling Trends in Pin Count for High Performance Chip and Commodity DRAM technologies.
4.1. Legend describing the affine functions referenced in Figure 4.10.
4.2. Resource usage and maximum frequency comparison between inferred divider circuits and optimized scalar division circuits.
4.3. Table of Sequence Characteristics.
4.4. Table of Hardware Characteristics.
4.5. Table showing Total Memory Access Time and Command Breakdown.
4.6. Sequence of memory accesses generated by example in Figure 4.13.
4.7. Simulation results for benchmark codes.
4.8. Synthesis results for benchmark codes.
4.9. Tool Runtime.
5.1. Sequence of memory accesses generated by memory access thread in Figure 5.2.
5.2. Sequence of memory accesses generated by datapath thread in Figure 5.3.
5.3. Table of Symbols used in this Chapter.
5.4. Useful memory operation subsets.
5.5. Results showing impact of overlapping in datapath and execution thread (Each datapath iteration = 1 cycle).
5.6. Results showing impact of overlapping in datapath and execution thread (Each datapath iteration = 4 cycles).
5.7. Results showing impact of overlapping in memory access and datapath thread (Each datapath iteration = 8 cycles).




Alongside a functional specification for a target application, all digital hardware designers
face the challenge of ensuring their implementations meet finite timing constraints and
operate within a specified power budget. Electronic Design Automation (EDA) tools
help designers evaluate diﬀerent, functionally equivalent designs and optimize over
many (often-conflicting) objectives. The process of such design-space exploration can be
greatly simplified where models enable the prediction of the quality of results (achievable
clock frequency, necessary logic area or required power) without exhaustive simulation.
Many transformation techniques allow the trade-off of parallel execution and sequential
execution. In one approach, digital hardware designers may find and exploit parallelism
in the datapath of their application. This has the impact of increasing the silicon area
of their implementation, whilst reducing overall task runtime. Equally, there are
transformations which, when applied, increase overall runtime in order to reduce silicon area
cost. One example is a transformation which introduces resource-sharing into logic
circuits. When appropriately applied, these transformations can reduce silicon area at
the cost of decreased datapath throughput. It is diﬃcult to judge the overall eﬀectiveness
of datapath transformations without first taking into consideration the limitations of the
interconnect which supplies it with data.
In many computational kernels, it is necessary to fetch input data from, and write out-
put data to oﬀ-chip memory. Many conventional approaches used in memory systems
to achieve high memory performance hinder eﬀective calculation, at design time, of the
achievable data throughput. For instance, cache memories make use of statistical prop-
erties of the data access sequence and where there is spatial and/or temporal locality in
CHAPTER 1. INTRODUCTION
the targeted sequence of addresses, the average data throughput is improved. Equally,
DRAM (dynamic random access memory) controllers have variable throughput depending
upon the order of memory accesses presented to them. The dynamic behaviour
of cache memories and the variable throughput of conventional SDRAM (synchronous
dynamic random access memory) controllers make it difficult to model interconnect behaviour,
which in turn hinders the design of efficient datapath circuits. This can lead to inefficient
designs where the datapath sits idle waiting for data.
For some applications where hard real-time deadlines must be met, the only way to
guarantee functional requirements are met is to assume worst-case interconnect behaviour
and design a logic pipeline based on these assumptions. As we shall see in Section 2.3,
when SDRAM memory technology is used, the gap between best and worst case memory
performance estimates can exceed 20×. Designing for worst-case performance using
SDRAM memory may leave the memory interface underutilized, and make it diﬃcult
to achieve design objectives such as a required minimum sampling frequency for signal
processing, or a specified frame-rate in a video application.
This thesis oﬀers a way of statically evaluating the data required by an application
description and designing a custom memory controller to deliver efficient and predictable
behaviour. In moving towards this goal, the thesis demonstrates techniques which allow
decoupling of datapath and communication primitive operations in separate threads,
with communication through intermediate storage in on-chip memory. Through the
use of a high-level description of DRAM performance, we can describe transformations
which lead to the reordering of memory operations in the memory access thread. This
enables eﬃcient use of oﬀ-chip bandwidth and crucially, preserves predictable behaviour
so that we can provide memory bandwidth guarantees when developing hard real-time
applications.
The overall vision is that future high-level synthesis tools might be able to select whether
‘inferred’ memory accesses within a code-kernel are mapped to a location in on-chip
memory or oﬀ-chip DRAM memory. Access to external DRAM would be made through
a custom command sequencer integrated into the datapath which intelligently prefetches
data into on-chip buffers in an order which maximizes external bandwidth efficiency. This
would contrast with existing approaches which require hardware designers to explicitly
instantiate ‘DRAM Controller IP’ for external memory interfacing. The key benefit of
this would be that exact knowledge of the bandwidth and latency of oﬀ-chip memory
access could ultimately be incorporated into ‘resource allocation’, ‘resource scheduling’
and ‘resource binding’ stages in a high-level synthesis flow to maximize the resource
eﬃciency of realised designs, while providing compile-time performance guarantees.
This PhD thesis moves us towards this vision by providing a systematic framework for
representing the DRAM memory accesses within a code kernel, and novel static methods
for evaluating DRAM performance at compile time. These allow us in Chapter 4 to
demonstrate application-specific memory sequencers that exploit reuse of data in on-chip
memory buﬀers to reduce the data transferred on the external interface by up to 50× and
improve memory bandwidth eﬃciency by up to 4×.
1.2. Contributions
The main contributions of this thesis are:
• A novel method for optimizing memory access patterns specifically for eﬃcient
access to oﬀ-chip DRAM memories. This transformation can be applied to any
computational kernel which may be expressed using a pre-existing mathematical
framework: the ‘Polytope Model’.
• An automated method for deriving an eﬃcient application specific memory ad-
dress sequencer from input code that exploits reuse of data in on-chip memory and
reordering of memory requests to external memory to achieve high performance.
• A novel technique for overlapping concurrent ‘datapath’ operations with ‘memory
access’ operations that represents the dataflow requirements between two commu-
nicating threads as non-linear inequality constraints.
• A technique which, using these non-linear constraints, finds optimal solutions which
achieve maximal overlap of operations in concurrent ‘datapath’ and ‘memory access’
threads using contemporary polynomial bounding techniques.
• The use of integer point counting techniques to systematically determine, at compile
time, the runtime of a computational task including delays caused by DRAM access
timing constraints.
1.3. Overview
The contents of this thesis are arranged as follows. Chapter 2 summarises the problem of
providing fast and predictable external DRAM access and highlights existing work which
attempts to improve DRAM controller design both by dynamic reordering of memory
transactions and through the static analysis of sequences of code. The results from this
work motivate our search for more tractable analysis techniques for memory system
optimization which make use of the Polytope Model. Chapter 3 provides an introduction
to the Polytope Model providing background to the technical contributions introduced
in Chapter 4 and Chapter 5.
In Chapter 4, we use the mathematical abstraction provided by the Polytope Model to
describe the data accesses arising from code in nested loops as finite integer sets. We start
by showing how we can derive two threads from a nested-loop within original source
code: one dedicated to memory access and the other implementing a hardware datap-
ath. We then show how we can parametrically trade-oﬀ external memory bandwidth
for on-chip logic and memory resources and demonstrate two memory access schedul-
ing techniques for automatically producing application-specific parameterised memory
controllers. These controllers exploit data reuse and transaction reordering to maximise
DRAM bandwidth eﬃciency.
We build upon this work in Chapter 5, where we develop an analytical technique
for scheduling concurrent operations in ‘memory access’ and ‘datapath’ threads. Using
a method which links the memory access scheduling techniques from Chapter 4 with
the analysis of polynomial functions, we give certificates which prove that concurrent
scheduling of operations in the two threads preserves the intended program semantics
from the original source description. Timing information derived from this concurrent
scheduling allows us to determine the overall runtime of a compute task at compile time,
taking into account the delays introduced by the memory system. These guarantees
are extremely useful in certifying that tasks in real-time applications meet their hard
scheduling deadlines.
The thesis is concluded in Chapter 6 with a summary of its key points and contributions.
We additionally outline areas of future research which build upon the ideas and methods
originating within this work.
1.4. Statement of Originality
The work in this thesis is my own, except where it has been appropriately referenced
and attributed. The original contributions made in this thesis have been published as
peer-reviewed conference papers and journal articles in the following publications.
1. S. Bayliss and G. A. Constantinides, “Optimizing SDRAM Bandwidth for Cus-
tom FPGA Loop Accelerators,” in FPGA’12 : Proceedings of the 20th ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays (K. Compton and B. L.
Hutchings, eds.), pp. 195–204, ACM, 2012.
2. S. Bayliss and G. A. Constantinides, “Analytical Synthesis of Bandwidth-Eﬃcient
SDRAM Address Generators,” Microprocessors and Microsystems, 2012.
3. S. Bayliss and G. A. Constantinides, “Application specific memory access, reuse and
reordering for SDRAM,” in ARC’11 : Proceedings of the 7th International Symposium
on Applied Reconfigurable Computing (A. Koch, R. Krishnamurthy, J. McAllister,
R. Woods, and T. A. El-Ghazawi, eds.), vol. 6578 of Lecture Notes in Computer Science,
pp. 41–52, Springer, 2011.
4. S. Bayliss and G. A. Constantinides, “Methodology for Designing Statically Sched-
uled Application-Specific SDRAM Controllers using Constrained Local Search,” in
FPT’09 : Proceedings of the 2009 IEEE International Conference on Field-Programmable
Technology, (Sydney, Australia), pp. 304–307, IEEE, 2009.
2. DRAM Memory Fundamentals
Dynamic Random Access Memories (DRAMs) are used as the external oﬀ-chip memory
in most general purpose computers. They are designed for high yield and low cost
manufacturing and have densities which exceed competing technologies. However, to
lower costs, much of the burden of coordinating DRAM operations is shifted onto an
external memory controller rather than integrated into the DRAM device. This chapter
provides background on how DRAM devices are structured, the challenges in achieving
high performance with an SDRAM memory controller and opportunities for improving
performance through application-specific memory controller designs.
We begin in Section 2.1 with a general comparison of DRAM memory with other
memory technologies. Section 2.2 provides a focused description of how DRAM devices
are structured and how memory operations are sequenced. Through this, we develop
an explanation of how the internal structure of the DRAM device is responsible for
determining the timing parameters which must be met by an SDRAM controller to ensure
correct operation. In Section 2.3, we derive bounds on the best and worst case performance
of SDRAM memory, highlighting the device parameters which are essential to achieving
high performance. The historical trends in these parameters are described in Section 2.4
and help motivate our belief that research into the systematic design of SDRAM controllers
will have a lasting impact upon electronic design.
A review of the design of existing DRAM controllers follows in Section 2.5 alongside
a short survey of relevant academic work aimed at designing more eﬀective DRAM
controllers. We conclude the chapter in Section 2.6 with a problem statement, establishing
some key terms of reference for the key technical chapters of this thesis.
CHAPTER 2. DRAMMEMORY FUNDAMENTALS
Table 2.1.: Cypress SRAM Characteristics.
Manufacturer                     Cypress        Cypress        Cypress
Model                            CY7C1313KV18   CY7C1412KV18   CY7C1515KV18
Capacity                         18Mb           36Mb           72Mb
Speed                            250MHz         250MHz         300MHz
Ports                            1              1              1
Pins                             165            165            165
Package Area (mm²)               195            195            195
Sept. 2012 Price (Qty: 1000+)    $22.02         $40.89         $132.92
Table 2.2.: Micron DRAM Characteristics.
Manufacturer                     Micron         Micron         Micron
Technology                       DDR2 SDRAM     DDR3 SDRAM     DDR3 SDRAM
Model                            MT47H128M8     MT41J128M8     MT41J128
Capacity                         1Gb            1Gb            2Gb
Speed                            333MHz         667MHz         800MHz
Package Area (mm²)               80             126            126
Pins                             60             78             78
Sept. 2012 Price (Qty: 1000+)    $4.80          $4.65          $5.48
2.1. DRAM Characteristics
Commodity DRAM memories are used as the main memory in most general purpose
computers, in mobile phones and many embedded devices. Since each memory cell (bit)
in the DRAM memory requires only a single transistor (1-T), density is much improved
over SRAM technologies in which memory cells have 4-T or 6-T architectures. This means
not only that the average price-per-capacity is lower for DRAM technologies, but that they
are available in densities much larger than equivalent SRAM technologies. Micron oﬀers
capacities of up to 8Gbits in a single 92-ball 10.5x12mm Ball Grid Array package.
Indicative pricing for the most price-competitive SRAMs (those with single ports which
return a single data-word per clock-cycle) is shown in Table 2.1. Comparison with Table
2.2 shows the price gap with contemporary SDRAM devices: SDRAMs offer 250× more
capacity per dollar than SRAM equivalents.
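The capacity-per-dollar comparison can be checked directly from the figures in Tables 2.1 and 2.2. The sketch below is illustrative only: the text does not say which devices are paired, so it assumes a column-by-column pairing of the two tables, and it assumes 1Gb is counted as 1024Mb. Under those assumptions the ratio ranges from roughly 250× to roughly 690×, with the middle column giving the 250× figure.

```python
# Capacity (Mb) and unit price (USD) copied from Tables 2.1 and 2.2.
# Assumption: 1Gb is treated as 1024Mb.
SRAM = [("CY7C1313KV18", 18, 22.02),
        ("CY7C1412KV18", 36, 40.89),
        ("CY7C1515KV18", 72, 132.92)]
DRAM = [("MT47H128M8", 1024, 4.80),
        ("MT41J128M8", 1024, 4.65),
        ("MT41J128", 2048, 5.48)]


def mbit_per_dollar(capacity_mbit, price_usd):
    """Capacity bought per dollar, in megabits."""
    return capacity_mbit / price_usd


def capacity_per_dollar_ratios():
    """DRAM-to-SRAM capacity-per-dollar ratio, pairing table columns in order.

    The column-wise pairing is an assumption made for illustration.
    """
    ratios = []
    for (_, s_cap, s_price), (_, d_cap, d_price) in zip(SRAM, DRAM):
        ratios.append(mbit_per_dollar(d_cap, d_price)
                      / mbit_per_dollar(s_cap, s_price))
    return ratios


if __name__ == "__main__":
    for (s, _, _), (d, _, _), r in zip(SRAM, DRAM, capacity_per_dollar_ratios()):
        print(f"{d} vs {s}: {r:.0f}x")
```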
Alongside the efficiency gained from using denser memory-cell arrays than SRAM technology,
modern double-data-rate (DDR) SDRAM devices have fewer pins than equivalent
SRAM technology. This reduces packaging costs. One method used by SDRAM devices
to reduce pin-count is time-multiplexing of the memory address bus. Addresses are divided
into row and column address fields which are asserted onto the address bus by the
controller in different clock cycles. This reduces the overall pin count on both
the memory chip and the device connecting to it. Another
technique is data transfer to and from the DRAM chip on both the rising and falling
edge of the clock, thus enabling fewer data pins at a given memory bandwidth¹. All
these cost-cutting features help make DDR SDRAMs the dominant commodity memory
technology.
The DRAM device architecture, optimized for high bandwidth, low pin-count and
high capacity relative to SRAM devices, devolves much of the responsibility for managing
memory operation to an external controller. The controller in an SDRAM-based system
must explicitly sequence the activation of rows within the SDRAM before columns in
those rows can be read. It must manage operations in up to eight independent banks and
schedule periodic refresh cycles, since leakage current from the cells in the memory array
eventually makes the stored charges unreadable. All these operations must meet the
timing constraints of the SDRAM device to ensure reliable operation. In Section 2.2, we
give a more detailed description of SDRAM device structure to motivate a discussion of
the parameters which limit SDRAM performance.
2.2. Dynamic RAM Device Operation
This section gives an outline of how DRAM devices are structured, and of the sequence of
actions necessary to store or read data from them. We show how device architecture
determines memory timing constraints which must be enforced by the external memory
controller.
¹ This technique is also increasingly employed in SRAM devices, which are equipped with a
single address bus and separate data buses for read and write operations to facilitate
double-buffering techniques.
DRAM devices store data as charged nodes within a dense array of memory-cells [2].
Figure 2.1 shows eight DRAM cells in an open-array architecture, with four logical rows
and two logical columns. The figure shows row-lines running horizontally across the
diagram. The assertion of a row-line enables charge-sharing between each memory
cell capacitor in the row and an associated bit-line running vertically through the array.
Activation of each row-line is mutually exclusive, so sense amplifiers are arranged to
measure the voltage difference between the selected bit-line and an adjacent inactive
bit-line used as a reference. The sense amplifiers boost the small charge differential stored
[Figure 2.1 labels: bit-line 0 to bit-line 3, two sense amplifiers, multiplexer with
col-select input, data-out.]
Figure 2.1.: DRAM architecture showing eight DRAM bit cells (1-T).
Figure 2.2 shows a sense amplifier implemented using two cross-coupled inverter
structures. If the two bit-lines are at different voltages and SAN is brought closer to
ground, the N-type transistor with the higher gate voltage (more positive bit-line) will
begin to conduct, and the current flowing through it will discharge the more negative
bit-line to ground. If SAP is subsequently pulled gradually to VCC, current flows through
the P-type transistor whose gate is held at GND and the more positive bit-line will be
charged to VCC. While DRAM device substrates are doped to ensure low leakage current from
the array of cells, leakage from the memory cells eventually means logical values cannot
be distinguished. Reading a memory row is itself a restorative action, but SDRAM
controllers typically also issue periodic ‘refresh’ commands to each row in the device to
ensure the entire contents of the memory are preserved regardless of the frequency with
which individual items are accessed.
Figure 2.2.: Sense amplifier structure. [Labels: bit-line 0, bit-line 1, SAP, SAN.]
Figure 2.3.: Precharge circuit structure. [Labels: VCC/2, PRE, bit-line 0, bit-line 1.]
A consequence of the doping strategy adopted in DRAM device substrates (which is
tailored to ensure low leakage) is that transistor switching speed on the substrate is
slow relative to processes optimized for logic, such as those found on an FPGA or CPU
die. This provides the motivation for the prefetching architecture used on all modern
SDRAM devices. For instance, all DDR3 devices employ an 8-n prefetch mechanism,
which subdivides the DRAM arrays of each bank into eight striped subarrays which are
accessed in parallel in response to each memory burst request. In DDR3 memories, the
memory array can therefore operate four times slower than the command clock (and
eight times slower than the double-data-rate data clock). Each burst request causes the
transfer of multiple data items over several clock periods, with a minimum burst size
determined by the device prefetch mechanism (8-n in DDR3 SDRAM memories, and 4-n
in DDR2 SDRAM memories).
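The relationship between prefetch depth, array speed and minimum burst length can be made concrete with a few lines of arithmetic. This is an illustrative sketch only: the function names are hypothetical, and the clock frequencies are the 800MHz DDR3 and 333MHz DDR2 figures from Table 2.2.

```python
def array_clock_mhz(command_clock_mhz, prefetch):
    """Internal array rate implied by an n-word prefetch on a DDR interface."""
    data_rate = 2 * command_clock_mhz   # two transfers per command clock (DDR)
    return data_rate / prefetch

def min_burst_cycles(prefetch):
    """Command-clock cycles occupied on the data bus by one minimum-length burst."""
    return prefetch // 2

ddr3_array = array_clock_mhz(800, prefetch=8)   # DDR3 part from Table 2.2
ddr2_array = array_clock_mhz(333, prefetch=4)   # DDR2 part from Table 2.2

print(f"DDR3: array at {ddr3_array} MHz, minimum burst {min_burst_cycles(8)} cycles")
print(f"DDR2: array at {ddr2_array} MHz, minimum burst {min_burst_cycles(4)} cycles")
```

For the DDR3 part the array runs at a quarter of the command clock, as the text states, and a minimum burst occupies the data bus for four command-clock cycles (two for DDR2).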
The sequence of operations needed to read a burst of data from an address in the DRAM
is as follows:
1. A ‘precharge’ command is issued to the DRAM device by the memory controller.
SAP and SAN are allowed to float and a precharge circuit like the one shown in
Figure 2.3 ensures all bit lines (one for each column in the array) are precharged to
exactly matching voltages halfway between VCC and GND.
2. The ‘PRE’ signal is deasserted, leaving the bit-lines floating. The capacitance of the
long bit-lines means that they retain their potential at a midpoint between high and
low logic values.
3. An ‘activate’ command is issued to the DRAM device by the memory controller. The
address asserted alongside the ‘activate’ command on the address bus is decoded
and the row-line of the selected SDRAM row is driven high. This connects one
capacitor to one of the two bit lines connected to each sense amplifier. Charge is
shared between the selected storage cell and the appropriate bit-line. If a ‘1’ is stored
on the capacitor, the voltage on the associated bit-line will rise slightly. If a ‘0’ is
stored, the voltage on the bit-line will drop slightly. Since only one row of cells was
connected to the bit-lines, the voltage on the reference bit-line at the sense amplifier
should not change and, as a consequence, a differential voltage will be observed at the
sense amplifier.
4. The sense amplifier is switched on by driving SAP and SAN to the positive and
negative supply voltages respectively. The positive feedback of the cross-coupled
inverters amplifies the small voltage difference until one bit line is fully low and the
other is fully high. At this point, the row is in an ‘open’ state and data can be read
continuously from that row using ‘read’ commands. A time delay referred to as Trcd
specifies the maximum time required to reach a stable state, and so constrains the time
a memory controller must wait before issuing a read command.
5. A ‘read’ command is issued to the SDRAM device by the memory controller along-
side a column address asserted on the address bus. Column addresses drive a
multiplexer to select which words to read from the open row.
6. While reads proceed, current flows back up the bit lines from the sense amplifiers
to the storage cells. This restores the charge in (refreshes) the storage cell. Due
to the length of the bit lines, this takes significant time beyond the end of sense
amplification, and overlaps with one or more column reads.
7. When done with the current row, the sequence can begin again with the issuing of
a new ‘precharge’ command.
Each of the operations in the sequence described above takes a finite amount of time
determined by the physical layout of the device. As an example, the time taken to
precharge the device bit-lines depends on their RC values and transistor sizing in the
sense amplifier. Following a ‘precharge’ command, the memory controller must ensure
that sufficient time has elapsed for the bit-lines to be fully precharged before issuing
an ‘activate’ command. After the ‘activate’ command is issued, the memory controller
must not issue a ‘read’ or ‘write’ command until the bit-lines have settled to a stable state.
Timing parameters are specified by memory device manufacturers and then hard-coded
into memory controllers, or read using a side-channel method from memory modules
when the system is initialized. Correct operation of the memory is only guaranteed when
all of these timing requirements are met.
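The way a controller pads a command sequence to respect these delays can be sketched as follows. This is a minimal illustration rather than a real controller: the function names are hypothetical, only the precharge-to-activate and activate-to-read delays are modelled, and the 15ns values and 200MHz clock are assumptions chosen to match the DDR2 figures used later in this chapter.

```python
import math

# Assumed timing parameters in nanoseconds (placeholders for datasheet values).
T_CK = 5.0    # clock period at 200MHz
T_RP = 15.0   # minimum delay from 'precharge' to 'activate'
T_RCD = 15.0  # minimum delay from 'activate' to 'read' or 'write'

def cycles(t_ns, t_ck=T_CK):
    """Round a delay in ns up to a whole number of clock cycles."""
    return math.ceil(t_ns / t_ck)

def read_schedule(start_cycle=0):
    """Return (cycle, command) pairs for one precharge/activate/read sequence.

    NOP commands fill the cycles in which the next command is not yet legal.
    """
    schedule = []
    t = start_cycle
    for command, wait_ns in (("PRE", T_RP), ("ACT", T_RCD), ("READ", 0.0)):
        schedule.append((t, command))
        gap = cycles(wait_ns)
        # Pad with NOPs until the following command may be issued.
        schedule.extend((t + i, "NOP") for i in range(1, gap))
        t += max(gap, 1)
    return schedule

sched = read_schedule()
print(" ".join(command for _, command in sched))
```

With these values the ‘read’ command cannot be issued until six cycles after the ‘precharge’, with four of the intervening command slots filled by NOPs.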
Since the data values within the DRAM are held on charged nodes, leakage current
through the transistor in each bit-cell eventually causes data errors. The values must
therefore be periodically refreshed. Each ‘activate’ command issued by the controller
regenerates the values in the bit-cells. The controller can also issue ‘refresh’ commands
which renew rows across multiple banks simultaneously. The necessity of issuing refresh
commands reduces the bandwidth and increases the worst-case latency of the memory
interface. The overall impact of refresh commands on available bandwidth is calculated
in Section 2.3.
Synchronous Dynamic Random Access Memories (SDRAMs) differ from older DRAM
technologies in that the commands issued to them by a controller are sampled on a clock
edge, rather than operating asynchronously to the system clock. The challenge when
designing an SDRAM controller is therefore the generation of a discrete sequence of
commands to be issued to the memory on different clock cycles.
While the timing constraints are consistent across the various DRAM device families,
we shall specifically concentrate on the timing constraints of DDR2 SDRAM standard
parts. These SDRAM chips are made up of multiple independent banks (typically four or
eight banks per chip). This helps improve performance because partitioning a large array
into smaller arrays reduces the length of the individual bit-lines (and thus their resistance,
parasitic capacitance and charging times). Performance may also be enhanced by
interleaving operations across the memory banks, meaning that the latency of the
‘precharge’ and ‘activate’ operations needed to change rows can be hidden by simultaneous
accesses to a different bank.
In the section which follows, we describe how the physical properties of SDRAM
devices determine bounds on SDRAM bandwidth and latency. We can establish some
bounds on that performance by making assumptions about controller behaviour.
2.3. Bounds on Bandwidth/Latency
In Section 2.2, we described the physical structure of an SDRAM device, and the sequence
of operations required to access information stored on it. In this section, we examine
in more depth the important constraints on device performance which arise from the
physical device structure.
In Table 2.3 we give representative data from [3] stating some of the timing constraints
that the memory controller must satisfy. This includes the timing constraints already
highlighted in Section 2.2 and some additional ones specified by the device manufacturer.
In particular we highlight the Tfaw and Trrd constraints. These restrict the minimum time
delay between ‘activate’ commands to independent banks. These restrictions are necessary
to control power consumption, since DRAM devices are highly sensitive to high current
draw (which causes the device supply voltage to droop). In general, the Trcd, Trp and
Trc timing constraints from Table 2.3 mean that the bandwidth and latency of SDRAM
accesses are determined by the controller and the sequence of addresses requested. In
the sections which follow, we demonstrate the worst and best-case memory performance
arising from these constraints.
2.3.1. Simple Controller: Worst Case Bandwidth/Latency
If we consider a simple controller which opens and closes a row for every memory word
requested, we can derive the maximum latency and minimum bandwidth bounds
this would imply. A timing diagram showing the sequence of SDRAM commands this
controller would generate is shown in Figure 2.4. The command bus contains ‘Nop’
commands which are introduced by the controller to ensure that the timing constraints
are met. For every transaction, the controller opens an SDRAM row by issuing an ‘activate’
command, reads a single burst of data by issuing a ‘read’ command and then closes the
row with a ‘precharge’ command. In Figure 2.4, data read from memory is only valid
when the ‘valid’ signal is asserted high. The controller is inefficient since it does not
return valid data in every cycle.
Figure 2.4.: A sequence of commands with worst-case bandwidth utilization.
Delays from a typical DDR2 device [3] are used to demonstrate the maximum bandwidth
this controller can deliver. The interval between successive transactions is determined
by the sum of the SDRAM delays between activating a row and reading from it
(trcd = 15ns), between issuing a read and precharging the row (trtp = 7.5ns) and between
precharging the row and the row activation of the next transaction (trp = 15ns). In (2.1),
a minimal burst length of four data words is assumed, and contributes two clock cycles of
delay. However, constraints also bound the minimum period between successive activate
commands (trc = 55ns) and the minimum period between an activate and a precharge
command (tras = 40ns). These are represented in (2.3) and (2.2) respectively.
\[ \text{transaction interval (cycles)} \geq \left\lceil \frac{t_{rcd}}{T_{ck}} \right\rceil + 2 + \left\lceil \frac{t_{rtp}}{T_{ck}} \right\rceil + \left\lceil \frac{t_{rp}}{T_{ck}} \right\rceil \tag{2.1} \]
\[ \text{transaction interval (cycles)} \geq \left\lceil \frac{t_{ras}}{T_{ck}} \right\rceil + \left\lceil \frac{t_{rp}}{T_{ck}} \right\rceil \tag{2.2} \]
\[ \text{transaction interval (cycles)} \geq \left\lceil \frac{t_{rc}}{T_{ck}} \right\rceil \tag{2.3} \]
By inspection, (2.2) and (2.3) both limit the transaction interval to ≥ 11 cycles at 200MHz
(the bound from (2.1) is 10 cycles), giving a bandwidth of 18.18MBytes/s². The fixed
latency of a single transaction (without refresh) is trcd + CL cycles, where CL is the
column latency, typically a 3 or 4 clock cycle latency between selecting a word from
the active row and its assertion on the data bus. With CL = 4, this gives a 7 cycle latency
at 200MHz.
² Or alternatively 72.72MBytes/s if four consecutive words are requested and they can be
coalesced into a burst, for instance in a 32-bit word request.
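The three bounds can be evaluated numerically. The sketch below assumes a 200MHz clock (Tck = 5ns), an 8-bit data word, and takes trc as tras + trp = 55ns; the variable names are illustrative.

```python
import math

T_CK = 5.0  # ns, 200MHz clock
# Delays in ns for a typical DDR2 device. T_RC is taken as T_RAS + T_RP,
# which is an assumption of this sketch.
T_RCD, T_RTP, T_RP, T_RAS, T_RC = 15.0, 7.5, 15.0, 40.0, 55.0
BURST_CYCLES = 2  # a burst of four words on a double-data-rate bus

def cyc(t_ns):
    """Round a delay in ns up to whole clock cycles."""
    return math.ceil(t_ns / T_CK)

bound_2_1 = cyc(T_RCD) + BURST_CYCLES + cyc(T_RTP) + cyc(T_RP)
bound_2_2 = cyc(T_RAS) + cyc(T_RP)
bound_2_3 = cyc(T_RC)
interval = max(bound_2_1, bound_2_2, bound_2_3)  # all three must hold

def bandwidth_mbytes(bytes_per_transaction):
    """Sustained bandwidth in MBytes/s when each transaction returns this many bytes."""
    return bytes_per_transaction / (interval * T_CK * 1e-9) / 1e6

print("bounds (cycles):", bound_2_1, bound_2_2, bound_2_3)
print(f"one byte per transaction:   {bandwidth_mbytes(1):.2f} MBytes/s")
print(f"four bytes per transaction: {bandwidth_mbytes(4):.2f} MBytes/s")
```

The largest of the three bounds sets the transaction interval, so the bandwidth figure follows from one transaction every 55ns.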
This achievable bandwidth will be reduced by the refresh cycles which the controller must
issue to ensure memory retention. Some indication of the extent of this performance
penalty in the Altera SDRAM controller is given in [4]. If we assume a refresh time of
Trfc = 105ns [3], two words are prevented from being read by each refresh. A refresh must
be issued to every row with a maximum period of 64ms. With 8192 rows, this means the
memory controller must schedule 128k refresh operations a second, reducing the overall
number of reads by 128k × 2. This brings the overall bandwidth down to 17.92MBytes/s,
a reduction of approximately 1.4%. The latency assuming multiple transactions and
including refresh cycles is bounded by a worst case scenario. This scenario occurs when
two reads are requested in consecutive cycles and the controller introduces a refresh
between them. If
Ttransaction is the time taken to complete the first transaction, the latency for this worst case
[...] ⌉ + CL − 1. (2.4)
With a 200MHz clock, this is 38 clock cycles.
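The refresh penalty on bandwidth for the simple controller can be reproduced with the following sketch. It assumes one 8-bit word per 55ns transaction (11 cycles at 200MHz, as in Section 2.3.1) and that each refresh displaces a whole number of transactions; the variable names are illustrative.

```python
import math

ROWS = 8192
REFRESH_PERIOD_S = 64e-3   # every row must be refreshed within 64ms
T_RFC_NS = 105.0           # duration of one refresh operation
TRANSACTION_NS = 55.0      # 11 cycles at 200MHz (Section 2.3.1)

base_bw = 1 / (TRANSACTION_NS * 1e-9)          # bytes/s, one byte per transaction
refreshes_per_s = ROWS / REFRESH_PERIOD_S      # refresh operations each second
lost_per_refresh = math.ceil(T_RFC_NS / TRANSACTION_NS)  # transactions displaced
refreshed_bw = base_bw - refreshes_per_s * lost_per_refresh
reduction = 1 - refreshed_bw / base_bw

print(f"{refreshes_per_s:.0f} refreshes/s, {lost_per_refresh} words lost per refresh")
print(f"{refreshed_bw / 1e6:.3f} MBytes/s after refresh ({reduction:.2%} reduction)")
```

Under these assumptions the refresh overhead costs a little over one percent of the worst-case bandwidth.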
2.3.2. Complex Controller: Best Case Bandwidth/Latency
Figure 2.5.: A sequence of commands which maximizes bandwidth utilization.
In the best case, where the SDRAM controller fetches data from interleaved rows such
that it is able to completely hide the row-swap latency, it can sustain a data rate of 2 words
per clock cycle on its data bus. At 200MHz, and with an 8-bit word-size, this gives us
400MBytes/s of bandwidth, a factor of 22× better than the worst case above.
While there are many ways of managing refresh in this scenario, one is to precharge all
banks at the start of a refresh cycle, perform a refresh of all banks and then reactivate all
banks sequentially. Reactivating all eight banks in a typical DDR2 device must take into
account the Tfaw constraint (which prevents re-activation of all banks at the minimum
Trrd period). A minimum refresh schedule begins with a ‘precharge’ command, which
takes 17.5ns to precharge all banks. A subsequent ‘refresh’ command takes Trfc = 105ns
to refresh all banks, and a further Trcd = 15ns is needed to re-activate a row, for a total of
137.5ns. This brings worst case latency to 10ns + 137.5ns + CL = 167.5ns, which is 34
clock cycles at 200MHz.
The bandwidth reduction incurred by performing this refresh schedule 128k times
a second (to meet a maximum refresh period of 64ms) is 128k × 137.5ns = 17.6ms of
refresh operations each second (including activations) and, given that we have assumed
two words per cycle, this is 7.04MByte/s of lost bandwidth. This makes the maximum
bandwidth after considering refresh cycles 392.96MByte/s.
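The best-case figures can be reproduced in the same way. This sketch assumes the 200MHz clock, 8-bit word and 137.5ns refresh schedule quoted above, and compares the peak rate against one byte per 55ns transaction for the simple controller; the names are illustrative.

```python
CLOCK_HZ = 200e6
WORDS_PER_CYCLE = 2          # double data rate
WORD_BYTES = 1               # 8-bit word
REFRESHES_PER_S = 128_000    # 8192 rows every 64ms
REFRESH_SCHEDULE_NS = 17.5 + 105.0 + 15.0   # precharge all + Trfc + Trcd

peak_bw = CLOCK_HZ * WORDS_PER_CYCLE * WORD_BYTES / 1e6    # MBytes/s
busy_fraction = REFRESHES_PER_S * REFRESH_SCHEDULE_NS * 1e-9
lost_bw = peak_bw * busy_fraction
net_bw = peak_bw - lost_bw

worst_case_bw = 1 / 55e-9 / 1e6   # MBytes/s for the simple controller
speedup = peak_bw / worst_case_bw

print(f"peak {peak_bw:.2f}, lost to refresh {lost_bw:.2f}, net {net_bw:.2f} MBytes/s")
print(f"best case is {speedup:.0f}x the worst case")
```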
From this examination of the best and worst case performance, we see the huge impact
that the ordering of memory references can have upon available memory bandwidth.
In the best case, memory bandwidth can be more than 20× that which can be naively
achieved using a simple controller.
Many commercial memory controllers exploit this by buffering and dynamically
reordering memory transactions, both to reduce the number of row swapping operations
and to hide the latency associated with them by scheduling a concurrent memory
operation in another bank. Such controllers are typically parameterized by the number of
transactions they are able to buffer. Increasing the number of buffered transactions
increases the probability of a grouping of transactions in the same row and the probability
of transactions in different banks being available. This comes at a cost in memory
area (in which to store the buffered transactions), logic area (since some logic must be used
to select which transaction is scheduled), speed (since a choice among more transactions
to schedule implies higher logic fanout) and latency (since grouping of transactions is only
effective if there are multiple transactions available to coalesce into one SDRAM row).
The dynamic scheduling approach does not rely on any knowledge of the ordering of
incoming transactions. Without this knowledge, it is not possible to make guarantees
of bandwidth beyond the 17.92MBytes/s determined for the simple controller. However,
where the goal is the design of hardware for a specific function, some prior knowledge
of this transaction ordering exists and static techniques may be used to produce more
efficient hardware, so that dynamic scheduling of SDRAM commands is unnecessary. This
would mean a predictable level of performance can be guaranteed to the user application.
That realisation underpins the technical chapters of this thesis, which seek to exploit prior
knowledge of SDRAM memory operations and group them to ensure high performance.
All controller designs are constrained by the SDRAM parameters associated with a
specific memory device, and in turn by the ever evolving process technology upon which it
is built. In the following section, we describe the historical scaling trends which have
underpinned performance in SDRAM memory devices.
2.4. Historical Scaling Trends
The number of pins available on semiconductor devices, and the data rate available
on such pins, has not scaled with transistor density [5]. This, among other factors, has
contributed to a ‘memory wall’ in which the parallelism achievable with computing
devices is limited by the speed at which large off-chip memories can be accessed. In this
section, we highlight two trends.
2.4.1. Scaling Trends in Dynamic Memory Parameters
In Table 2.4, we show a selection of Micron DRAM components spanning five generations
of devices. The table shows that the Trc parameter, which determines the minimum time
which must elapse between two successive ‘activate’ commands issued within a bank,
falls from 66ns to 48.75ns over the 12-year period spanned by the synchronous DRAM
devices in the table. This 26% drop in Trc occurs over a time period in which we see
a six-fold increase in memory clock speeds. The overall effect is that we see a four-fold
increase in the number of clock periods between successive activations to the same bank.
Table 2.4.: Scaling Trends in Trc for commodity DRAM technologies.
Technology  Part No           Year  Capacity  Clock Freq.  Trc      Ratio (Trc : Tck)
EDO         MT4LC4M16R6-5     1998  64MBit    -            84ns     -
SDRAM       MT48LC32M8A2-75   1999  256MBit   133MHz       66.00ns  9
DDR         MT46V64M8-75E     2003  512MBit   133MHz       65.00ns  9
DDR2        MT47H128M8-3      2004  1GBit     333MHz       55.00ns  19
DDR3        MT41J256M8-187    2006  2GBit     533MHz       52.50ns  28
DDR3        MT41K2G4-125      2011  8GBit     800MHz       48.75ns  39
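The ratio column of Table 2.4 is the number of whole clock periods needed to cover Trc. A short sketch, assuming the ratio is rounded up to the next integer (which reproduces every entry in the table):

```python
import math

# (technology, year, clock frequency in MHz, Trc in ns) for the synchronous parts
# in Table 2.4.
parts = [
    ("SDRAM", 1999, 133, 66.00),
    ("DDR", 2003, 133, 65.00),
    ("DDR2", 2004, 333, 55.00),
    ("DDR3", 2006, 533, 52.50),
    ("DDR3", 2011, 800, 48.75),
]

def trc_cycles(freq_mhz, trc_ns):
    """Clock periods between successive activations to the same bank."""
    return math.ceil(trc_ns * freq_mhz / 1000)

ratios = [trc_cycles(freq, trc) for _, _, freq, trc in parts]
trc_drop = 1 - parts[-1][3] / parts[0][3]      # fractional fall in Trc
clock_gain = parts[-1][2] / parts[0][2]        # growth in clock frequency
ratio_gain = ratios[-1] / ratios[0]            # growth in cycles per row change

print(ratios)
print(f"Trc falls by {trc_drop:.0%}; clock rises {clock_gain:.1f}x; "
      f"ratio rises {ratio_gain:.1f}x")
```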
We can hypothesize that shrinking SDRAM process geometry has been used to deliver
larger device capacities rather than memory arrays with smaller physical dimensions on
the chip. This would explain the very slow decrease in Trc over time. However, the overall
impact of this is that the performance penalty for changing rows in an SDRAM device has
increased significantly with each new memory generation. We show this in Figure 2.6.
Here the ratio of Trc : Tck is plotted against the year of introduction for the five SDRAM
technologies in Table 2.4. The figure shows that the number of cycles required to change
DRAM row has increased linearly over a 10 year period. The 2009 ITRS Roadmap [6]³
notes ‘Keeping the chip size approximately constant as the DRAM capacity (number of
bits per chip) is increased with scaling is very important from a chip cost point of view’.
This would suggest that the ratio of Trc : Tck is driven by process geometry and that,
without a change in overall device architecture, the upward trend is likely to continue in
future devices.
Future developments, which are focused on standardisation of DDR4 technology,
specify a point-to-point topology for connecting memory devices to memory channels. The
move is designed to reduce the signal integrity and bus-termination issues which arise when
the data signals from multiple SDRAM devices are multiplexed onto a single data-bus.
The success of this approach will rely on avoiding the need for external switching fabric,
and is in part predicated on the emergence of stacked-die DRAM products which achieve
a very high density through 3D integration. Such a move will enable memory frequency
to increase further in DDR4 devices (to > 2GHz) with little impact on Trc. We might very
reasonably expect the ratio of the successive activation period Trc to the clock period to
double again in the next five years.
³ Process Integration, Devices and Structures, Page 12
Figure 2.6.: Plot showing the historical trend in the ratio of Trc to Tck for successive DRAM
technologies.
2.4.2. Scaling Trends in Silicon Package Pin Density
Table 2.5 shows data from the ITRS roadmap [5] predicting trends for high performance
computing and commodity memory devices over a fifteen year period. From this table,
we see exponential growth in the number of transistors, as predicted by the scaling trend
first identified by Moore [7]. The number of pins on a high-end microprocessor
or ASIC is predicted to rise linearly over the next 15 years, with the number of pins
on a commodity memory device not expected to change. The cost-per-pin for high end
processors is expected to slowly decrease over time, keeping a constant cost per packaged
chip, while the packaging cost for an SDRAM chip is also expected to remain constant.
All this together means we face a world of exponential scaling of silicon device area for
compute, but with much slower scaling of the memory interconnect. The introduction
of through-silicon-vias enables higher density connections between dies than flip-chip
packaging. However, the density of such vias is limited by minimum area requirements
necessary both to meet wire-delay constraints and to ensure die-stacked products have
high manufacturing yields and are robust in everyday use.
[Table 2.5: ITRS roadmap data [5] by year; caption and column headings not recoverable.]
2009  2200  1.64  -       84-100  0.21
2010  2500  1.56  -       84-100  0.20
2011  2500  1.48  4.424   84-100  0.25
2012  2900  1.41  4.424   84-100  0.24
2013  2900  1.34  8.848   84-100  0.23
2014  3600  1.27  8.848   84-100  0.22
2016  4000  1.15  17.696  84-100  0.21
2018  5300  1.04  17.696  84-100  0.21
2020  6500  0.94  35.391  84-100  0.21
2022  6500  0.85  70.782  84-100  0.21
2024  6500  0.77  70.782  84-100  0.21
From this, it is clear the ratio of silicon area to available memory bandwidth will
increase over time. In some applications, this will impact the efficient use of silicon area,
since data cannot be supplied quickly enough to support data-parallel execution. In the
most abstract sense, this shifts the balance between computation and communication,
and algorithms may have to be adapted to emphasise cheap re-computation of results
over storage. It also means that techniques which trade silicon area to increase memory
bandwidth will seem increasingly attractive over time.
The two trends identified in this section, a trend towards longer cycle penalties for
switching SDRAM rows and an ever widening gap between on-chip silicon area and
off-chip memory bandwidth, leave an opportunity for more intelligent and more complex
SDRAM controllers. In Section 2.5, we describe existing literature in which silicon area
has been traded for DRAM interface efficiency.
2.5. Existing Approaches
In this section, we examine existing approaches to developing efficient SDRAM controller
designs. We start by examining dynamic methods in which efficient memory access
sequences are formed by a controller from a queue of memory requests. These are widely
deployed in general purpose and high-performance computing platforms. There are two
key drawbacks to using these methods in embedded and real-time applications. The
first is the significant hardware cost of implementing large buffers for memory requests
and complex arbitration logic to select an order in which to service them. The second
problem is that compile-time performance analysis of a system with dynamic scheduling
of memory requests is difficult. When engineers must guarantee that real-time deadlines
are met, they adopt conservative designs based on analysis of worst-case performance. As
shown in Section 2.3, this leads to a very significant drop in achievable performance.
In the latter part of this section, we consider existing research into static methods, in
which compile time analysis is used to improve memory performance. The later technical
chapters of this thesis develop the use of static analysis techniques to build predictable,
application-specific, bandwidth-efficient SDRAM controllers.
2.5.1. Memory Controllers with Dynamic Command Queues
On general purpose computers, the sequence of memory requests generated by the CPU
is typically unknown at design time, so the processor cache is designed to be able to reuse
and reorder data without a static compile time analysis of the program to be executed.
Complex dynamic memory controllers in modern CPUs buffer and dynamically reorder
cache line-fill requests to external memory. Both the cache and memory controllers
therefore contain memory and associative logic to buffer and dynamically select and
service memory requests.
A very large body of work is dedicated to improving cache performance; a comprehensive
review can be found in [8]. Most of this work assumes the sequence of addresses
is randomly (but not necessarily uniformly) distributed and describes optimizations of
dynamic on-chip structures to exploit data locality. As well as the direct area cost these
structures impose, the presence of caches and dynamic memory controllers makes it very
difficult to predict memory performance at compile-time.
In the most basic SDRAM controllers [9], data requests are held in a queue and processed
in order. The controller keeps an explicit record of the state of the rows in the SDRAM
bank and issues the memory ‘precharge’, ‘activate’, ‘read’ and ‘write’ commands needed
to service the active data request at the head of the queue. A scheduling policy determines
how the controller issues commands in response to the queued requests. Existing studies
have sought to quantify the impact of different scheduling policies [10] and architectural
parameters [11] upon the performance of benchmark simulations.
The choice of scheduling policy is exemplified in the choice between ‘open-row’ and
‘closed-row’ scheduling policies. A controller may choose an ‘open-row’ policy, which
only issues ‘precharge’ requests when they are required to service the active data request.
Such controllers keep track of the active row in each bank. This policy is beneficial when
data-locality within an application means repeated bursts of data are issued to addresses
falling within a single DRAM row. In contrast, a ‘closed-row’ policy may be beneficial
where the user expects that the next data request issued to a certain bank will require a
different active row. In this case, Rixner et al. [10] show that the ‘closed-row’ policy can
reduce the average latency of memory requests and hence improve performance. In [11],
larger benchmarks from the SPEC2000 suite are used to demonstrate the interaction
between cache memory hierarchy design and SDRAM controller design. That work
demonstrates that the memory-mapping of addresses to on-chip memory (cache-tag and
set-associative mapping based on bit-fields within the memory address) and the
memory-mapping of addresses to off-chip memory (SDRAM bank, row and column addresses,
also based on bit-fields extracted from the memory address) interact badly on cache
write-backs, which in some configurations guarantees a DRAM bank-conflict with each
write-back operation. This means that every time a cache write-back occurs, at least one
DRAM ‘precharge’ and ‘activate’ command must be issued to complete the operation. The
paper introduces an alternative address-mapping strategy which improves performance
in the SPEC2000 benchmarks by 16% on average (and up to 64% in the most intensive
dense linear-algebra applications).
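The trade-off between the two row policies can be illustrated with a toy latency model of a single bank. This is a sketch under stated assumptions, not a model taken from [10] or [11]: the cycle costs are hypothetical values matching trp = trcd = 15ns at 200MHz with CL = 4, and requests are assumed to arrive far enough apart that a closed-row precharge has completed before the next request.

```python
# Hypothetical cycle costs at 200MHz (trp = trcd = 15ns, CL = 4).
T_RP, T_RCD, CL = 3, 3, 4

def read_latencies(rows, policy):
    """Latency in cycles of each read to one bank under a row policy.

    rows   -- row addresses requested, in order
    policy -- "open" leaves the row active after a read;
              "closed" precharges the bank after every read
    """
    open_row = None
    latencies = []
    for row in rows:
        if open_row == row:
            latency = CL                   # row hit: read immediately
        elif open_row is None:
            latency = T_RCD + CL           # bank already precharged: activate, read
        else:
            latency = T_RP + T_RCD + CL    # row conflict: precharge, activate, read
        latencies.append(latency)
        open_row = row if policy == "open" else None
    return latencies

local = [7, 7, 7, 7]       # repeated bursts to a single row
scattered = [1, 2, 3, 4]   # every request needs a different row
for name, rows in (("local", local), ("scattered", scattered)):
    for policy in ("open", "closed"):
        print(name, policy, sum(read_latencies(rows, policy)))
```

In this model the open-row policy wins when requests stay within one row, and the closed-row policy wins when every request targets a different row, which mirrors the qualitative argument above.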
The variation in results demonstrated in [11] backs up a key result from [10] that none
of the fixed scheduling policies nor parameter choices considered were optimal across
all benchmarks. This provides the motivation for the approach considered in this thesis,
where FPGA programmability is used to pursue an application-specific approach and an
analytical model used to evaluate application performance at compile time, rather than
validation of design parameters through exhaustive simulation.
Others have tried to use dynamic methods to improve memory performance [12, 13, 14].
These three works in particular describe algorithms which perform associative lookups of
incoming requests against queued requests to try to reduce the number of requests that must
be issued to external memory. Incoming read requests trigger a search for a queued
write-request to the same address. If a matching request is found, data-forwarding from the
write-queue means the read request can complete immediately. In each paper, the authors
also consider coalescing of multiple read-requests to the same address where they exist
within the queue. This technique can be effective in improving memory performance if
the queue contains pending requests, but the act of maintaining a large queue implies that
requests are not serviced quickly. Shao and Davis [12] consider methods for managing
the length of the queue (Read-Preemption and Write-Piggybacking) which specifically
prioritise read accesses (which may block progress in a processor pipeline until data is
returned) over write accesses unless the write access queue is nearly full and threatens
to cause a processor stall. An adjustable threshold parameter determines whether
Read-Preemption or Write-Piggybacking is favoured, based on the current length of the SDRAM
write queue. In simulation, the two techniques deliver an average 21% reduction
in execution time over in-order scheduling, but with a large variation in performance
when the threshold parameter is adjusted. A general criticism of this approach is that the
architectural parameters are not exposed to the system for runtime optimization. This
means self-tuning approaches are not possible, and the parameter must be chosen at
design time through extensive simulation.
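The data-forwarding lookup described in this paragraph can be sketched in a few lines of C. This is a minimal illustration under our own assumptions, not the mechanism of [12, 13, 14]: the queue depth WQ_DEPTH, the types and the function names wq_push and wq_forward are hypothetical, and the associative search that hardware would perform in parallel is modelled as a loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WQ_DEPTH 8 /* hypothetical queue depth, chosen for illustration */

typedef struct {
    uint32_t addr;
    uint32_t data;
} wq_entry;

typedef struct {
    wq_entry entry[WQ_DEPTH];
    size_t count; /* entries [0, count) are pending, oldest first */
} write_queue;

/* Append a pending write. Returns false when the queue is full, which
   models the condition under which the requestor would have to stall. */
bool wq_push(write_queue *q, uint32_t addr, uint32_t data)
{
    assert(q != NULL);
    if (q->count == WQ_DEPTH)
        return false;
    q->entry[q->count].addr = addr;
    q->entry[q->count].data = data;
    q->count++;
    return true;
}

/* Look up an incoming read against the pending writes. The search runs
   from newest to oldest so that the most recent write to an address wins.
   On a hit the data is forwarded and no external read is needed. */
bool wq_forward(const write_queue *q, uint32_t addr, uint32_t *data)
{
    assert(q != NULL && data != NULL);
    for (size_t i = q->count; i > 0; i--) {
        if (q->entry[i - 1].addr == addr) {
            *data = q->entry[i - 1].data;
            return true;
        }
    }
    return false;
}
```

A read for which wq_forward returns false would be issued to external memory as normal. The sketch also makes the trade-off noted above visible: forwarding only helps while writes remain queued, that is, while they have not yet been serviced.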
The Altera High-Performance Memory Controller [15, 16] uses a more conservative (less
intrusive) mechanism to improve SDRAM controller efficiency. It makes use of the empty
command slots created by burst read and write transactions (which occupy the data-bus
for 2 or 4 cycles, but only require the address/command bus for 1 cycle) to manage the
opening and closing of SDRAM rows for upcoming requests when this activity will not
interfere with the active request being serviced at the head of the request queue. This
can reduce average latency and improve bandwidth eﬃciency where accesses are made
to interleaved banks, since ‘precharge’ and ‘activate’ operations can be overlapped with
‘read’ and ‘write’ operations, hiding the latency which would otherwise cause cycles of
inactivity (NOP operations) on the SDRAM interface. This means fewer cycles in which
a processor connected to the memory controller must stall whilst waiting for data.
The general trend in all the work described so far is towards more out-of-order
processing to drive the efficiency of off-chip bandwidth. This is in keeping with the trend
identified in Section 2.4.2, which favours solutions that trade silicon area (i.e. memory
queues and associative logic to service those queues), which has a long-term exponential
growth trend, to achieve better performance from a memory interface constrained by
achievable pin density (whose availability has grown much more slowly). The most modern
dynamic controller designs [17, 18] have deep associative queues which coalesce reads to
the same bank and issue memory requests out-of-order whilst tracking read-write
dependencies as they arise to ensure semantically correct operation. In modern microprocessor
designs [19], up to four independent memory channels, each with eight independent
banks, are managed within the out-of-order memory controller in each processor to serve
cache-miss requests from up to 16 separate cores.
Two related criticisms of this extreme out-of-order approach are frequently levelled.
The first is the question of whether the memory scheduler optimizes the most critical
memory operations. In general, the memory controller has no information about how
critical specific memory requests are to achieving system performance. This problem
is considered in [20] and [21], which describe memory controllers that guarantee an
allocation of bandwidth and bounded access latency to many requestors in a complex
system-on-chip. A set of optimized short command templates are defined which can
be dynamically composed into an eﬃcient command sequence. Time-slots are allocated
to diﬀerent on-chip requestors using a dynamic credit-controlled priority arbiter. The
memory controller is made aware of the need to provide fair access to many diﬀerent
requestors through the issuing of credits at a fixed rate to the various requestors. This
allows the controller to guarantee appropriate bandwidth allocations are made to multiple
request initiators on a system-on-chip when running a video decoder application.
A second related criticism is whether the relatively short time horizon provided by a
microprocessor queue is enough to determine memory scheduling policy over the lifetime
of a computational task. This question is addressed in [22] and [23, 24] which advocate
the use of multi-level prediction mechanisms (similar to statistical branch prediction
mechanisms) to manage scheduling behaviour over a prolonged period of execution.
In [24], the controller records whether or not each incoming request requires the issuing
of an ‘activate’ command and uses the history of between 5 and 11 previous cycles to
index a table of 2-bit saturating counters. These counters select between ‘open-row’ and
‘closed-row’ policies, with the additional states providing hysteresis behaviour. The
mean overall improvement in execution time of dynamic prediction was 3.7% over a
static ‘open-row’ policy and 19% over a static ‘closed-row’ policy. A strong preference for
‘open-row’ policies in their results is interpreted as a validation of their decision to bias their
predictors toward ‘open-row’ policies, but since normalized execution time varies by
less than one percent with varying history length, no confident conclusion in favour of
dynamic predictors can be made from this work.
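The history-indexed predictor described above can be sketched as follows. This is an illustration under our own assumptions rather than the design in [24]: the names policy_predictor, predictor_choose and predictor_update are hypothetical, the history length is fixed at 5 (the low end of the reported range), and we assume that a request which needed an ‘activate’ command pushes the indexed counter towards ‘closed-row’ while a request which did not pushes it towards ‘open-row’.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 5u /* the text reports history lengths of 5 to 11 */
#define TABLE_SIZE (1u << HISTORY_BITS)

typedef enum { POLICY_OPEN_ROW, POLICY_CLOSED_ROW } row_policy;

typedef struct {
    uint32_t history;            /* last HISTORY_BITS outcomes, 1 = needed 'activate' */
    uint8_t counter[TABLE_SIZE]; /* 2-bit saturating counters, values 0..3 */
} policy_predictor;

/* Assumed initial bias: every counter starts at 0, i.e. strongly 'open-row'. */
void predictor_init(policy_predictor *p)
{
    assert(p != 0);
    p->history = 0;
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        p->counter[i] = 0;
}

/* Counter values 0 and 1 select 'open-row', 2 and 3 select 'closed-row'.
   Having two values per policy is what provides the hysteresis. */
row_policy predictor_choose(const policy_predictor *p)
{
    return p->counter[p->history] >= 2 ? POLICY_CLOSED_ROW : POLICY_OPEN_ROW;
}

/* Train the counter selected by the current history, then shift the new
   outcome into the history register. */
void predictor_update(policy_predictor *p, bool needed_activate)
{
    uint8_t *c = &p->counter[p->history];
    if (needed_activate) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
    p->history = ((p->history << 1) | (needed_activate ? 1u : 0u)) & (TABLE_SIZE - 1u);
}
```

Because each counter must move through an intermediate value before the selected policy changes, a single atypical request does not flip the policy for a given history pattern.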
A more ambitious attempt to bring adaptive behaviour (not just short term behaviour)
into a memory controller can be found in [25]. This paper considers ‘self-optimizing’
memory controllers which use a Markov decision process based model with a set of
system states and set of possible actions. A transition probability distribution biases the
non-deterministic transition between states based on the existing state and a selected
action. A reward function defines the expected value of a reward received based on the
action selected in each state. The memory controller is viewed as an ‘agent’ who learns
an eﬀective long term policy for mapping states to actions which maximises the rewards
received over an infinite horizon (applying a suitable discount factor to tune behaviour
between short-term and long-term planning behaviours).
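For reference, the infinite-horizon objective described above can be written in conventional reinforcement-learning notation. The symbols are our own choice and [25] may use different ones: $\pi$ is the policy mapping states to actions, $r_{t+1}$ the reward received after the action taken at step $t$, and $\gamma$ the discount factor.

```latex
\[
  V^{\pi}(s) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}
  \;\middle|\; s_{0} = s,\ \pi \right],
  \qquad 0 \le \gamma < 1 .
\]
```

A value of $\gamma$ close to 0 weights immediate rewards most heavily, whereas a value close to 1 favours long-term planning.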
The actions available to the controller in [25] are fine grained memory operations, e.g.
‘issue a precharge command’, ‘issue an activate command’, ‘issue a write command’.
The policy ‘learned’ by the memory controller is constrained by a set of rules (which
correspond to fixed values in the transition probability distribution) that guarantee
legal behaviour, and guide away from foolish policy choices (e.g. the ‘activation’ and
‘precharge’ of a row without any intervening ‘read’ or ‘write’ commands). While these
rules prune off the ‘bad’ actions, some implemented rules, such as a prohibition of the
issuing of ‘activate’ requests for rows in which there is no pending request, prompt the
question of why the controller could not discover that this action yielded poor rewards,
and whether that policy is always detrimental to performance. The paper demonstrates
the controller design in a software simulation, and gives some fair consideration to the
practicalities of meeting timing closure at practical clock speeds. However, it is clear that
the cost of implementing a complex adaptive controller is significant.
In contrast with the adaptive methods in [24], the performance comparison in [25]
shows a much more significant 20% geometric mean improvement over fixed policy
queuing systems and demonstrates consistent linear scaling when the number of parallel
processing cores and number of independent memory channels is increased. Three factors
are likely to explain the success of the approach in [25] over the simpler adaptive behaviour
in [24]. Firstly, the controllers in [25] implement a larger set of state variables than [24].
This allows emergent behaviour to respond to different circumstances with optimized
policies rather than coarser grained adaptiveness, where fewer states means behaviour is
less well tuned. Secondly, the expected reward for an action in [25] is itself adaptive, and
over the course of an execution run, is trained to model real-application performance.
This can adapt to real-world performance interactions which are not easily captured in
a simple scheduling policy model. Thirdly, the controller in [25] can control the
fine-grained issuing of ‘precharge’ and ‘activate’ commands, which allows row precharge and
activation latency to be eﬀectively hidden behind foreground read and write operations.
In the technical work within this thesis, adaption to a target application and fine-grained
control over command issue are leveraged to develop a high-performance
application-specific memory controller, albeit in a strategy which uses static compile-time analysis
rather than dynamic runtime adaption.
The memory controllers introduced in this section have steadily introduced more
complicated and fine-grained mechanisms to optimize their behaviour for the incoming
memory request stream. The assumption made in all this work is that the controller must seek
to optimize its behaviour based on feedback from past history because it has no prior
knowledge of the running application. While that assumption is fair for many general
purpose computers, embedded computation is often specialized for a specific task. In
Section 2.5.2, we introduce research designed to optimize memory performance where
the target application is known in advance.
2.5.2. Memory Controllers Designed using Static Analysis
Designers can specialize and optimize embedded hardware by designing an
application-specific memory hierarchy. One key area in which embedded memory systems often differ
from general purpose computers is in their use of scratchpad memory to augment or replace
cached memory systems. Scratchpad memories are on-chip embedded memories whose
contents are managed explicitly by the end-application, rather than through implicit
hardware mechanisms. They may be used to ensure high performance in critical areas
of code (since they guarantee low-latency on-chip memory access) or enhanced security
(since data will never be flushed out to main memory across an off-chip memory interface
vulnerable to external attack).
Where scratchpad memories have been used within a memory hierarchy, there are
examples of static analysis to determine which specific memory elements are reused. Of
particular note is the work of Darte et al. [26] and Liu et al. [27]. These two works both
explore data-reuse using a polytope model. One develops a mathematical framework to
study the on-chip storage reuse problem, and the other is an application of the technique
within the context of designing a custom memory system implemented on an FPGA,
but without considering the impact that memory transaction re-ordering can have on
memory performance.
The ‘Connected RAM’ (CoRAM) methodology presented in [28] is notable in that, in
common with the approach we suggest in later chapters, it seeks to decouple commu-
nication and computation threads within an application. Their methodology lacks a
compilation framework to extract communication threads from high level descriptions
and optimally schedule those threads to access external memory. The work presented in
this thesis demonstrates techniques which make this possible.
Other static compile time approaches to improving SDRAM eﬃciency can be found in
[29], where different data layouts are used to improve efficiency in an image processing
application. A block-based layout of 2D image data is proposed rather than a traditional
row-major or column-major layout, and a mathematical model formulated using
Presburger arithmetic [30] is used to estimate the number of ‘precharge’ and ‘activate’
operations required in the execution of a video benchmark. Their results show a 70-80%
accuracy compared to simulation results and achieve up to 50% energy savings. In [31], a
strategy is proposed for allocating arrays to diﬀerent memory banks to hide the latency of
row activation. Their heuristic approach assumes each logical row of an allocated array
fits within an SDRAM row, an assumption that is likely to be restrictive in handling large
data-sets. While our proposed methodology does not consider bank allocation directly,
we believe it is complementary to the concept demonstrated in [31], since by reordering
memory accesses to cluster together accesses to the same row, SDRAM rows which are
accessed consecutively can be allocated to diﬀerent banks with a simple permutation of
address bits.
Application-specific strategies for mapping memory address bits to bank, row and
column structures are considered in [32]. The work considers real-time applications spec-
ified using a minimum acceptable bandwidth requirement, a maximum service latency
and fixed request size and tries to define the set of bank-interleaving options and memory
access patterns from [21] which meet the bandwidth requirements and a specified power
budget.
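The effect of permuting address bits between the bank and row fields, as discussed in the two preceding paragraphs, can be illustrated with a short sketch. The field widths below (10 column bits, 3 bank bits, 14 row bits) and all names are hypothetical values chosen for illustration and are not taken from [31] or [32].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical SDRAM geometry, assumed for illustration only. */
#define COL_BITS 10u
#define BANK_BITS 3u
#define ROW_BITS 14u

typedef struct {
    uint32_t bank, row, col;
} sdram_addr;

/* Extract 'width' bits of 'value' starting at bit position 'lsb'. */
static uint32_t field(uint32_t value, unsigned lsb, unsigned width)
{
    assert(width < 32u);
    return (value >> lsb) & ((1u << width) - 1u);
}

/* Mapping 1: bank : row : column, with the bank in the most significant
   bits. Consecutive rows of the address space fall in the same bank. */
sdram_addr map_bank_row_col(uint32_t addr)
{
    sdram_addr a;
    a.col = field(addr, 0, COL_BITS);
    a.row = field(addr, COL_BITS, ROW_BITS);
    a.bank = field(addr, COL_BITS + ROW_BITS, BANK_BITS);
    return a;
}

/* Mapping 2: row : bank : column. The bank and row fields are swapped,
   so consecutive rows of the address space fall in different banks. */
sdram_addr map_row_bank_col(uint32_t addr)
{
    sdram_addr a;
    a.col = field(addr, 0, COL_BITS);
    a.bank = field(addr, COL_BITS, BANK_BITS);
    a.row = field(addr, COL_BITS + BANK_BITS, ROW_BITS);
    return a;
}
```

Under the first mapping, the address one row-length above address 0 lies in row 1 of bank 0, so moving to it requires a ‘precharge’ and ‘activate’ in the same bank. Under the second mapping the same address lies in row 0 of bank 1, so its activation can be overlapped with accesses to bank 0.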
Some work exists which considers the fine grained sequence of SDRAM commands
in a High Level Synthesis approach to hardware design. Dutt, et al. [33] break with
the traditional approach of treating loads and stores as atomic operations in a high level
synthesis system and introduce fine grained SDRAM control nodes into their CDFG-based
synthesis flow. While this enables some performance improvement, a key drawback of
their work is that it is diﬃcult to reason about the number of memory operations and
their sequence from a graph of operations. In this thesis, we address this short-coming
through our use of the Polytope Model. The Polytope Model is used in Chapters 3-5
of this thesis to provide the computation model needed to analyse memory operations.
While we explore the relevant background of the Polytope Model in detail in Chapter 3,
we mention here the two published works which have used the Polytope Model for high
level synthesis and considered memory design.
In [34] and an extended journal version [35], Alias et al. demonstrate a high level
synthesis methodology using the Altera C2H [36] tool as a compiler backend. They
implement all their transformations as source-to-source transformations applied to the
original behavioural code description. Considering a loop body, the authors group the
computations within the loop into tiles for execution on a custom hardware accelerator
implemented in an FPGA. They consider explicit block-transfers of data to the imple-
mented hardware accelerators with an awareness of the need to optimize DDR memory
access order to get good performance. Overlapping of communications and execution
is made possible using software-pipelining methods and explicit synchronization and
on-chip double buﬀering of data is enabled using intelligent on-chip memory mapping
techniques from [26].
The authors demonstrate the performance of their optimized system in [34] with a
simple vector-sum application. Their experiments transfer data from external memory to
on-chip memory in blocks, and show results giving the application speedup with varying
block-size. The results show that increasing the size of the block of data transferred
allows close to linear speed up until the block size is equal to SDRAM row size. This
is consistent with expectations set by Section 2.3.2 which showed that optimal memory
system performance is achieved when the memory system can issue back-to-back requests
to a single SDRAM row. Where the work diﬀers substantially from our own is that it
does not explicitly represent SDRAM structure within the Polytope Model. Instead it
implicitly relies on data-locality created by tiling transformations.
In our later work in Chapter 4 and specifically Section 4.5, we explicitly represent the
SDRAM row and column structure in the Polytope Model and optimize to the specific
parameters (e.g. row size) of our SDRAM device. This enables our tiling approach to
explicitly target the area of best performance observed in the experiments in [34]. It also
enables our approach in Chapter 5 to incorporate knowledge of SDRAM timing param-
eters in determining a legal and eﬃcient overlapping of communications and execution
operations. This means we have an analytical model to explore the timing interaction
between communications and execution, which could only be done in simulation using
the explicit synchronization in [34] and [35].
To the best of our knowledge, our work is the first to propose static analysis of loop
nests for developing optimized hardware address generators for both application-specific
SDRAM-optimized memory reordering and data reuse. It is the first to consider
representing the mapping of memory addresses to SDRAM row and column addresses explicitly
within the Polytope Model and the first to propose the use of integer counting techniques
for ensuring dependencies are met between concurrent threads.
2.6. Scope of Thesis
The preceding literature review charts the development of SDRAM controllers from
simple state-machine based designs with predetermined command sequencing behaviour
to much more complicated out-of-order processors which learn eﬀective behaviour from
system feedback. The area and power implications of implementing such complex con-
trollers are significant, but because of the long-term scaling trends in Section 2.4.2, pin-
constrained devices must increase oﬀ-chip memory bandwidth by eﬃcient command
sequencing. However in embedded devices, where the application is known in advance,
we have the opportunity to analyse the necessary oﬀ-chip memory accesses and design
eﬃcient memory access sequences at compile time. The tractability of static compile time
analysis has improved with increases in computing power that themselves are enabled
by process scaling [7].
In this thesis, we focus on static analysis for application specific memory controllers
motivated by the observation in Rixner et al. [10] that no static scheduling policy was
optimal across all considered benchmarks. We use a high-level model of computation
provided by the Polytope Model to define the data transfer operations. In light of the
impact shown in Section 2.3, we choose to explicitly represent the SDRAM structure in
our computational framework. By bringing these parameters to the forefront during the
compilation process, we enable compile time analysis and transformation that is aware of
the impact of memory request ordering upon application performance. This compile-time
analysis in turn enables more predictable controllers suited to hard real-time applications
common in embedded devices.
This thesis proceeds in Chapter 3 with an in-depth look at the Polytope Model and
examines how it enables eﬀective static analysis. Chapter 4 then uses the concepts to
describe loop transformations which optimize SDRAM behaviour. This is followed in
Chapter 5 with a methodology designed to optimize the overlapping of concurrent
execution and data-transfer operations. This methodology explicitly considers the time taken to
transfer data from SDRAM, taking into account the ‘activate’ and ‘precharge’ commands
needed to change row and the device-specific timing parameters which determine overall
memory performance.
3. Modelling Memory Accesses using the
Polytope Model
In the previous chapter, we showed how the order in which memory operations are pro-
cessed by an SDRAM controller can significantly impact performance. We described how
many SDRAM controllers use dynamic on-chip structures to buﬀer and reorder opera-
tions, and considered several static methods where researchers have used compile time
analysis to reorder code and improve memory performance. These static methods rely on
prior knowledge of what memory operations must be scheduled within an application.
In this chapter we demonstrate one useful way of expressing computation and memory
access in a formal way using a mathematical framework: the Polytope Model [37, 38].
Having given a mathematical description of the Polytope Model consistent with
existing literature, we then provide a short survey of how others have leveraged this
formal framework to synthesize code transformations which improve performance while
preserving program semantics.
A key restriction common to all this work is that it can be applied only to portions of
computer programs which have static control flow; that is, the sequence of operations
in a portion of code is invariant across all possible values of input data. This restriction
makes compile-time analysis feasible. In particular, analysis techniques for counting
the number of iterations within a loop nest can be used to evaluate the impact of code
transformations on performance, without simulation. This enables robust guarantees of
maximum execution time to be made, which are useful when designing systems for
hard real-time applications. We provide a description of existing integer point counting
techniques from theory described in [39, 40] and implemented in [41].
We emphasize to the reader that this chapter is a synthesis of existing literature, and
a prelude to original work described in Chapters 4 and 5. Specifically, we use the loop
transformation techniques introduced in this chapter in Chapter 4 to develop novel hard-
ware synthesis methods which use the Polytope Model to improve SDRAM bandwidth
utilization. Chapter 5 uses the existing techniques for integer point counting described
in this chapter in a novel methodology for safely scheduling memory operations to occur
concurrently with computation.
We begin in Section 3.1 with a description of how the geometry of polytopes
(n-dimensional convex shapes defined by the intersection of half-spaces) can be used to
describe the iterations within a loop nest. In Section 3.2, we describe existing work in
which the Polytope Model has enabled static dependency analysis and loop transforma-
tions to improve performance. In Section 3.3, we present existing techniques for counting
the integer points enclosed within a polytope and in Section 3.4 we show how we can
determine an expression giving the number of points in a parameterised polytope. We
conclude with a discussion in Section 3.5 about how we use the techniques presented in
this chapter for the novel technical contributions in Chapter 4 and Chapter 5.
3.1. Introducing Polytopes for Program Analysis
Imperative programs are formed of a sequence of statements. Within each statement, an
assignment may alter some program state and the sequential composition of a sequence
of such statements is used to describe the programmer's overall intent. It is not safe to
swap the order of statements within a program without checking whether this alters the
dataflow within the sequence of assignments. We illustrate this using the trivial example
in Figure 3.1(a). We define a function f() containing three statements. Any change in the
statement order gives an undefined result in variable v and alters the return value of the
function. We can assign to each statement a unique numerical label (shown as 0, 1 and 2)
describing their position in the basic block.
int t, u, v;
int f() {
0:  t = 1;
1:  u = t + 1;




int Arr[4];
int t, u, v;
void g(int t, int u, int v) {
0:  Arr[1] = u + v;
1:  Arr[2] = t + v;
2:  Arr[3] = t * v;
}
(b) Free Schedule
Figure 3.1.: Simple code example showing enumeration of statements within a single
basic block.
Using the example in Figure 3.1(a), we demonstrate that the control behaviour of a
program can be described by:
Enumeration Assigning a unique identifier to each statement.
Scheduling Defining an order in which to execute the enumerated statements.
In this trivial example, each statement in the basic block inside function f() in Fig-
ure 3.1(a) is assigned a unique integer identifier s ∈ Z. The set of statements in the basic
block is a set S where S = { s | s ∈ Z , s ≥ 0 ∧ s ≤ 2 }. A scheduling function σ(s) maps each
enumerated statement to an integer value, the value of which indicates a partial ordering
of the set S. In Figure 3.1(a), the only valid sequence for execution can be described by the
scheduling function
\[ \sigma(s) : \mathbb{Z} \to \mathbb{Z}, \quad s \mapsto s. \]
As a comparison, Figure 3.1(b) gives an example where any scheduling function gives
a reordering of the statements in function g(. . .) which preserves the original semantics
of the program. There is no requirement for the scheduling function to be injective¹.
Functions which are not injective can be used to denote operations which can execute
safely and concurrently on parallel processing elements.
¹ An injective function is a function which preserves distinctness, i.e. ∀α, β : f(α) = f(β) ⟹ α = β
int t;
for (x = 0; x <= 3; x++) {
0:  t += 1;
}
Figure 3.2.: Simple Loop.
When code executes within a loop, each loop is associated with a single induction
variable which is updated with each iteration of the loop. We extend our enumeration scheme
to identify each individual statement with a unique vector which is the concatenation of
the enclosing loop indices and a unique statement reference within a basic block. We may
describe this set as an iteration space.
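Example 3.1.1. Figure 3.2 gives code describing a loop with a loop iterator x and a loop
body containing a single statement. The set of executed statements is
\[ S = \left\{ \begin{pmatrix} s \\ x \end{pmatrix} \;\middle|\; x \in \mathbb{Z},\ s \in \mathbb{Z},\ s = 0 \wedge x \geq 0 \wedge x \leq 3 \right\}. \]
In this example, any injective scheduling
function $\sigma(s, x) : \mathbb{Z}^2 \to \mathbb{Z}$ may be used to schedule the statements while preserving the
intended program semantics.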
We can enumerate the iterations of a nested loop using a multi-dimensional identifier. A
nested loop with n levels requires an n-dimensional vector to enumerate iterations within
the loop.
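Example 3.1.2. Figure 3.3 gives code describing a three-level nested loop containing a
single statement. The set of executed statements is
\[ S = \left\{ \begin{pmatrix} s \\ x \end{pmatrix} \;\middle|\; x \in \mathbb{Z}^3,\ s \in \mathbb{Z},\ x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \wedge s = 0 \wedge 0 \leq x_1 \leq 1 \wedge 0 \leq x_2 \leq 1 \wedge 0 \leq x_3 \leq 1 \right\} \tag{3.1} \]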
The examples above demonstrate that loop nests can be formally described by an integer
set (iteration space), where the elements of the set are implicitly defined by inequality
int t;
for (x1 = 0; x1 <= 1; x1++) {
  for (x2 = 0; x2 <= 1; x2++) {
    for (x3 = 0; x3 <= 1; x3++) {
0:    t += 1;
    }
  }
}
Figure 3.3.: Three Level Loop Nest.
int t;
for (x1 = 0; x1 <= 3; x1++) {
  for (x2 = 0; x2 <= x1; x2++) {
0:  t += 1;
  }
}
Figure 3.4.: Non-Rectangular Loop Nest.
bounds. If the iteration space is bounded, and the bounds can all be represented as affine
functions of the loop variables, the convex hull of the iteration space is a polytope, hence
the loop-nest may be represented in the ‘Polytope Model’. Loop nests represented in
the ‘Polytope Model’ may be non-rectangular, as the loop bounds may depend upon the
value of the iterators of enclosing loops.
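Example 3.1.3. Figure 3.4 gives code describing a two-level nested loop containing a
single statement. The set of executed statements is
\[ S = \left\{ \begin{pmatrix} s \\ x \end{pmatrix} \;\middle|\; x \in \mathbb{Z}^2,\ s \in \mathbb{Z},\ x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \wedge s = 0 \wedge 0 \leq x_1 \leq 3 \wedge 0 \leq x_2 \leq x_1 \right\} \tag{3.2} \]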
int t;
int A[21];
for (x1 = 0; x1 <= 3; x1++) {
  for (x2 = 0; x2 <= x1; x2++) {
0:  A[4 * x1 + x2 + 5] = A[5 * x1] + 1;
  }
}
Figure 3.5.: Non-Rectangular Loop Nest with Array Access.
Each statement within the loop nest may contain memory access(es) to indexed array(s).
The memory accesses within a statement may imply a dependence between diﬀerent
iterations of the enclosing loop nest. This restricts the ability to reschedule the loop
iterations, as altering the order of a read and a write operation, or two write operations
will alter the program semantics. The ‘Polytope Model’ restricts the loop bounds and array
indices to affine functions of the enclosing loop indices. This restriction enables data flow analysis using
integer linear programming techniques which can exactly determine the dependencies
which constrain statement schedules.
Example 3.1.4. Figure 3.5 gives code describing a two-level nested loop containing a
single statement. The statement stores a value in the array A, with an index determined
by the affine expression $4x_1 + x_2 + 5$. The set of executed statements is
identical to that described in (3.2) of Example 3.1.3. The memory referencing functions
imply a loop-carried dependency which constrains the order in which the statements may
be executed. One suitable scheduling function is $\sigma(s, x_1, x_2) = x_1$, which preserves correct
data-flow but executes all the inner loop operations in parallel. An illustration of this
execution schedule is shown in Figure 3.6.
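The legality of a candidate schedule for the loop nest in Figure 3.5 can be checked by brute force over the small iteration space. The sketch below is our own illustration and is not the integer linear programming approach referred to above. It tests every ordered pair of iterations that touch a common array element with at least one write, and requires the schedule to keep them in their original order. All function names are hypothetical.

```c
#include <stdbool.h>

/* Array index functions of the statement in Figure 3.5. */
static int write_index(int x1, int x2) { return 4 * x1 + x2 + 5; }
static int read_index(int x1) { return 5 * x1; }

typedef int (*schedule_fn)(int x1, int x2);

/* sigma = x1: all inner-loop iterations with the same x1 run in parallel. */
int sched_outer(int x1, int x2) { (void)x2; return x1; }

/* sigma = x2: a counter-example that ignores the outer loop order. */
int sched_inner(int x1, int x2) { (void)x1; return x2; }

/* True if two iterations access a common element and at least one of the
   two accesses is a write (write/write, write/read or read/write). */
static bool conflicts(int a1, int a2, int b1, int b2)
{
    return write_index(a1, a2) == write_index(b1, b2)
        || write_index(a1, a2) == read_index(b1)
        || read_index(a1) == write_index(b1, b2);
}

/* A schedule is legal if every conflicting pair (a, b), with a before b
   in the original lexicographic order, satisfies sigma(a) < sigma(b). */
bool schedule_is_legal(schedule_fn sigma)
{
    for (int a1 = 0; a1 <= 3; a1++)
        for (int a2 = 0; a2 <= a1; a2++)
            for (int b1 = 0; b1 <= 3; b1++)
                for (int b2 = 0; b2 <= b1; b2++) {
                    bool a_first = (a1 < b1) || (a1 == b1 && a2 < b2);
                    if (a_first && conflicts(a1, a2, b1, b2)
                        && !(sigma(a1, a2) < sigma(b1, b2)))
                        return false;
                }
    return true;
}
```

With these definitions schedule_is_legal(sched_outer) holds, because every conflicting pair involves two different values of x1. In contrast, schedule_is_legal(sched_inner) fails: iteration (0, 0) writes A[5] and iteration (1, 0) reads it, yet both would be given time step 0.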
Figure 3.6.: Parallel Schedule for Loop Nest shown in Figure 3.5.
Together the examples in this section have demonstrated how computation may be
described using:
• Integer sets defined by the conjunction of linear inequalities to define loop iterations.
• An aﬃne scheduling function to define a partial order for those loop iterations.
• A set of aﬃne mapping functions to define read and write operations to memory
arrays.
In the sections that follow, we describe how programs built using these concepts are
useful in embedded and scientific computing applications, and how such programs can be
analysed and transformed to deliver high performance.
3.2. Static Control Programs
In the preceding section, we have defined how three key aspects of computation (state-
ments, schedule and storage) can be expressed formally using linear algebra. Specifically,
integer sets indicating the statements within a computation are defined by the integer
points contained within polytopes. These polytopes are formed from the conjunction
of linear inequality constraints. Linear mapping functions describe how statements are
scheduled and where statements should fetch or store data.
The expression of computation using this form has origins in the work of Karp, Miller
and Winograd [42] and their work on expressing computation and synthesizing systems
from uniform recurrence relationships. The subclass of programs which can be expressed
with iteration spaces formed from (a union of) integer sets bounded by aﬃne constraints
and statements scheduled using aﬃne maps is commonly referred to as static control
programs [43, 44, 45].
Programs that can be expressed in this restricted form can be analyzed using well-
studied linear algebra techniques [46, 47, 48] that enable exact dependence analysis and
easy code transformation.
Restriction to the class of static control programs specializes the applications of polytope
compiler analyses to certain well suited application domains. Embedded multimedia
applications, front-end telecommunications applications and network components are
identified by Palkovic [49, 50] as ideally suited to expression as static control programs.
In such applications, periodic real-time deadlines are coupled with demanding data-
intensive computation, deep loop nests with many iterations often index large multi-
dimensional arrays and computation is dominated by regular repeated operations on
statically allocated data structures. Here, the performance imperative arises from the
need to reduce manufacturing cost by making eﬃcient use of limited processing power,
limited memory storage and limited energy storage capacity.
A diﬀerent set of constraints arises in scientific computing applications. In this domain,
there are many structural similarities to embedded multimedia applications, but
performance and power consumption are valued because they enable scaling to tackle large
computational problems, rather than because periodic real-time deadlines must be met
with a fixed problem size.
In the subsections that follow, we introduce methods for analysing dependencies
in static control programs, techniques for code transformation which can be used to
improve computing performance, and methods for generating code from the transformed
representation.
for (i = 1; i <= 10; i++) {
    a[i] = a[i + 10] + 3;
}
(a) No Data Dependencies
for (i = 1; i < 10; i++) {
    a[i + 1] = a[i] + 3;
}
(b) Loop Carried Dependency
Figure 3.7.: Two Data Dependence Examples. (a) has no dependencies and all statements
may execute in parallel. (b) has a loop carried dependency which forces
sequential execution of all statements.
3.2.1. Analysis of Data Dependencies in Static Control Programs
The set of integer points enclosed within a polytope provides a useful abstraction for
compiler design because geometric transformations of sets can be formally described using
linear algebra. The power of the polytope model lies in the ability to evaluate the legality
of a change of statement ordering. Within a basic block this may be straightforward,
but it is necessary to determine dependencies which are ‘loop-carried’ so that we can
check whether loop transformations which alter the ordering of basic blocks preserve the
intended program dataflow semantics.
To illustrate the problem of loop carried dependencies and how they inhibit the
reordering of loops, consider the example from [51] shown in Figure 3.7. In every iteration
of the loops shown in Figure 3.7(a) and Figure 3.7(b), a value is written to memory and
read from memory. However, in Figure 3.7(a), all the iterations may be executed at once,
whereas in Figure 3.7(b), a sequential execution order must be specified to preserve the
semantic intent within the program.
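The distinction between the two loops of Figure 3.7 can be checked by brute force for
small bounds. The following C sketch is illustrative only and is not one of the dependence
tests of [51] or [52]; the function name count_carried_deps and its parameters are
hypothetical. It enumerates every pair of distinct iterations of a loop of the form
a[i + w] = a[i + r] + 3 and counts the pairs in which one iteration writes the element
that the other reads.

```c
/* Hypothetical helper (not from the thesis): models the loop
 *     for (i = lo; i <= hi; i++) a[i + w] = a[i + r] + 3;
 * and counts pairs of distinct iterations (i1, i2) in which iteration i1
 * writes the array element that iteration i2 reads. */
int count_carried_deps(int lo, int hi, int w, int r)
{
    int count = 0;
    for (int i1 = lo; i1 <= hi; i1++) {
        for (int i2 = lo; i2 <= hi; i2++) {
            /* same element touched by two different iterations */
            if (i1 != i2 && i1 + w == i2 + r) {
                count++;
            }
        }
    }
    return count;
}
```

Under these assumptions, count_carried_deps(1, 10, 0, 10) models Figure 3.7(a) and finds
no such pair, while count_carried_deps(1, 9, 1, 0) models Figure 3.7(b) and finds eight,
one for each pair of consecutive iterations. Exhaustive enumeration of this kind is only
practical for tiny loops, which is why the analytical tests discussed next are needed.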
The general problem of proving dependence is equivalent to integer linear programming,
and is therefore NP-complete. Early work on data dependence analysis concentrated on
proving independence of loop iterations [52] and conservatively assumed that a
dependence prevented reordering if independence could not be proven.
A more ambitious attempt to provide exact dependence analysis is found in [51], which
specifies a hierarchy of algorithmic tests which can determine exact loop dependencies
for all the practical cases found in a common benchmark suite [53].
However, the consensus was that for generalized polytope loops, exact dependence
analysis using integer linear programming techniques was too expensive to use,
except as a method of last resort. In [54], Pugh challenged this wisdom and provided
evidence that, while the Fourier-Motzkin variable elimination technique upon which his
work (the Omega Test) was based has worst-case exponential time complexity, for most
practical situations dependence could be decided with low-order polynomial complexity.
The Omega Test [54] was developed to check whether a set of linear constraints (both
equalities and inequalities) has an integer solution. Problems were formulated as symbolic
Presburger formulae [30, 55]. Presburger formulae are those logical statements which can
be built up out of linear constraints over integer variables, the logical connectives (and,
or and logical negation) and existential and universal quantifiers. Presburger arithmetic
is decidable [30, 55]; that is, for any logical formula expressed in Presburger arithmetic,
an algorithm can determine whether that formula is true or false. The Omega Tool
formulates the problem of determining whether a loop dependency exists in code as a
Presburger formula and then applies a specialized form of Fourier-Motzkin elimination [46]
to remove the quantifiers one by one. The stumbling point in this work is that, while
quantifiers can always be eliminated from a Presburger formula, the procedure may require
splintering the polytope into many disjunctive terms to prove whether a dependence exists.
As each variable is eliminated, this may lead to worst-case exponential growth in the
number of disjunctive terms as the depth of the nested loop increases. This remains a
challenge to the scalability of analysis and code generation algorithms.
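As an illustration (our own, not taken from [54]), the question of whether the loop in
Figure 3.7(b) carries a dependence from a write in iteration i to a read in a later
iteration i' can be written as a Presburger formula, assuming the loop bounds shown in
that figure:

```latex
\[
\exists\, i, i' \in \mathbb{Z} :\quad
1 \le i < 10 \;\wedge\; 1 \le i' < 10 \;\wedge\; i < i' \;\wedge\; i + 1 = i'
\]
```

The equality allows i' to be eliminated by substitution, leaving 1 ≤ i < 9, which has
integer solutions, so the dependence exists. The corresponding formula for Figure 3.7(a),
with 1 ≤ i ≤ 10, 1 ≤ i' ≤ 10 and i = i' + 10, has no solution because i' ≥ 1 forces
i ≥ 11.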
3.2.2. Transformation of Polytopes in Static Control Programs
Loop transformations are operations applied to the statements, schedule or storage ac-
cess functions which preserve the original program semantics. Transformations have
been shown to be especially useful in the context of exposing parallelism and improv-
ing memory locality when computing using microprocessors. However, they have also
found application in synthesizing systolic arrays [56] and are essential to achieving good
performance when exploiting the parallel data-processing engines in GPUs [57].
We can define a class of “Loop Reordering” transformations which can be achieved
solely through manipulation of the affine scheduling function associated with a program
but may be represented as a combination of transformations of both the affine
scheduling function and iteration space. These transformations include Loop Skewing, Loop
Permutation and Loop Reversal.
Loop Skewing [58, 59] is a mechanism whereby the bounds of a given loop are adjusted
to depend on an outer loop. The transformation preserves the number of iterations
performed by the program, but will change the order in which those iterations are performed.
The technique has been exploited to improve parallelism within a program. While
loop-carried dependencies between consecutive loop iterations in the inner loop of a program
would normally prevent them from being executed in parallel, loop skewing techniques
have been demonstrated to enable parallel code execution in microprocessors [60], in
distributed multiprocessor machines [61, 62] and in specialized vector multiprocessors [63].
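The effect of skewing can be sketched on a small stencil. The example below is our own
illustration rather than code from the cited works; the array size N and the function
names are hypothetical. In the original nest both loops carry a dependence. After skewing
the inner loop by the outer index and interchanging the loops, the outer loop walks the
wavefronts t = i + j, and all iterations on one wavefront are independent of each other.

```c
#define N 6  /* illustrative array size */

/* Original nest: a[i][j] depends on a[i-1][j] and a[i][j-1], so both the
 * i loop and the j loop carry a dependence. */
void stencil_original(int a[N][N])
{
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Skewed and interchanged nest: the outer loop enumerates t = i + j.
 * Both values read in wavefront t were produced in wavefront t - 1, so the
 * iterations of the inner loop could be executed in parallel. */
void stencil_skewed(int a[N][N])
{
    for (int t = 2; t <= 2 * (N - 1); t++) {
        int lo = (t - (N - 1) > 1) ? t - (N - 1) : 1;   /* keeps j <= N-1 */
        int hi = (t - 1 < N - 1) ? t - 1 : N - 1;       /* keeps j >= 1   */
        for (int i = lo; i <= hi; i++) {
            int j = t - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }
}
```

Both functions visit the same set of (i, j) iterations and produce the same array
contents; only the order of the iterations differs.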
Loop Interchange [64], Loop Permutation [52] and Loop Reversal have typically been
exploited to improve performance by improving data locality [65]. Where two iterations
access common data, reduction in the number of instructions interleaved between the two
iterations increases the likelihood that data fetched during execution of the first iteration
is still present in the cache when the second data access instruction is executed. This
increases the cache hit rate, improving the average memory performance of the system.
In many circumstances, further performance improvements may be realised when these
“Loop Reordering” transformations are combined with “Loop Tiling” transformations.
Loop Tiling [66, 67, 68, 69] splits loops to increase the space of possible scheduling
transformations. When combined with suitable Loop Interchange, Loop Tiling can improve
data locality and increase the performance within a microprocessor [43, 59, 70, 71].
Loop Tiling may also be used to partition computation into blocks suitable for
concurrent computation on parallel processors [72]. Loop Tiling has been used both to improve
data locality within the processors (which each hold only a subset of the data necessary for
computing) and to reduce the cost incurred when data is transferred between processors in a
multi-processor system [62, 73].
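A minimal sketch of Loop Tiling combined with Loop Interchange is given below for a matrix
transpose. It is our own illustration and not taken from the cited works; the matrix
dimensions, the tile size TILE and the function names are hypothetical. Each loop is split
into a tile loop and an intra-tile loop, and the tile loops are moved outermost, so that
the accesses to both arrays stay within a small block before moving on.

```c
#define ROWS 8
#define COLS 8
#define TILE 4  /* hypothetical tile size chosen for illustration */

/* Untiled reference: row-major reads, column-major writes. */
void transpose_naive(int in[ROWS][COLS], int out[COLS][ROWS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            out[j][i] = in[i][j];
}

/* Tiled version: ii and jj select a TILE x TILE block, i and j scan it.
 * The same iterations are executed, in a different order. */
void transpose_tiled(int in[ROWS][COLS], int out[COLS][ROWS])
{
    for (int ii = 0; ii < ROWS; ii += TILE)
        for (int jj = 0; jj < COLS; jj += TILE)
            for (int i = ii; i < ii + TILE && i < ROWS; i++)
                for (int j = jj; j < jj + TILE && j < COLS; j++)
                    out[j][i] = in[i][j];
}
```

The transformation is legal here because no iteration depends on another; for loops with
dependencies, the legality of the new order would have to be established first.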
Loop Unrolling without other transformation does not change the order of
computation. But if we consider the affine scheduling problem to schedule both basic blocks
and the operations within the blocks (as in Example 3.1.1) then we may consider
Software Pipelining techniques as Loop Skewing and Loop Interchange transformations on
the combined schedule. Software Pipelining exposes instruction-level parallelism which
may be exploited by a dynamic-reordering processor.
The first works to consider how the loop transformation approaches might be unified
in a formal framework restricted their attention to unimodular transformations [74] of the
iteration space. Unimodular matrices provide a one-to-one mapping between iterations
in the original and transformed iteration spaces. A theory addressing more general
transformations emerged later, alongside tools designed to construct optimized schedules
using the freedom provided by more expressive transforms. It is this work, encompassing
all the concepts introduced in Section 3.1, which became known as the Polytope Model.
Many of the loop transformations identified here are designed to exploit parallelism or
improve data locality in caches to improve program performance. Memory subsystems
in embedded devices often include on-chip scratchpad memory. Rather than rely on
implicit movement of data from external memory into on-chip caches, scratchpad memory
requires explicit control to fetch data into on-chip memory and write it back to external
memory. This provides an opportunity to optimize memory usage by a space-efficient
remapping scheme which considers both the spatial properties of memory addresses
requested in a target program and the temporal schedule of statements to share on-chip
memory locations between many off-chip storage locations. In [75], a formal definition
of the conditions which must be met for a legal remapping of memory locations is
discussed and a mapping solution using multi-dimensional modulo-mappings is proposed.
3.2.3. Code Generation from Polytopes
Having applied a transformation to a polytope, it is necessary to reconstruct a nested loop
from the transformed representation for implementation in a backend flow (e.g. software
or hardware compilation). The problem of code generation from the polytope model was
first considered by Irigoin and Ancourt [76], who proposed a method based on
Fourier-Motzkin elimination which enabled the construction of a new nested loop representation
of a transformed polytope. A method for efficiently scanning unions of polytopes is
presented in [77], with a variant of the algorithm they developed implemented in the
polyhedral code generation tool, CLooG [43].
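The idea of reconstructing loop bounds by projection can be sketched on a small example.
This is our own illustration, not output from CLooG or the method of [76]; the polytope
and the function names are hypothetical. Take P = {(i, j) | 0 ≤ j ≤ i ≤ n}. Scanning with
i outermost gives 0 ≤ i ≤ n and 0 ≤ j ≤ i. To scan with j outermost, i is eliminated from
the constraints: combining j ≤ i with i ≤ n gives j ≤ n, so the outer loop runs over
0 ≤ j ≤ n and the inner loop over j ≤ i ≤ n.

```c
/* Scan P = { (i, j) | 0 <= j <= i <= n } with i as the outer loop.
 * Visited points are stored in order in pts; the count is returned. */
int scan_original(int n, int pts[][2])
{
    int c = 0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= i; j++) {
            pts[c][0] = i;
            pts[c][1] = j;
            c++;
        }
    return c;
}

/* Scan the same polytope with j as the outer loop. The bounds come from
 * eliminating i: the outer loop covers 0 <= j <= n and the inner loop
 * covers j <= i <= n. */
int scan_interchanged(int n, int pts[][2])
{
    int c = 0;
    for (int j = 0; j <= n; j++)
        for (int i = j; i <= n; i++) {
            pts[c][0] = i;
            pts[c][1] = j;
            c++;
        }
    return c;
}
```

Both functions enumerate exactly the same integer points; only the order in which they are
visited changes.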
3.3. Counting Integer Points in Polytopes
Through the specification of an iteration space describing a set of statements (which may
contain memory accesses, indexed by affine functions of the loop variables), and a valid
scheduling function, we have the mathematical description sufficient to reconstruct the
intent of the original designer. However, we may want to prove non-functional properties
hold for the generated code. For example, we may wish to bound the execution time for a
loop nest to guarantee we meet periodic deadlines, or determine the quantity of memory
accessed during execution. For this we may use integer point counting algorithms.
A polytope encloses a discrete number of integer points. For many applications (e.g.
proving the legality of a transformation), it is useful to determine the properties of a
polytope (e.g. emptiness) or count the number of integer points inside a polytope. Integer
point counting techniques which build upon theory attributable to Ehrhart [78] provide a
powerful mechanism for reasoning about polytope properties. Some material is presented
here, based upon [79] and [41], to aid the reader in understanding its practical application
in Chapter 5.
3.3.1. General Approach
Approaches to counting the integer points contained within a polytope are based upon
encoding the integer points inside a polytope as multivariate generating functions. In
(3.3), we show the general form of a generating function that encodes the integer points
within a polytope P. These points are a subset of d-dimensional space. Each integer point
is represented by a single monomial in the summation.
\[ f(P; z) = \sum_{a \in \mathbb{Z}^d} g(a)\, z^a \quad \text{where } z^a = z_1^{a_1} z_2^{a_2} \cdots z_d^{a_d} \tag{3.3} \]
if g(a) is an indicator function such that:
\[ g(a) = \begin{cases} 1 & \text{if } a \in P, \\ 0 & \text{otherwise} \end{cases} \]
Then we can determine the number of points in the polytope by evaluation of (3.3)
after substituting z = 1, where 1 is the vector of length d such that 1 = (1, 1, . . . , 1).
Using this substitution, each integer point inside the polytope contributes exactly one to
the summation. Example 3.3.1 shows the representation of a simple interval in one dimension.
Example 3.3.1. Let P be a subset of the integers, P ⊂ Z, such that P contains exactly
the elements in the range [0 : 3]. We illustrate this as the points on the number line in
Figure 3.8.
Figure 3.8.: Number line representing P = [0 : 3] ∩ Z.
The set can be encoded in the generating function f(P; z) as in (3.4)
\[ f(P; z) = 1 + z + z^2 + z^3 \tag{3.4} \]
If we substitute z = 1 into the generating function, we can ‘count’ the number of points
in the interval.
\[ f(P; 1) = 1 + 1 + 1 + 1 = 4 \]
Using the general form in (3.3), we can represent transformations of the polytope using
algebraic manipulation of the monomial terms.
For example, multiplying each term in (3.4) by $z^4$ gives:
\[ z^4 f(P; z) = z^4 + z^5 + z^6 + z^7 \tag{3.5} \]
which represents shifting the interval by four places, to [4 : 7].
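Disjoint unions of two sets C1(z) and C2(z) may be obtained by summing the monomials
in both sets. An intersection of two sets, C1(z) and C2(z), may be calculated by the
Hadamard product of their generating functions. Writing the two sets as
\[ C_1(z) = \sum_{a} \alpha_a z^a, \qquad C_2(z) = \sum_{a} \beta_a z^a \]
Then the intersection D(z) of the two sets is given in (3.6).
\[ D(z) = C_1(z) \star C_2(z) = \sum_{a} \alpha_a \beta_a z^a \tag{3.6} \]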
It is inefficient to perform these set operations on the individual monomials which
represent an integer set. A more compact representation of the terms, and efficient
algorithms for manipulating them, can be found by considering rational expressions whose
expansion gives the desired monomial sequence. As an example, the sum of a geometric
series is the summation of an infinite series of monomials, yet it has the compact rational
expression given in (3.7).
\[ \frac{1}{1 - z} = 1 + z + z^2 + z^3 + \ldots \tag{3.7} \]
We can use rational functions to find a compact representation of the integer points
within a polytope. In Example 3.3.2, we show how the bounded set of integer points in
the interval [0 : 3] can be represented using the combination of two unbounded power
series.
Example 3.3.2. A compact representation for the sum of a geometric series is given in
(3.7). If we multiply that representation by $z^4$, we shift the representation, giving
\[ \frac{z^4}{1 - z} = z^4 + z^5 + z^6 + z^7 + \ldots \tag{3.8} \]
Subtracting (3.8) from (3.7) gives
\[ \overbrace{\frac{1}{1 - z}}^{K_1} - \overbrace{\frac{z^4}{1 - z}}^{K_2} = 1 + z + z^2 + z^3 \tag{3.9} \]
The discrete points in our bounded interval are therefore represented by the difference
of two infinite series summations as in (3.9). These two series summations (K1 and K2)
are illustrated in Figure 3.9.
Figure 3.9.: Number line representing P = [0 : 3] ∩ Z and a decomposition into two series.
In this way, we can show that the terms of a generating function can be represented
using short rational functions. This allows us to efficiently perform set operations on
integer sets using algebra. Given two disjoint sets, the addition of their short rational
functions yields a representation of the union of the elements of the two sets. Calculating
the intersection of two sets is performed by calculating the Hadamard Product of the short
rational generating functions representing each set. An efficient algorithm for computing
the Hadamard Product of two sets using monomial substitutions is presented in [39].
Together these algorithms allow other set operations such as set difference and the union
of non-disjoint sets to be calculated without full enumeration of the set elements.
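The meaning of these set operations can be made concrete with an explicit coefficient
array for a one-dimensional set. The sketch below is our own illustration and deliberately
uses the full enumeration that short rational functions avoid; the bound MAX_DEG and all
function names are hypothetical. Entry k of an array holds the coefficient of $z^k$.

```c
#define MAX_DEG 16  /* illustrative bound on the exponents represented */

/* Encode the interval [lo : hi]: c[k] = 1 if k is in the set, else 0. */
void gf_interval(int c[MAX_DEG], int lo, int hi)
{
    for (int k = 0; k < MAX_DEG; k++)
        c[k] = (k >= lo && k <= hi) ? 1 : 0;
}

/* Multiply by z^s, which shifts every point of the set by s.
 * Points shifted beyond MAX_DEG - 1 are dropped in this sketch. */
void gf_shift(const int a[MAX_DEG], int s, int out[MAX_DEG])
{
    for (int k = 0; k < MAX_DEG; k++)
        out[k] = (k - s >= 0 && k - s < MAX_DEG) ? a[k - s] : 0;
}

/* Sum of generating functions: the union, provided the sets are disjoint. */
void gf_add(const int a[MAX_DEG], const int b[MAX_DEG], int out[MAX_DEG])
{
    for (int k = 0; k < MAX_DEG; k++)
        out[k] = a[k] + b[k];
}

/* Hadamard product: coefficient-wise product, giving the intersection. */
void gf_hadamard(const int a[MAX_DEG], const int b[MAX_DEG], int out[MAX_DEG])
{
    for (int k = 0; k < MAX_DEG; k++)
        out[k] = a[k] * b[k];
}

/* Evaluate at z = 1: the sum of the coefficients is the number of points. */
int gf_count(const int a[MAX_DEG])
{
    int n = 0;
    for (int k = 0; k < MAX_DEG; k++)
        n += a[k];
    return n;
}
```

For example, with A = [0 : 3] and B = [2 : 5], the Hadamard product of the two arrays
counts two points (2 and 3), and adding A to A shifted by four counts eight points.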
In Example 3.3.3, we illustrate that a general approach to forming rational generating
functions may be found by considering the supporting cones of a polytope at each polytope
vertex.
Example 3.3.3. Taking Example 3.3.2, and specifically (3.9), and multiplying the
denominator and numerator of the second term by $-z^{-1}$, we get the expression shown in
(3.10) and illustrated in Figure 3.10. This shows supporting cones centered at each of the
two vertices, (0) and (3), of our interval. The sum of the generating functions of the two
supporting cones gives a generating function enumerating the points within the interval.
\[ \frac{1}{1 - z} - \frac{z^4}{1 - z} = \overbrace{\frac{1}{1 - z}}^{K_3} + \overbrace{\frac{z^3}{1 - z^{-1}}}^{K_4} = 1 + z + z^2 + z^3 \tag{3.10} \]
Figure 3.10.: Representation showing supporting cones on number line.
Example 3.3.3 is a practical use of Brion’s Theorem [80, 81, 82]. Brion’s Theorem states
that the rational function representation of a polytope, f(P; z), may be computed by the
summation of the rational representations of the tangent cones (or following [81], the
‘forward cones’) at each of the polytope vertices.
Example 3.3.4 shows the tangent cones at each vertex of a triangle. The supporting cones
at each of these vertices are shown in Figures 3.11(a)-(c) respectively. In each figure, the
integer points enclosed by the cone are shown in red.
Fundamental to the principle of counting the integer points enclosed by a polytope is
the decomposition of that polytope into supporting cones. In the section that follows,
we show, using an approach from [83], how the integer points within a cone can be
enumerated.
(c) Cone K3 at v3
Figure 3.11.: Representation of two-dimensional supporting cones.
3.3.2. Counting Integer Points in Cones
A cone (which forms a subset of $\mathbb{R}^d$) is defined as the non-negative combinations of a set
of d-vectors (generators). A cone K, with generators $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{Z}^d$, is a set of the form
\[ K = \left\{ x \;\middle|\; x = \sum_{i=1}^{k} \lambda_i \mu_i,\ \lambda_i \in \mathbb{R},\ \lambda_i \ge 0 \right\} \]
From this formal definition of a cone, we may define a further subset of the cone, the
fundamental half-open parallelepiped ($\Pi$), as in (3.11).
\[ \Pi = \left\{ \lambda_1 \mu_1 + \lambda_2 \mu_2 + \ldots + \lambda_d \mu_d \;\middle|\; 0 \le \lambda_i < 1,\ i = 1 \ldots d \right\} \tag{3.11} \]
Example 3.3.5 shows a two-dimensional example of a cone with two generators and
highlights the fundamental half-open parallelepiped within the cone.
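Example 3.3.5. Let K be a cone in $\mathbb{R}^2$ centered at the point v = (3, 3). The cone has
generators µ1 = (−1,−1) and µ2 = (0,−1). The cone is shown in Figure 3.12(a). The set of
integer points within the cone is defined as in (3.12)
\[ K \cap \mathbb{Z}^d = \left\{ \begin{pmatrix} 3 \\ 3 \end{pmatrix} + \lambda_1 \begin{pmatrix} -1 \\ -1 \end{pmatrix} + \lambda_2 \begin{pmatrix} 0 \\ -1 \end{pmatrix} \;\middle|\; \lambda_i \in \mathbb{Z} \wedge \lambda_i \ge 0,\ i = 1 \ldots 2 \right\} \tag{3.12} \]
The fundamental parallelepiped of the cone is shown in Figure 3.12(b) and contains a
single integer point at (3,3).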
Figure 3.12.: Representation of (a) two-dimensional cone and (b) Fundamental
Parallelepiped within that cone.
It is shown in [83] that the fundamental parallelepiped can be tiled infinitely across the
cone, with each tile shifted by some non-negative integer combination of the cone generators.
The tiling covers the cone and neither the tiles nor the boundaries of the tiles overlap.
If there are k generator vectors (µ1, µ2, . . . , µk) then this tiling may be represented by a
product of geometric series, one for each generator. For two generators this gives (3.13).
\[ \frac{1}{(1 - z^{\mu_1})(1 - z^{\mu_2})} = (1 + z^{\mu_1} + z^{2\mu_1} + z^{3\mu_1} + \ldots)(1 + z^{\mu_2} + z^{2\mu_2} + z^{3\mu_2} + \ldots) \tag{3.13} \]
Therefore, if we can find an expression for the integer points within the fundamental
parallelepiped, we can tile the representation to represent the integer points inside the
cone. The problem of determining how many integer points lie within the fundamental
parallelepiped is simple when the cone is unimodular. Cones which are unimodular have
generators which form a basis of the unit lattice, i.e. they can be formed using column
operations [47] on the identity matrix. These cones are guaranteed to have only a single
point in their fundamental parallelepiped (for a proof, see [83]). Therefore a unimodular
cone K at an integer vertex v, with generators µ1, . . . , µd, can be represented as in (3.14).
\[ f(K; z) = \frac{z^v}{(1 - z^{\mu_1})(1 - z^{\mu_2}) \cdots (1 - z^{\mu_d})} \tag{3.14} \]
A very significant contribution to the field was a proof in [39] that non-unimodular cones
can be decomposed into a sum of unimodular cones in polynomial time (for fixed
dimension). This means that the generating function for any convex integral polytope P
(an equivalent derivation for rational polytopes is given in [41]) may be represented
as (3.15), where V is the set of polytope vertices, $K_v$ is the set of unimodular cones at
vertex v obtained using Barvinok’s decomposition technique [39], $\epsilon_i$ is a weighting for
each cone, and $\mu_{ik}$ is one of the d generators associated with cone i in $K_v$.
\[ f(P; z) = \sum_{v \in V} \sum_{i \in K_v} \epsilon_i \frac{z^v}{\prod_{k=1}^{d} (1 - z^{\mu_{ik}})} \tag{3.15} \]
Having obtained a generating function in this form, we are now in a position to evaluate
the value of the function at z = 1 and hence determine the number of integer points inside
a polytope.
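For a two-dimensional cone, the unimodularity condition discussed in this section can be
checked directly. The sketch below is our own illustration and is not part of Barvinok's
decomposition; the function names are hypothetical. It computes the determinant of the two
generators and, independently, counts by brute force the integer points λ1µ1 + λ2µ2 with
0 ≤ λi < 1, using Cramer's rule in integer arithmetic to avoid rounding. The cone is taken
with its apex at the origin; translating to an integer vertex v does not change the count.

```c
#include <stdlib.h>

/* Determinant of the 2x2 matrix whose columns are the generators. */
int cone_det_2d(const int m1[2], const int m2[2])
{
    return m1[0] * m2[1] - m1[1] * m2[0];
}

/* Count integer points (x, y) = l1*m1 + l2*m2 with 0 <= l1, l2 < 1.
 * By Cramer's rule l1 = n1 / d and l2 = n2 / d, so the test is done on the
 * integer numerators after normalising the sign of d. */
int parallelepiped_points_2d(const int m1[2], const int m2[2])
{
    int d = cone_det_2d(m1, m2);
    if (d == 0)
        return 0;  /* degenerate: generators are linearly dependent */
    int sign = (d < 0) ? -1 : 1;
    int ad = abs(d);
    int bx = abs(m1[0]) + abs(m2[0]);  /* bounding box of the parallelepiped */
    int by = abs(m1[1]) + abs(m2[1]);
    int count = 0;
    for (int x = -bx; x <= bx; x++)
        for (int y = -by; y <= by; y++) {
            int n1 = sign * (x * m2[1] - y * m2[0]);
            int n2 = sign * (m1[0] * y - m1[1] * x);
            if (n1 >= 0 && n1 < ad && n2 >= 0 && n2 < ad)
                count++;
        }
    return count;
}
```

For the generators of Example 3.3.5, (−1,−1) and (0,−1), the determinant is 1 and a single
point is found, so the cone is unimodular. For the generators (1, 0) and (1, 2), chosen
here only as a counter-example, the determinant is 2 and two points are found.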
3.3.3. Evaluating Generating Functions
The theory presented in the preceding section means the integer points within a polytope
can be represented in a general form as in (3.15). We wish to substitute
z = (z1, z2, . . . , zd), zi = 1, i = 1 . . . d into the generating function to evaluate how many
integer points lie within the polytope. This is made more difficult since a direct substitution
creates poles in each of the terms of the generating function. Instead, following the
approach in [79] and [41], the generating function can be evaluated by first transforming
each term from a multivariate function into a univariate function using a monomial
substitution and then taking a Laurent series expansion around 1. All the coefficients of the
negative powers across all the terms will cancel out and the analytical value at z = 1 is the
sum of the constant terms in the Laurent expansions of the individual terms. This
is exactly the number of integer points enclosed within the polytope. For the interested
reader, we illustrate this approach on a practical example in Appendix B.
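As a small worked illustration (our own, and separate from the example in Appendix B),
consider the one-dimensional interval [0 : 3], whose generating function is the difference
of the two rational terms 1/(1 − z) and $z^4$/(1 − z). Both terms have a pole at z = 1.
Writing z = 1 + s, a substitution assumed here for convenience, and expanding each term
around s = 0 gives:

```latex
\begin{align*}
\frac{1}{1 - z} &= -\frac{1}{s}, \\
\frac{z^4}{1 - z} &= -\frac{(1 + s)^4}{s} = -\frac{1}{s} - 4 - 6s - 4s^2 - s^3, \\
\frac{1}{1 - z} - \frac{z^4}{1 - z} &= 4 + 6s + 4s^2 + s^3 .
\end{align*}
```

The coefficients of $s^{-1}$ cancel, and the constant terms (0 from the first expansion and
−4 from the second, which is subtracted) sum to 4, the number of integer points in the
interval.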
3.4. Counting Integer Points in Parametric Polytopes
A parametric polytope is a polytope with constraints which vary with one (or more)
parameters. In general a parametric polytope may be described using
\[ P(p) = \{ x \mid Ax + Bp \le b \} \]
A parametric polytope may vary in size as the parameter p changes. The vertices of
the polytope may change with the parameter p and some vertices may be dominated by
others in some regions. This means different sets of parametric vertices may be active as
the parameter p changes. Example 3.4.1 demonstrates this.
(c) P(p) when 0 < p ≤ 2
Figure 3.13.: Chamber decomposition of Example 3.4.1.
This example has three possible sets of vertices for which the value of p gives a
non-empty parametric set P(p). If p ≤ −2, as shown in Figure 3.13(a), then the set P(p) is
bounded by the four vertices V1 = (0, 0), V2 = (2, 0), V3 = (2, 2) and V4 = (0, 2). If
−2 < p ≤ 0, as in Figure 3.13(b), then P(p) is bounded by the five points V1 = (0, 0),
V2 = (2, 0), V3 = (2, 2), V5 = (−p, 2) and V6 = (0, −p). If 0 < p ≤ 2, as in Figure 3.13(c),
then P(p) is bounded by three vertices, V7 = (p, 0), V2 = (2, 0) and V8 = (2, 2 − p). All
other values of p give an empty set.
This discrete combination of active vertices is referred to as a chamber decomposition
of the polytope. The work in [41] extends the non-parametric counting techniques
in [39, 79, 41] and can be used to determine a function which, given the parameter variable
p as an input, returns an analytical expression for the number of integer points enclosed
within that parametric polytope. The method, implemented in the Barvinok library [84],
uses the integer point counting approach explained throughout this chapter, lifting the
original polytope description into a higher dimensional space containing both polytope
dimensions (x) and parametric dimensions (p). The generating function representation
of the integer points within this higher dimensional polytope can be partially evaluated
within each discrete chamber of the chamber decomposition. The partial evaluation gives
a quasi-polynomial for each of the disjoint sets which make up the chamber decomposition.
These quasi-polynomial expressions are polynomial functions with coefficients which
depend periodically upon the parameters p. In [41], it is shown that quasi-polynomials
can be alternatively represented as polynomials with coefficients which are expressed as
the fractional parts of linear expressions. Functions designed to count the integer points
within a parametric set are said to have a piecewise quasi-polynomial form.
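A one-dimensional example, chosen by us for illustration and not taken from [41] or the
Barvinok library, shows what a piecewise quasi-polynomial looks like. Take the parametric
set P(p) = {x | 0 ≤ x, 2x ≤ p}. It is empty for p < 0, and for p ≥ 0 it contains
p/2 − {p/2} + 1 integer points, where {·} denotes the fractional part. The term {p/2}
alternates between 0 and 1/2 with period two, which is the periodic coefficient. The
function names below are hypothetical.

```c
/* Brute-force count of integers x with 0 <= x and 2x <= p. */
int count_brute(int p)
{
    int c = 0;
    for (int x = 0; 2 * x <= p; x++)
        c++;
    return c;
}

/* Piecewise quasi-polynomial for the same count.
 * Chamber p < 0 : the set is empty.
 * Chamber p >= 0: p/2 - {p/2} + 1, where {p/2} = (p mod 2) / 2. */
int count_quasi(int p)
{
    if (p < 0)
        return 0;
    return (p - (p % 2)) / 2 + 1;
}
```

The two functions agree for every value of p; for instance both give three points
(x = 0, 1, 2) for p = 4 and for p = 5.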
We use parametric integer set counting techniques in Chapter 5 to determine the exact
cycle in which each iteration within a polytope is scheduled.
3.5. Discussion
The two preceding sections describe how program structure may be described, analysed
and transformed using the polytope model, and specifically how we may count the
number of integer points within a polytope. This knowledge is used in Chapter 4 and
Chapter 5 to describe novel methods for improving memory performance.
In Chapter 4, we use the polytope description, extracted from a kernel of imperative
code, to describe the memory read and write accesses within a program. We introduce two
methods of optimizing the ordering of those memory accesses using loop transformations.
In Chapter 5, we consider two program threads, one dedicated to memory access in an
optimized order and the other executing code without any reordering transform. The
integer point counting techniques presented above are used to schedule the two threads,
with minimal delays inserted to ensure correct data flow within the program.
4. Design of Parametric DRAM Controllers
using the Polytope Model
4.1. Introduction
We wish to develop hardware which generates an optimized memory access sequence at
runtime. The key challenges in doing this are:
• Creating a fast hardware sequencer implementation which can saturate the throughput
of the memory controller
• Optimizing the sequence of addresses to improve SDRAM bandwidth efficiency
These two goals may be antagonistic; more complex controllers may generate efficient
sequences which cannot be implemented in a way which achieves high throughput.
Furthermore, there is a trade-off between the quality of the sequence generated (quantified
in terms of external memory bandwidth efficiency) and the amount of on-chip memory
dedicated to buffering data.
In this chapter, we use the Polytope Model introduced in Chapter 3 to construct hardware
memory access generators for memory address scheduling which exploit structure
in the original high level code. We explore a design space within which solutions trade off
on-chip memory utilization for improved off-chip memory performance.
In doing so, we exploit a long term scaling trend, trading off a resource (device pin
density) which historically has not scaled exponentially in line with Moore’s
Observations [7] for resources (on-chip memory and logic density) which have.
CHAPTER 4. DESIGN OF PARAMETRIC DRAM CONTROLLERS USING THE
POLYTOPE MODEL
We focus our attention on multi-level nested loops. A high performance pipelined
sequence generator is used to scan the outer indices of a loop nest, while accesses in the
innermost loops are optimized using two novel techniques. The two novel techniques
embody two alternative approaches to sequence generation. In both techniques, we
identify the set of memory addresses accessed within a loop nest at compile time and
optimize the memory access schedule to exploit reuse of data items buffered in on-chip
memory. In both techniques, the memory accesses are reordered to minimize the overhead
incurred when activating and precharging SDRAM rows. In both techniques, a trade-off
is implied between the amount of on-chip memory used for on-chip buffering and the
amount of data transferred on the external memory interface.
The two approaches we describe in this chapter in Section 4.4 and Section 4.5 differ in
the constraints they place on the memory schedule, and the structure of the generated
hardware sequence generator which implements the memory schedule. In the first
technique, presented in Section 4.4, the memory access sequence is constrained to be strictly
monotonic and the hardware address generator implements a recurrence relationship,
taking an address as input and returning the next address in the sequence. This method
produces a schedule that implicitly gathers together consecutive accesses to SDRAM
rows and bursts to reduce the number of SDRAM ‘activate’ and ‘precharge’ commands
necessary in the command sequence.
The second technique in Section 4.5 explicitly represents the SDRAM rows and bursts
as new variables in the Polytope Model and allows more general schedules to be applied.
Unlike in our first technique, the hardware generated from this approach has no data
feedback, and is therefore amenable to implementation as a deeply pipelined hardware
circuit which can be run at high clock frequencies.
We begin this chapter in Section 4.2 with a description of how we can, for a specified
problem, automatically derive a class of designs which use on-chip memory buffering
to increase memory access efficiency. Within that class of designs, a single parameter,
t, can be used to trade off increased on-chip buffer size for improvements in memory
access efficiency. We follow with a description in Section 4.3 of how a sequence of outer
loop indices can be efficiently generated before giving a description of each of our two
novel methods for generating optimized inner loop address sequences in Section 4.4 and
Section 4.5 respectively. We evaluate the performance and circuit area trade-offs of each
method in turn by implementing three simple example applications. We follow
this with a comparison of the two methods in Section 4.6, evaluating their relative merits
in developing high-performance memory systems.
4.2. Decoupling Memory Access from Execution using On-Chip
Memory
In any useful program, there is a mix of read accesses and write accesses. As indicated
in Chapter 3, the freedom to reorder the statements executing in a program is restricted
by the data dependencies within the program: the true read-after-write dependencies
within the code prevent arbitrary reordering of the statements within the program loop
structures.
We wish to optimize memory bandwidth by reordering memory instructions without
altering the execution order of program statements. We can tackle this problem by
separating a program into separate communication and execution threads which share
data through an on-chip memory buffer. In this section, we demonstrate how a single
parameter, t, can be used to produce a family of architectures with different on-chip
memory requirements. In doing so, we illustrate three mechanisms which allow us to trade
on-chip buffer space for improved memory bandwidth efficiency through a reduction in
bus-turnaround cycles, the exploitation of data-reuse, and the reordering of memory
requests to reduce the number of ‘activate’ and ‘precharge’ cycles needed to change the
active memory row.
As an illustrative example, consider the two-level loop nest shown in Figure 4.1. Fig-
ure 4.2(a) shows one possible way in which the code can be transformed into two threads
which communicate using on-chip memory.
for (i = 0; i < 2; i++) {
    for (j = 0; j < 2; j++) {
        a[2*i + j] = a[2*i + j + 4] + 3;
    }
}
Figure 4.1.: Two level nested loop example.
The schedule of the threads in Figure 4.2(a), shown in Figure 4.3(a), sees all the data
required for program execution loaded into an on-chip memory buffer before execution
begins. The dotted lines indicate the presence of a barrier synchronization method
which prevents the two threads from executing concurrently. During program execution,
data items are written to and read from the on-chip buffer in the execution thread and,
after execution completes, the data items written during execution are buffered and
written back to memory. In this chapter, we assume that kernel execution and data
transfer into on-chip buffers do not occur simultaneously, although this condition is
relaxed in Chapter 5 where we describe a technique for concurrent execution of the
execution and communication threads.
The synchronization barriers shown in Figure 4.3, which synchronize the two threads to
prevent concurrent execution, could be implemented using an explicit handshake method
between the memory and execution threads, or by pre-compiled static scheduling based
on a common clock signal.
Figure 4.3 shows the memory access sequence for the three diﬀerent example imple-
mentations shown in Figures 4.2(a)-(c). In Figure 4.2(a), the read accesses in the inner
loop are sequenced before execution begins and data is written back to memory after
execution completes. In many cases, a simple communication/execution split like the one
shown in Figure 4.2(a) and Figure 4.3(a) is infeasible, because the on-chip memory is not
large enough to simultaneously store all the data accessed within a loop nest.
[Figure 4.3: timing diagrams of the communication_thread and execution_thread against
time, showing reads R4 to R7 and writes W0 to W3 under (a) parameterization with t = 1,
(b) parameterization with t = 2 and (c) parameterization with t = 3.]
Figure 4.3.: Execution schedule showing communication and execution under three
different parameterizations.
In such a case, rather than introduce an on-chip memory buffer to hold all the data
accessed in the loop nest, we can select a smaller on-chip buﬀer to store only the set
of memory items corresponding to some specific iteration. In this scenario, all the data
accessed within an outer loop (or outer loops) can be loaded into on-chip memory, exe-
cution of the inner loops can progress and when complete, data written during execution
of those inner loops can be written back to memory before repeating for the next iteration
of the outer loop(s).
We can introduce on-chip memory buffers for decoupling memory access from execution
at any level of the loop nest and use the parameter t to denote the level at which a
buffer is introduced. Where t = 1, we introduce a buffer at the outermost
level of the loop nest and must prefetch all the data required for execution of the loop
nest before execution begins. At the opposite extreme, the parameterisation t = n + 1
indicates introduction of a buffer large enough to contain just the elements accessed in
a single iteration of the innermost loop. Figures 4.2(b) and 4.2(c) show two alternative
ways of scheduling the loops, with buffers inserted outside the innermost loop (t = 2)
and within the innermost loop (t = 3), the latter representing the original code sequence.
The memory access schedules for each parameterisation are shown in Figures 4.3(a)-(c).
The three parameterizations require different amounts of on-chip memory: the
parameterisation which introduces an on-chip buffer outside the outermost loop (t = 1)
requires eight words of on-chip memory, whilst the (t = 2) and (t = 3) parameterizations
require four and two words of on-chip memory respectively. While not considered within
this thesis, standard loop-splitting transformations [66, 65] can be used to give the user
even finer control over the size of the required on-chip data buffer.
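As a rough illustration of the t = 2 parameterization, the following C sketch restructures the loop of Figure 4.1 so that each outer iteration has three phases: a communication phase that fills a buffer, an execution phase that touches only the buffer, and a second communication phase that writes results back. The function name run_t2, the split into rd and wr arrays (four words in total) and the modelling of external memory as a plain array are assumptions made for this sketch; they are not taken from Figure 4.2(b).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the t = 2 split for the loop in Figure 4.1.
 * 'a' stands in for external memory and must hold at least 8 words. */
void run_t2(int *a, size_t len)
{
    assert(a != NULL && len >= 8);
    for (int i = 0; i < 2; i++) {
        int rd[2];  /* on-chip buffer: values read from a[2*i + j + 4] */
        int wr[2];  /* on-chip buffer: values destined for a[2*i + j]  */

        /* Communication phase: prefetch every read of this outer iteration. */
        for (int j = 0; j < 2; j++)
            rd[j] = a[2 * i + j + 4];

        /* Execution phase: touches only the on-chip buffers. */
        for (int j = 0; j < 2; j++)
            wr[j] = rd[j] + 3;

        /* Communication phase: commit results to external memory. */
        for (int j = 0; j < 2; j++)
            a[2 * i + j] = wr[j];
    }
}
```

Moving the two communication phases outside the i loop (with eight buffer words) would correspond to t = 1, and moving them inside the j loop (with two buffer words) to t = 3.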
When the external memory interface is to SDRAM, data is transferred on a tri-stated bus.
Switching the bus drivers means a penalty is incurred on each transition from performing
read operations to performing write operations and vice versa. In the three memory
access schedules shown in Figure 4.3, we see a single transition from read to write
operation in Figure 4.3(a) whilst Figures 4.3(b) and 4.3(c) show three and seven transitions
respectively. This illustrates the first of three trade-offs between the use of on-chip memory
and efficient use of the off-chip memory interface: a reduction in the number of bus
turnaround cycles when buffers are introduced at the outer levels of the target loop nest.
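The transition counts quoted above can be reproduced with a small model. The sketch below builds the read/write sequence for the 2x2 loop nest of Figure 4.1 at each parameterization level and counts direction changes. The helper name count_turnarounds is hypothetical, and the model assumes, as in Figure 4.3, that each buffer fill issues all of its reads before all of its writes.

```c
#include <assert.h>

/* Counts read<->write direction changes on the external bus for the
 * 2x2 loop nest of Figure 4.1 when buffers are inserted at level t
 * (t = 1, 2 or 3). Hypothetical helper written for this illustration. */
int count_turnarounds(int t)
{
    assert(t >= 1 && t <= 3);
    char seq[8];
    int n = 0;

    /* Number of buffer fills, and reads (or writes) issued per fill. */
    int blocks = (t == 1) ? 1 : (t == 2) ? 2 : 4;
    int per_block = 4 / blocks;

    for (int b = 0; b < blocks; b++) {
        for (int k = 0; k < per_block; k++) seq[n++] = 'R';
        for (int k = 0; k < per_block; k++) seq[n++] = 'W';
    }

    int turns = 0;
    for (int k = 1; k < n; k++)
        if (seq[k] != seq[k - 1])
            turns++;
    return turns;
}
```

Under these assumptions the function returns 1, 3 and 7 for t = 1, 2 and 3, matching the schedules of Figure 4.3.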
The second trade-off is the exploitation of data reuse. Using the notation introduced in
Chapter 3, we denote the set of operations in the execution thread as $S_E$. At the innermost
level of the loop, there are memory operations which access a set of data items $S_M$.
Because multiple iterations within the inner loop may access the same data item, the
cardinality of the set $S_M$ may be smaller than that of $S_E$. When buffers are introduced
at the outermost loop level (t = 1) and an efficient method is used to transfer data into
the on-chip buffers, the total number of accesses to external memory may be significantly
reduced. At the innermost level, where no data is buffered (t = n + 1), every iteration will
generate an access to external memory for all memory referencing functions.
In the original execution order specified in source code, multiple iterations within a
loop nest may access the same data in memory. We refer to this as data reuse. Data reuse
may be exploited to reduce the quantity of data fetched from external memory. Memory
which is read multiple times may be fetched once into an on-chip buffer, and subsequent
accesses may use the on-chip buffered data, avoiding additional off-chip memory accesses.
Memory writes may also be optimized: we may buffer write accesses in on-chip memory
and commit only the final write to each address to external memory.
The third mechanism is reordering memory accesses to reduce the number of ‘precharge’
and ‘activate’ commands used to change memory row. For some chosen level of
parameterisation, the set of read accesses which populate the on-chip buffer for a specific
outer loop iteration vector can be arbitrarily reordered. Read-After-Write dependencies are
guaranteed to be satisfied under the condition that all the read accesses required to fill the
on-chip buffer occur before execution begins and the write accesses required to commit
results are scheduled after execution and before data is fetched for the next iteration.
Write-After-Write dependencies which occur within a single memory reference (i.e. many
separate writes to a single address can be generated from a single memory reference using
different loop indices) are resolved in the on-chip memory. 1
In the sections that follow, we present methodologies for exploiting these three
mechanisms: reduction of the number of cycles dedicated to bus turnaround, reduction in
the number of memory accesses through the exploitation of data reuse, and reordering
of memory transactions to minimize the number of row ‘activate’ and ‘precharge’
commands. In two distinct methodologies we exploit the three performance-improving
mechanisms and generate address sequencing hardware to efficiently populate the
on-chip memory buffers and write back data produced by the datapath. We evaluate each
methodology using different parameterisation levels, introducing buffers at different
levels to evaluate the trade-off between on-chip buffer size and SDRAM interface efficiency.
1Write-after-Write dependencies between diﬀerent memory references imply additional constraints on the
ordering of oﬀ-chip memory accesses.
4.3. Designing High-Performance Hardware for Polytope Scanning
In the previous section, we described how we can split a loop nest into two parts: an inner
set of nested loops for which memory contents are buffered on chip and an outer set of
loops preserving the original program structure. In this section we describe how to design
high-performance hardware to sequence the iterations of an affine loop nest. We refer
to this as ‘scanning’ the outer loop nest. This operation is common to the parameterised
designs under both techniques presented in this chapter, each of which has an outer loop
iteration block and an optimized inner loop for efficient memory access. Hardware for
scanning the iterations of the outer loops is derived directly from the bounds in the
original program. Because these outer loops come directly from the original source code,
the upper and lower bounds of each loop may be constants or affine functions of the loop
iterators.
Figure 4.4 shows the hardware for scanning a single loop with constant bounds. Where
n is the number of loops in the target loop nest, and t is the parameterisation level which
determines where reuse buffers are inserted into our generated architecture, the outer
loop scanning hardware requires one hardware ‘loop’ block for each of the m = t − 1 loop
levels which must be scanned (the loops $x_1, \ldots, x_{t-1}$ lying outside the buffer).
In the more general case where loops have affine upper and/or lower bounds, each of
the hardware loop blocks generates an output $x_i$ where $i < m$ and each must take in the
$[x_{i-1}, x_{i-2}, \ldots, x_0]$ iterators to derive upper and lower loop bounds. The set of iterations
to be scanned is a bounded set of integer vectors. We can use that information to select
optimal word lengths for the loop iterator signals $x_i$ and loop bound evaluation logic.
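The dependence of each loop block on the iterators of its enclosing loops can be made concrete with a sequential software model. The sketch below is not the hardware of Figures 4.4 and 4.5, which evaluates bounds in parallel; it only illustrates how level i derives its inclusive bounds from the iterators of the enclosing levels. The type affine_t, the functions scan_first and scan_next, and the fixed DEPTH are all assumptions introduced for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define DEPTH 2  /* number of scanned loop levels in this sketch */

/* Affine bound c[0]*x[0] + ... + c[level-1]*x[level-1] + k.
 * Only the coefficients of enclosing loops are used. */
typedef struct { int c[DEPTH]; int k; } affine_t;

static int eval(const affine_t *a, const int *x, int level)
{
    int v = a->k;
    for (int d = 0; d < level; d++)
        v += a->c[d] * x[d];
    return v;
}

/* Steps loop 'level', carrying outwards and resetting inwards as needed. */
static bool advance(const affine_t *lo, const affine_t *hi, int *x, int level)
{
    while (level >= 0) {
        x[level]++;
        if (x[level] > eval(&hi[level], x, level)) {
            level--;               /* loop exhausted: carry to enclosing loop */
            continue;
        }
        int l = level + 1;
        while (l < DEPTH) {        /* restart inner loops at their lower bounds */
            x[l] = eval(&lo[l], x, l);
            if (x[l] > eval(&hi[l], x, l))
                break;             /* inner range empty for these outer values */
            l++;
        }
        if (l == DEPTH)
            return true;
        level = l - 1;             /* step the loop enclosing the empty range */
    }
    return false;
}

/* Writes the first iteration vector into x; returns false if there is none. */
bool scan_first(const affine_t *lo, const affine_t *hi, int *x)
{
    assert(lo && hi && x);
    x[0] = eval(&lo[0], x, 0) - 1;
    return advance(lo, hi, x, 0);
}

/* Advances x to the next iteration vector in lexicographic order. */
bool scan_next(const affine_t *lo, const affine_t *hi, int *x)
{
    return advance(lo, hi, x, DEPTH - 1);
}
```

For a triangular nest with 0 <= x0 <= 2 and x0 <= x1 <= 2, repeated calls visit six iteration vectors, from (0, 0) to (2, 2).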
Figure 4.5 shows hardware for evaluating non-constant loop bounds. We
can increase the throughput of this circuit by adding pipeline stages. Registers are added
at the output of the logic used for evaluating upper and lower affine bounds. Synthesis
tools [85] are able to apply ‘physical synthesis’ techniques (i.e. knowledge of the layout
of implemented circuits) to move those registers and balance the pipeline stage delays.
Figure 4.5.: Aﬃne Unit.
The hardware shown in Figure 4.5 is non-implementable because it contains anti-delay
blocks which depict advancement of a signal in time. However, through application of
counter-flow pipelining techniques such as those in [86], we can push those delays back
through the cascaded ‘loop’ blocks. The loop bounds at the outermost level must be
constants. They can therefore absorb any delay or anti-delay elements. Pipeline opti-
mization in this way allows us to automatically construct control logic to manage the
filling, operation and drainage of data from a pipeline. This ensures that the cascaded
‘loop’ blocks are always realizable with an arbitrary depth of pipelining, whilst maintain-
ing a single-cycle throughput rate. The initial latency of the circuit varies linearly with
Figure 4.6.: Multiple Unit.
to generate the outer loop indices. The hardware implementation has parallel evaluation
of the loop bounds which means it can generate a new iteration every cycle. This is a
capability which cannot practically be realised within a software implementation.
In this section, we have described the implementation of logic for scanning the outer
loops of a loop nest under diﬀerent parameterizations. In the section which follows, we
provide details of the first of two approaches for optimizing the memory accesses which
occur within the inner loop.
4.4. Reordering Inner Loop Memory Accesses Using Strictly Monotonic Memory Scheduling
The representation of computation in the Polytope Model gives a bounded set of iterations
($x \in S_E$) and a set of memory access functions ($f(x) = Fx + h$) which associate each
iteration with some set of data to be read from memory and/or some set of data that
must be written to memory. In this section, we apply the technique in Section 4.2,
splitting the loop indices into an outer and inner set of indices and creating two threads,
a memory thread and an execution thread, communicating through an on-chip memory
buffer. We wish to automatically derive an application-specific memory controller which
manages the transfer of data between external memory and on-chip memory, exploiting
the three mechanisms described in Section 4.2 to efficiently use limited external memory
bandwidth.
In the first of two methodologies for doing this presented in this chapter, we apply a
strictly monotonic memory scheduling policy to the addresses generated by each memory
function, f(x).
Given the set of memory accesses ($S_M$) defined by the iteration space and associated
with a specific memory access function f(x), one suitable memory scheduling function
for those memory items iterates through the set in strict ascending order. This scheduling
order ensures that data reuse is exploited, since the ‘strict’ qualification of the ordering
means each memory address is fetched from memory at most once. The ‘monotonic’
nature of the scheduling policy ensures that each SDRAM row is only accessed once,
hence the minimum number of row ‘activation’ and ‘precharge’ commands are issued.
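The effect of monotonic ordering on row changes can be checked with a deliberately simplified DRAM model. The sketch below assumes a single bank with one open row at a time and a fixed number of words per row. Both the model and the helper name count_row_activations are assumptions for illustration, not a description of a real SDRAM device.

```c
#include <assert.h>
#include <stddef.h>

/* Counts row activations for an address stream under a simplified model:
 * one bank, one open row at a time, 'row_words' consecutive addresses per
 * row. An activation is counted for the first access and whenever the
 * addressed row differs from the currently open row. */
size_t count_row_activations(const unsigned *addr, size_t n, unsigned row_words)
{
    assert(addr != NULL && row_words > 0);
    size_t activations = 0;
    unsigned open_row = 0;
    for (size_t k = 0; k < n; k++) {
        unsigned row = addr[k] / row_words;
        if (k == 0 || row != open_row) {
            activations++;
            open_row = row;
        }
    }
    return activations;
}
```

With a hypothetical row size of four words, the non-monotonic stream 0, 4, 2, 6, 4, 8 costs five activations in this model, whereas the strictly ascending stream 0, 2, 4, 6, 8 over the same addresses costs three.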
The challenge in finding an efficient memory sequencer is that a memory access function
is in general not injective on the iteration space (many iterations may access
the same data item), and its image is not dense (a projection of the iteration space bounds
spans a range of addresses, some of which may not be accessed in the original program).
In the sections which follow, we give an illustrative example and show how the
application of a strictly monotonic scheduling policy gives an optimized memory access
sequence. We then show a method for automatically constructing an efficient address
sequencer using proven linear optimization techniques.
4.4.1. Motivating Example
As a ‘toy’ illustrative example, consider the code shown in Figure 4.7(a). Two-dimensional
iteration vectors [i, j] represent each iteration of the statement within the loop nest.
Figure 4.8 shows how each of these vectors generates a memory address. The iterations [0, 1]
and [2, 0] both access the same memory address. This data reuse by different iterations can
be exploited by only loading the data at address 4 once. Furthermore, the example contains
‘holes’; the addresses 1, 3, 5 and 7 are never accessed and need not be loaded. Finally, the
ordering of the loop iterations implies that a non-monotonic sequence of addresses is
generated by this code. In the presence of page breaks, such a sequence causes unnecessary
‘precharge’ and ‘activate’ commands to be generated to swap between rows. Enforcing
monotonicity on the address sequencing function minimizes the number of ‘precharge’ and
‘activate’ commands needed, improving overall bandwidth efficiency. The code in
Figure 4.7(b) is synthesized from Figure 4.7(a) using our methodology. This code generates
a sequence of addresses in the variable $R'$. When implemented in hardware, this code
produces the monotonic sequence of addresses 0, 2, 4, 6, 8 in sequential calls to the
‘generate memory address’ function. In this sequence, we have the same set of memory
addresses as in the initial problem description, but the sequence has no repetition of
addresses and is strictly monotonic, which guarantees the minimum number of ‘precharge’
and ‘activate’ commands.
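For a problem this small, the target sequence can be obtained by brute force, which gives a useful reference against which a synthesized sequencer could be checked. The sketch below is not the method developed in this chapter: it enumerates the iteration space 0 <= i <= 2, 0 <= j <= 1, marks each address 2i + 4j (the access function implied by the constraints given for this example), and emits the marked addresses in ascending order. The function name and the ARRAY_WORDS constant are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_WORDS 9

/* Enumerates the example iteration space, marks every address 2*i + 4*j
 * that is touched, and writes the touched addresses to 'out' in strictly
 * ascending order. Returns the number of addresses written.
 * 'out' must have room for ARRAY_WORDS entries. */
size_t monotonic_addresses(int out[ARRAY_WORDS])
{
    bool touched[ARRAY_WORDS] = { false };
    for (int i = 0; i <= 2; i++) {
        for (int j = 0; j <= 1; j++) {
            int addr = 2 * i + 4 * j;
            assert(addr >= 0 && addr < ARRAY_WORDS);
            touched[addr] = true;   /* repeated accesses collapse to one mark */
        }
    }
    size_t n = 0;
    for (int addr = 0; addr < ARRAY_WORDS; addr++)
        if (touched[addr])          /* untouched addresses (holes) are skipped */
            out[n++] = addr;
    return n;
}
```

The enumeration yields the five addresses 0, 2, 4, 6, 8. The approach scales with the size of the address range rather than the number of distinct addresses, which is one reason a closed-form sequencer is preferable in hardware.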
4.4.2. Parametric Integer Linear Programming Formulation
We can meet the challenge of finding a memory address sequencer which skips addresses
not accessed within a program and will not access memory not referenced in the original
source code by formulating the problem as a formal integer optimization problem.
int A[9];
for (i = 0 ; i <= 2 ; i++) {
...
if (-i - 2*j + 1 >= 0) {
/* Note: integer division */
k = (2*i + 3) / 4;
if (j + k - 1 >= 0) {
R' = 2*i + 4*j + 2;
i' = i - 2*k + 1;
j' = j + k;
} else {
R' = 2*i + 2*j - 2*k + 2;
...
if (-i - j + 3 >= 0) {
R' = 2*i + 4*j + 2;
i' = i + 2*j - 1;
j' = 1;
} else {
...
Figure 4.7.: Example showing (a) source code and (b) output code for solution to example
(a).
[Figure 4.8: the set of iteration points (i, j) = (0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1),
each mapped onto the set of memory addresses 0 to 8.]
Figure 4.8.: Mapping from (i, j) iteration space to memory addresses for code in
Figure 4.7(a).
Integer Linear Programs (ILPs) are optimization problems which seek to minimize an
objective function subject to a set of linear constraints while forcing only integer
assignments to the optimization variables, as illustrated in Equation (4.1). This concept has been
extended to Parametric Integer Programming, as in Equation (4.2), where the constraints
can be described in terms of some parameter q, thus producing p as an explicit function of q
rather than as an integer vector, as in ILP.
\[ \min_{p} \; k^{T} p \quad \text{s.t.} \quad A p \le b \tag{4.1} \]
\[ \min_{p} \; k^{T} p \quad \text{s.t.} \quad A p \le b + C q \tag{4.2} \]
One example of a Parametric Integer Linear Programming solver is PIP [48]. This tool
uses a dual simplex method to find a rational optimum and then generates additional
constraints (cuts) which exclude that rational optimum but do not exclude any integer
values. PIP forms parameterized Gomory [87] cuts which decompose the (parameterized)
problem space into chambers, each of which has an optimal value which is a linear
function of the problem parameters.
We can use parametric integer linear programming to derive an address sequencing
function for populating reuse buffers within our proposed application-specific memory
subsystem. We define a problem which seeks to find the minimum address ($R'$) within the
set $S_M$ that is greater than the parameter ($R = Fx + h$) where $x \in S_E$, representing an
iteration which generates the current address (R). The piecewise function obtained as a
solution to this parametric problem is a recurrence function which can be used to
iteratively generate memory addresses from $S_M$. The sequence of addresses generated is
strictly monotonic. In the section that follows, we describe the constraints necessary to
formulate the problem.
4.4.3. A Formulation for Finding a Strictly Monotonic Address Sequencing Function
Our objective is to derive a function that generates a sequence of memory addresses
populating a reuse buffer. We propose a method which produces an address sequencing
function directly from the polytope problem definition using parametric integer linear
programming with embedded constraints which ensure that the strict monotonicity
property holds. The solution to the parametric integer linear programming formulation is used
to automatically synthesize memory addressing functions which are implemented within
an FPGA.
For a reuse buffer introduced at a specific level t in the loop structure we define the
parametric integer linear program in terms of the variable
$p = [R', x'_t, x'_{t+1}, \ldots, x'_n]^T$, representing the next address and inner loop
iterators, and parameterized in terms of the variable $q = [x_1, x_2, \ldots, x_n]^T$
representing an iteration index. Given the current value q of the loop iterators, our
objective is simply to find the first address accessed in the code after $Fq + h$, the
current address. This corresponds to minimizing $R'$ subject to the constraints
we outline below. In explaining these constraints, we make reference to the example in
Figure 4.7(a) and assume, for simplicity, that a reuse buffer is inserted at the outermost
level (outside the loop nest). Therefore for this example $p = [R', i', j']^T$, $q = [i, j]^T$ and
t = 1.
The first constraint is an inequality constraint which ensures strict monotonicity, i.e.
that the next address will be at least one more than the current address.
\[ R' \ge Fq + h + 1 \tag{4.3} \]
In our example code, this corresponds to the constraint $R' \ge 2i + 4j + 0 + 1$.
The second constraint is an equality constraint that ensures that the new address ($R'$)
can be generated by the memory access function $f(x) = Fx + h$ through a linear combination
of the inner-loop variables $x_t, x_{t+1}, \ldots, x_n$. Writing $F = [F_1 \; F_2]$:
\[ R' - \left( F_1 [x_1, \ldots, x_{t-1}]^T + F_2 [x'_t, x'_{t+1}, \ldots, x'_n]^T \right) = h \tag{4.4} \]
In our example, this corresponds to the constraint $R' - 2i' - 4j' = 0$.
The remaining constraints ensure that the combination of the instantiation of the
variables $x'_t, x'_{t+1}, \ldots, x'_n$ and the parameter variables $x_1, \ldots, x_{t-1}$ lies within
the iteration space. Writing $A = [A_1 \; A_2]$:
\[ A_1 [x_1, \ldots, x_{t-1}]^T + A_2 [x'_t, x'_{t+1}, \ldots, x'_n]^T \le b \tag{4.5} \]
For our example, the constraints $i' \le 2$, $-i' \le 0$, $j' \le 1$ and $-j' \le 0$ ensure this.
The variable $q = [x_1, x_2, \ldots, x_n]^T$ is constrained by the constraints implied by the original
source polytope. In our example code, this means $i \le 2$, $-i \le 0$, $j \le 1$ and $-j \le 0$.
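For the example, the effect of constraints (4.3) to (4.5) can be checked with an exhaustive search that stands in for the parametric solver. The sketch below is not how PIP works: it simply tries every point of the inner iteration space for a given (i, j) and keeps the candidate with the smallest $R'$. It assumes that ties on $R'$ are broken in favour of the lexicographically smallest $(i', j')$; the names next_t and next_address are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int r, i, j; } next_t;

/* Brute-force stand-in for the parametric ILP of the example.
 * Given the current iteration (i, j), finds the smallest R' (ties broken
 * by lexicographically smallest (i', j')) such that
 *   R' >= 2i + 4j + 1             (4.3)
 *   R' - 2i' - 4j' = 0            (4.4)
 *   0 <= i' <= 2, 0 <= j' <= 1    (4.5)
 * Returns false when no such point exists. */
bool next_address(int i, int j, next_t *out)
{
    assert(out != 0);
    assert(i >= 0 && i <= 2 && j >= 0 && j <= 1);
    int current = 2 * i + 4 * j;
    bool found = false;
    for (int ip = 0; ip <= 2; ip++) {
        for (int jp = 0; jp <= 1; jp++) {
            int r = 2 * ip + 4 * jp;      /* satisfies (4.4) by construction */
            if (r < current + 1)          /* violates (4.3) */
                continue;
            /* (ip, jp) are visited in lexicographic order, so a strict '<'
             * on r keeps the lexicographically smallest tie. */
            if (!found || r < out->r) {
                out->r = r;
                out->i = ip;
                out->j = jp;
                found = true;
            }
        }
    }
    return found;
}
```

Starting from (i, j) = (0, 0) and feeding each returned $(i', j')$ back in as the next (i, j), the search returns the addresses 2, 4, 6 and 8 and then reports that no further solution exists.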
The problem is formulated at design time and used as an input to the parametric integer
linear programming tool Piplib [48]. The solution returned by the tool is a piecewise
linear function defined as a binary decision tree. The output from the example in
Figure 4.7(a), when converted to equivalent C code, is illustrated in Figure 4.7(b). If this
code is evaluated with the initial indices (i = 0, j = 0), an address ($R' = 2$) and a further
two indices $i' = 1$, $j' = 0$ are generated. On every iteration of the ‘while’ loop, the variables
$i'$ and $j'$ are calculated as a function of i and j and fed back into the loop as i and j to
calculate values for the next iteration. A sequence of addresses is formed by the $R'$ values
calculated at each iteration. For the example code, the sequence of addresses (0, 2, 4, 6, 8)
is generated. This is exactly the set of memory addresses referenced by the original
input code, presented in a strictly increasing order and with all repetition removed. This
ensures that:
• No unnecessary row-swaps are incurred and the minimum number of ‘precharge’
and ‘activate’ commands are generated.
• Where data reuse occurs within the inner loops, a minimal number of ‘reads’ and
‘writes’ are generated on the external memory interface. In our example, the two
iterations (i = 0, j = 1) and (i = 2, j = 0) both read address 4, which in program
order generates two external memory transactions. The strictly monotonic property
of our sequence ensures that address 4 is accessed in external memory only once.
In Section 4.4.4 we show that the parametric integer programming solution implicitly
defines a state machine and thus can be easily implemented in hardware on an FPGA.
4.4.4. Inner Loop Hardware Implementation
The architecture of the proposed inner loop address sequencers is shown in Figure 4.9.
This shows the components required to implement the output code shown in Figure 4.7(b).
Hardware is automatically produced from the solution to the parametric integer linear
programming (pILP) formulation given in Section 4.4.3 using a custom synthesis tool.
This tool (which is approximately 10K source lines of code) translates the in-memory
binary tree representation of conditions, affine expressions and scalar divisions which is
emitted as the solution to the formulated pILP problem into a graph of hardware operators.
After substitution of optimized division operators, a Verilog representation of the graph
structure is emitted for implementation using FPGA vendor-provided synthesis tools.
The inner loop hardware is made up of ‘first’ blocks and ‘next’ blocks for each memory
reference. As each new outer loop iteration begins (with the outer loops defined by the
parameter t), the ‘first’ block calculates the first memory address that must be sequenced,
implementing a piecewise-affine function of the outer loop variables which gives the
lowest address for any memory reference for a specific outer-loop iteration. The logic
in a ‘next’ block implements a recurrence relationship to sequence subsequent memory
addresses as a state machine with states defined by $(x'_t, x'_{t+1}, \ldots, x'_n)$, a single
output ($R'$) and t − 1 inputs ($x_1, \ldots, x_{t-1}$). Both the ‘next’ and ‘first’ address
generator blocks are synthesized automatically from the solution to the formulation in
Section 4.4.3.
[Figure 4.9: block diagram of the inner loop address sequencer; the outer loop hardware
is as in Figure 4.4.]
Figure 4.9.: Inner Loop Design.
Control hardware is responsible for sequencing the diﬀerent memory addressing func-
tions within the inner loop and their associated ‘first’ and ‘next’ functions. When ‘done’
is signalled by the final addressing function in the inner loop, the loop signals the outer
loop logic to update to the next value. If it can be determined that only one memory
access occurs for a particular outer loop iteration, the ‘next’ block does not generate any
memory addresses.
‘First’ Blocks
For each memory reference, a hardware block is generated to calculate the first inner
loop address for a specific outer loop iteration. This hardware block is derived from the
parametric integer linear programming solution as a combinatorial arithmetic circuit. In
the majority of cases, the only logical conditions evaluated by the first block are those
validating the original polytope bounds. We know these to be redundant conditions since
the outer loop scanning hardware will only generate values which meet these conditions.
With this simplification, the first block reduces to an affine combination of the outer loop
iterations, which can be efficiently implemented using constant coefficient multipliers
and making use of fast carry chain hardware within the FPGA.
Figure 4.10.: Next Block Design.
‘Next’ Blocks
For a given set of outer loops (determined by t), the inner loops may generate one or more
addresses through application of the memory addressing functions in the form $Fx + h$. If
the set of generated memory addresses has exactly one member for all possible values of
the outer loops, the parametric integer linear programming solution indicates that, for all
valid parameter values (i.e. outer loop iterations), no valid integer solutions will satisfy
the strictly monotonic scheduling condition and lie within the original problem iteration
space.
In such circumstances, no ‘next’ state hardware need be generated. In all other cases,
we derive a state machine as a direct representation of the parametric integer linear
programming solution. The output from the circuit is fed back to the inputs to generate a
sequence of addresses, until no further solutions exist. Control then passes to the next
memory reference to generate the appropriate sequence of memory accesses. Figure 4.10
shows a representation of the ‘Next Block’, with affine conditions calculated on the
right and affine output values calculated on the left. The diagram corresponds to the
derived code in Figure 4.7(b), with a legend for the affine blocks labelled 1 to 13 provided
in Table 4.1.
Table 4.1.: Legend describing the affine functions referenced in Figure 4.10.
Index   Affine Function
1       -i - 2*j + 1 >= 0
2       j + k - 1 >= 0
3       -i - j + 3 >= 0
4       i - 2*k + 1
5       i + j - k + 1
6       i + 2*j - 1
7       j + k
8       0
9       1
10      2*i + 4*j + 2
11      2*i + 2*j - 2*k + 2
12      2*i + 4*j + 2
13      2*i + 3
The affine condition values are transformed in a priority block to implement the
branching conditions of the binary condition tree. This is used to select from a number of
possible output values which are fed back to the circuit inputs at the top of the diagram.
The circuit implements all conditional evaluation and output generation in parallel. If we
wish to sustain an output of one address per cycle, no pipelining is possible due to the
feedback of output values used to calculate subsequent values. This has an impact on
achievable clock frequency.
Optimizations
Table 4.2.: Resource usage and maximum frequency comparison between inferred divider
Divisor  Dividend range  LUTs (inferred)  LUTs (optimized)  Fmax (inferred)  Fmax (optimized)
6        5 bits          36               21                170 MHz          334 MHz
72       12 bits         145              100               83 MHz           215 MHz
190      15 bits         204              139               66 MHz           198 MHz
192      15 bits         204              111               66 MHz           203 MHz
288      11 bits         69               98                150 MHz          225 MHz
392      9 bits          58               90                154 MHz          229 MHz
400      16 bits         233              144               57 MHz           200 MHz
5184     16 bits         163              126               93 MHz           192 MHz
The output from the parametric integer linear program may require the introduction of
new variables which correspond to the Gomory cuts introduced by PIP to exclude rational
optima. These always have the form of an affine function of the parameters, divided by
an integer scalar value, and rounded to minus infinity. We can optimize the hardware
implementation of these expressions using results from [88, 89]. These results prove that,
when x has restricted range, division by a scalar constant M, followed by a rounding
operation, can be equivalently represented as a multiply-add operation followed by a








The results in [88] show we can find optimal values, α, β and k, to build optimized
hardware scalar division circuits. When applied to the different scalar division operations seen
in our examples, our automatic tool, taking a scalar divisor and dividend range as input,
can form optimized dividers such as those listed in Table 4.2. When synthesized, these
dividers take advantage of efficient constant multiply-add operations. These are optimized
in FPGA synthesis tools due to their prevalence in signal processing applications.
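The idea of replacing a constant division over a restricted range can be sketched in software. The example below assumes a representation of the form (α·x + β) >> k for non-negative x. This form is an assumption of the sketch and is not taken from [88], and the parameters it finds are not claimed to be the optimal ones referred to above. It fixes β = 0, takes α = ceil(2^k / M), and exhaustively searches for the smallest k that is exact over the whole dividend range; the names divider_t, find_divider and divide are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t alpha; uint64_t beta; unsigned k; } divider_t;

/* Searches for the smallest shift k (with alpha = ceil(2^k / m), beta = 0)
 * such that (alpha * x + beta) >> k == x / m for every x in [0, 2^bits).
 * The check is exhaustive, so this is only meant for small ranges. */
bool find_divider(uint32_t m, unsigned bits, divider_t *out)
{
    assert(out != 0 && m > 0 && bits >= 1 && bits <= 16);
    uint32_t limit = (uint32_t)1 << bits;
    for (unsigned k = 0; k <= 40; k++) {
        uint64_t alpha = (((uint64_t)1 << k) + m - 1) / m;
        bool ok = true;
        for (uint32_t x = 0; x < limit && ok; x++)
            ok = ((alpha * x) >> k) == x / m;
        if (ok) {
            out->alpha = alpha;
            out->beta = 0;
            out->k = k;
            return true;
        }
    }
    return false;
}

/* Evaluates the multiply-add-shift form for one dividend. */
uint32_t divide(const divider_t *d, uint32_t x)
{
    return (uint32_t)((d->alpha * x + d->beta) >> d->k);
}
```

A caller would run find_divider once per (divisor, range) pair at design time, for instance for the divisor 6 over a 5-bit range, and implement only the multiply-add and shift of divide in hardware.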
We note from the results in Table 4.2 that the optimized dividers have reduced delay
(and thus improved Fmax) over dividers implemented using inferred division
IP within the Quartus synthesis flow, with clock frequency improvements ranging from
1.4× to 3.5×. In all but two of the eight circuits above, the optimized circuits use fewer
LUTs than their inferred division counterparts, with a mean reduction in LUT count of
1.35×. The two cases in which the inferred divider uses fewer LUTs, division by 288 and
392 over 11-bit and 9-bit ranges, are likely to be due to LUT packing optimizations,
which struggle to pack these particular circuits into 6-LUTs.
4.4.5. Results and Discussion
In the two preceding sections, we have described a methodology for deriving optimized
application-specific memory sequencers from a polytope program description and eﬃ-
cient techniques for implementing such sequencers in FPGA hardware. In this section
we describe the benchmarks used to measure the eﬀectiveness of our proposed approach
in improving memory bandwidth and the cost of the necessary hardware circuits.
Four benchmarks have been selected to demonstrate our approach; code listings for
each are given in Appendix A.
Matrix-Matrix Multiply (MMM) In this benchmark two 50x50 dense matrices of 64-bit values
are multiplied using a classic 3-level loop nest implementation. This benchmark
must access two matrices simultaneously using the columns of one and the rows
of the other. The large strides in memory this implies mean that row-swaps occur
frequently within the inner loop of the benchmark. Neither the input matrices nor
the output matrix are aligned to SDRAM row boundaries when stored in row-major
or column-major order.
Sobel Edge Detection (SED) This benchmark is a 2D convolution of a 96x64 matrix of
16-bit pixel values with a 3x3 kernel. Because each iteration requires pixels from
three consecutive rows of an image and three consecutive pixels in each row, neither
row-major nor column-major storage of the input array can mitigate poor data locality. This
benchmark reuses data as a sliding window performs a convolution over an image.
The input and output image row sizes do not align with SDRAM row boundaries,
which makes manual optimization of this benchmark difficult.
Gaussian Backsubstitution (GBS) This benchmark optimizes the memory pattern in
a Gaussian back-substitution kernel using 32-bit data values. It illustrates that
the methodology presented is equally applicable to non-rectangular loop-nests and
to cases where not all data in a (rectangular) input array need be accessed. Furthermore, it
demonstrates a non-constant stride over the blocks in the input matrix. Only the
necessary upper triangular elements of the matrix are loaded into on-chip memory
buffers.
Gaussian Backsubstitution Blocked (BL GBS) This benchmark implements the same
computation as the GBS benchmark, but applies a blocking transformation to increase
the number of loop levels. This demonstrates our ability to refine the granularity
of our parameterisation using well-known loop transformations.
Each benchmark is expressed as C code and passed through our automatic flow. Static
control portions of the input code are marked with #pragma preprocessor directives and
a polytope description is automatically extracted using [90]. After transformation using
the methodology in Section 4.5.2, synthesizable address generators expressed in Verilog
are generated as the tool output. Each design is parameterized by inserting reuse buﬀers
at diﬀerent levels within the loop nest (where t = 1 implies a reuse buﬀer inserted outside
the outermost loop of the benchmark code).
The address generators produced were connected to the Altera High Performance
SDRAM Controller II [91] in a testbench environment which recorded the SDRAM interface
usage at each cycle and the overall benchmark run time. The designs were simulated
using a memory controller running at 333MHz with the address generation logic running
at the post-fit reported Fmax frequency and decoupled using an eight-entry FIFO buffer.
Sequence Characteristics
The number of memory accesses, bus transitions between read and write operations, and
SDRAM row swaps for each benchmark under the different parameterisations are
reported in Table 4.3. This reports the intrinsic efficiency of the controller in simulation
without considering the controller implementation and the impact that might have on
real memory throughput.
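To make the three sequence characteristics concrete, the following sketch counts them for an arbitrary request trace. It assumes a single bank, and it counts a transition or row swap whenever two consecutive requests differ in direction or row; the types `request_t` and `trace_stats_t` and the function `trace_stats` are hypothetical names, and the testbench used in our experiments may count these events differently.

```c
#include <assert.h>
#include <stddef.h>

/* One memory request: target SDRAM row and direction (0 = read, 1 = write). */
typedef struct {
    unsigned row;
    int is_write;
} request_t;

typedef struct {
    size_t accesses;       /* total number of requests */
    size_t rw_transitions; /* bus direction changes between consecutive requests */
    size_t row_swaps;      /* open-row changes between consecutive requests */
} trace_stats_t;

trace_stats_t trace_stats(const request_t *trace, size_t n)
{
    trace_stats_t s = { n, 0, 0 };
    assert(trace != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (trace[i].is_write != trace[i - 1].is_write)
            s.rw_transitions++;
        if (trace[i].row != trace[i - 1].row)
            s.row_swaps++;
    }
    return s;
}
```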
Benchmark  t  Memory Accesses  Read/Write Transitions  Row Swaps
MMM        1  10000            2                       40
MMM        2  132500           100                     686
MMM        3  255000           5000                    32950
MMM        4  500000           250000                  374502
SOB        1  17809            2                       211
SOB        2  30550            188                     554
SOB        3  116560           11656                   22527
SOB        4  139872           34968                   48303
SOB        5  209808           104904                  144435
GBS        1  2911             2                       21
GBS        2  7881             142                     508
GBS        3  15336            5112                    6919
BL GBS     1  2911             2                       21
BL GBS     2  3131             8                       42
BL GBS     3  7881             142                     508
BL GBS     4  15336            5112                    6919
Hardware Characteristics
An application-specific hardware controller is automatically generated for each parameterisation
of the four benchmarks. These were synthesized using Quartus with physical
synthesis optimizations turned on. Synthesis results are reported in Table 4.4 for a target
Fmax frequency of 333MHz (the maximum speed of the DRAM used). Alongside these
results, we report the amount of on-chip memory required in each benchmark parameterisation.
This assumes an on-chip memory mapping scheme such as in [27]. Here
a modulo-mapping is automatically derived which maps multiple off-chip memory elements
to a single memory address in the on-chip buffer. In a practical design, memories
might be replicated to increase the number of ports and enable parallel on-chip access by
independent processing elements. We assume a conservative design where the minimum
number of on-chip memories is allocated (hence, a single read port is made available to
the on-chip datapath).
MMM 1 60000 75 MHz 2149 184
MMM 2 20800 72 MHz 945 188
MMM 3 808 253 MHz 503 277
MMM Orig. 0 392 MHz 463 342
SOB 1 24276 60 MHz 3091 244
SOB 2 650 67 MHz 1519 224
SOB 3 38 184 MHz 915 306
SOB 4 14 255 MHz 629 391
SOB Orig. 0 398 MHz 759 605
GBS 1 21312 122 MHz 1050 236
GBS 2 588 263 MHz 598 217
GBS Orig. 0 383 MHz 252 164
BL GBS 1 21312 63 MHz 2508 272
BL GBS 2 5904 117 MHz 1610 414
BL GBS 3 588 138 MHz 870 353
BL GBS Orig. 0 250 MHz 478 337
Performance and Resource Usage
Resource usage for the synthesized address generators and on-chip memory buﬀers in
each design can be seen in Table 4.4. Critically, the diﬀerent parameterisations of each
benchmark enable a trade-oﬀ between the amount of on-chip memory used in the design
and the achievable performance. We observe that designs minimizing the wall clock
time used for memory transfer can be achieved at a cost of increased on-chip memory
usage. This is because increased on-chip memory allows more opportunity for memory
reuse, which reduces the number of accesses to off-chip memory, and allows reordering of
transactions, which increases the efficiency of off-chip memory access.
(a) Matrix-Matrix Multiply (MMM)
Axes: Address Sequencer LUT Usage; Onchip Memory Usage / KBytes
Figure 4.11.: Resource Usage versus Memory Access Time for (a) Matrix Multiply benchmark.
We show this trade-oﬀ graphically in Figures 4.11(a)–(c). In each figure, crosses denote
diﬀerent parameterizations of the benchmark designs, marking the amount of memory
required for each parameterization on the top horizontal axis and the time spent trans-
ferring data on the vertical axis. A dotted line denotes a Pareto-optimal front joining
discrete design points. Designs which are not on the Pareto-optimal front are dominated
by another discrete design point. For instance, the ‘t=2’ parameterization of the GBS
benchmark in Figure 4.11(c) achieves better performance than the ‘t=1’ parameterization
and uses fewer bytes of on-chip memory. Therefore the ‘t=2’ design point sits on the
Pareto-optimal front, while the ‘t=1’ design point is offset. In all other cases, our results
show monotonic non-linear improvement of memory performance with the dedication
of on-chip memory resources to data buffering.
Axes: Address Sequencer LUT Usage; Onchip Memory Usage / KBytes
Figure 4.11.: Resource Usage versus Memory Access Time for (b) Sobel Edge Detection
and (c) Gaussian Backsubstitution benchmarks.
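The notion of a dominated design point can be stated in a few lines of code. The sketch below marks the points of a (memory, time) design space that lie on the Pareto-optimal front; the `design_t` type and the function names are hypothetical, and any numbers passed to it are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned long memory_bytes; /* on-chip memory cost */
    unsigned long cycles;       /* time spent transferring data */
} design_t;

/* a dominates b if it is no worse in both metrics and strictly better in one. */
bool dominates(design_t a, design_t b)
{
    return a.memory_bytes <= b.memory_bytes && a.cycles <= b.cycles &&
           (a.memory_bytes < b.memory_bytes || a.cycles < b.cycles);
}

/* Sets on_front[i] when design i is not dominated; returns the front size. */
size_t pareto_front(const design_t *d, size_t n, bool *on_front)
{
    size_t count = 0;
    assert(d != NULL && on_front != NULL);
    for (size_t i = 0; i < n; i++) {
        on_front[i] = true;
        for (size_t j = 0; j < n; j++) {
            if (j != i && dominates(d[j], d[i])) {
                on_front[i] = false;
                break;
            }
        }
        if (on_front[i])
            count++;
    }
    return count;
}
```

A point that is both slower and larger than some other point is reported as off the front, which is the situation described for the offset design point above.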
A similar trade-oﬀ can be seen in the logic resource usage for each design. This is
shown in Table 4.4. In Figures 4.11(a)–(c), the diﬀerent design parameterizations are
shown as squares with the LUT count indicated on the bottom horizontal axis and the
memory access time for the design parameterization indicated on the vertical axis. The
Pareto-optimal front shown indicates that the number of LUTs required to implement our
synthesized address generators increases as the parameter t is decreased. This is because
the number of parametric variables (and hence the number of states in the generated state
machine) grows as t is decreased.
The trade-oﬀ is that increased memory reuse and transaction reordering means the
address generators with the highest LUT-usage also deliver the highest performance. An
important point to note is that both the available on-chip memory and available logic
resources on reconfigurable devices have, through process scaling, increased exponentially
over time whereas the available external memory bandwidth and device pin-count have
not. The demonstration of an automatic mechanism to trade off memory bandwidth for
on-chip resources has significance in future generations of reconfigurable logic, where
the growing gap between available on-chip memory and logic elements and off-chip
bandwidth threatens to limit realizable logic resource-utilization.
MMM     t=1    52780    1348     100      1182    55410
MMM     t=2    696604   17012    9128     16320   739064
MMM     t=3    1065680  703656   23876    40430   1833642
MMM     Orig.  2000000  5505398  1489596  197490  9192484
SED     t=1    71224    24764    80       2074    98142
SED     t=2    122188   19152    11052    3374    155766
SED     t=3    466228   182928   142000   17434   808590
SED     t=4    562132   511858   266444   29888   1370322
SED     Orig.  840076   1768960  863340   76994   3549370
GBS     t=1    11600    520      108      232     12460
GBS     t=2    32652    11200    4584     1002    49438
GBS     Orig.  61272    114486   47328    4802    227888
BL GBS  t=1    11632    3340     108      290     15370
BL GBS  t=2    12476    956      656      284     14372
BL GBS  t=3    31504    11744    10688    1118    55054
Figure 4.12.: Figure showing SDRAM bandwidth allocation by command type within a
single reference in the Gaussian Backsubstitution benchmark.
Further post place-and-route results for the derived address generators including clock
frequency are given in Table 4.4. The Fmax clock speeds of each implementation fall
significantly below the peak rated performance of the embedded hardware resources
on which they are implemented. As t is decreased for each benchmark, the conditions
evaluated at each clock cycle require more terms (and more multipliers) and a longer
critical path. The long critical paths in the arithmetic functions used to generate next-state
variables for our state machine implementation prevent higher clock rates. Since
the parameters calculated in one cycle are used to generate those in the next, pipelining
cannot be used without reducing the overall throughput of the address generator. Yet
in spite of this reduction in achievable clock frequency as t is decreased, the wall clock
time taken to load data in each benchmark reduces as t is decreased. A comparison of
the parameterization with t = 1 and the original code in Figure 4.5 shows reductions of 165×,
36×, 18× and 14× in the Total Cycles of the four benchmarks respectively. This significant
performance improvement occurs because our methodology increases opportunities for
reuse and reordering as t decreases (with buffers inserted at the outermost levels of the
loop nest). Three mechanisms can be identified which contribute to this.
1. The first is the reduction in the number of read requests through the exploitation of
data reuse. Data reuse means some data held in the on-chip reuse buffer is accessed
by more than one iteration within the loop nest. On-chip buffering means that only
a single request to external memory is made for each reused item. Table 4.4 shows
how the number of read cycles decreases as the reuse buffer is moved to the outer
levels of the loop nest (i.e. as t decreases).
2. The second reason for the reduction in overall wall clock time is improved locality
within each memory reference. We show that the reordering of data by our monotonic
address sequencing functions reduces the overall number of ‘precharge’ and
‘activate’ commands. Figure 4.12 shows a proportional breakdown of the SDRAM
commands generated by a single reference within the GBS benchmark. The length
of each of the two bars is proportional to the overall time taken for memory operations.
When parameterized at t = 1 and t = 3, the number of memory accesses is constant;
however, the monotonic order in which they are issued when t = 1 reorders those
accesses. This reduces the large overhead of ‘precharge’ and ‘activate’ cycles,
which in turn means a greater than 2× reduction in total loading time is seen over
parameterization at the innermost level (t = 3). The reduction of ‘precharge’
and ‘activate’ cycles in Figure 4.12 from 8998 cycles in the (t = 3) parameterization
to 2882 cycles in the (t = 1) parameterization is a direct result of the reordering of the
memory accesses. Furthermore, because refresh commands are issued at a constant
rate, the decrease in overall execution time leads to a proportional decrease in the
number of refresh commands issued (reduced from 406 to 290 cycles).
3. The third reason is the reduction in the interleaving of accesses to different arrays.
When bursts of memory access to different arrays are serialized, the inevitable interleaving
of accesses to different arrays introduces row-swaps. When parameterized
at the outermost level, each burst is longer and there are fewer total interleavings;
thus the overhead of ‘precharge’ and ‘activate’ commands is reduced.
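The third mechanism can be illustrated with a small model. The sketch below fetches n words from each of two arrays in alternating chunks and counts the resulting row swaps; longer chunks correspond to buffers inserted at outer loop levels. The function name, the single-bank assumption and the word-addressed layout are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts row swaps when n words from each of two arrays are fetched in
 * alternating chunks of `chunk` words. Array X starts at word address
 * base_x, array Y at base_y; each SDRAM row holds row_words words. */
size_t interleaved_row_swaps(size_t n, size_t chunk, size_t base_x,
                             size_t base_y, size_t row_words)
{
    size_t swaps = 0, last_row = 0;
    int have_last = 0;
    assert(chunk > 0 && row_words > 0);
    for (size_t start = 0; start < n; start += chunk) {
        for (int arr = 0; arr < 2; arr++) {
            size_t base = arr ? base_y : base_x;
            for (size_t i = start; i < start + chunk && i < n; i++) {
                size_t row = (base + i) / row_words;
                if (have_last && row != last_row)
                    swaps++;
                last_row = row;
                have_last = 1;
            }
        }
    }
    return swaps;
}
```

With two arrays that each occupy one 64-word row, alternating word by word swaps rows on every request, whereas fetching each array in a single chunk incurs only one swap.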
4.5. Reordering Inner-Loop Memory Accesses using Variable
Elimination Techniques
The approach presented above uses parametric integer linear programming and a strict
monotonic ordering to generate an address sequence. The parametric solution to the
optimization problem is implemented in digital hardware as a state machine. The results
presented show the method significantly improves memory eﬃciency. However, several
drawbacks of the approach appear. The most significant of these is that the delay through
the arithmetic logic necessary to evaluate the ‘next’ state for the sequencer is significant.
The critical paths for the slowest parameterised designs run through more than one
division operator and the achievable clock speeds for such designs fall significantly below
100MHz. The DDR2 memory used in our experiments can generate bursts of 4 or 8 data
words, and therefore requires a new address either on alternate cycles, or every fourth
cycle. So whilst the proposed methodology will improve memory performance in existing
DDR2 memory devices, it cannot achieve clock speeds which will saturate the memory
interfaces on a modern high-end DDR3 device. It should be noted that the architecture of
our memory controllers is the bottleneck here, since the LUTs and multiplier components
which make up the Stratix III chip onto which our designs were mapped will run at
450MHz [92].
In this section, we demonstrate an alternative method for creating application-specific
SDRAM controllers which will scale to high clock speeds and oﬀers additional flexibility
in the ordering of memory accesses. As an overview of the approach, we propose to
represent SDRAM Rows and Bursts explicitly in the Polytope Model as additional loop
variables (r and u respectively). We then seek to eliminate variables from the polytope
description, reducing the number of basis vectors, but crucially, preserving the cardinality
of the set of memory addresses SE. That means our elimination step must not add memory
addresses not accessed in the original source code. This elimination step exploits the data
reuse in the application by eliminating multiple accesses to a memory address.
After eliminating variables to a minimized set of basis vectors, we can use existing
code-generation tools [93] to produce eﬃcient polytope scanning code, using loop trans-
formations reordering the row and burst accesses for eﬃcient access. From this we can
produce eﬃcient hardware using a scheme similar to that introduced in Section 4.3. Code
produced in this manner is amenable to pipeline transformation to achieve high clock
frequencies with single cycle throughput.
We start this section with a further example to motivate our discussion and follow
this with an explanation of our methodology. This is followed with the results which
demonstrate the performance of address sequencers derived using this method.
4.5.1. Motivating Example
The example shown in Figure 4.13 provides a motivation for our methodology, and will
be used as a running example throughout this section to illustrate the algorithms. This is
a ‘toy’ loop nest with three levels (n = 3) and a memory array within the innermost loop.
The array A is assumed to reside in external SDRAM memory and, for didactic reasons,
we assume that each row in the memory has a length of 16 bytes and each individual
memory request is made of bursts of 4 bytes. In real SDRAMs, of course, the size of each
row is much greater (as is the number of accesses made in a real application), but we have
tried to keep this example as simple as possible to illustrate the key features and novelties
of our approach.
For simplicity of exposition, we assume the array A originates at address 0 in memory.
Thus element A[i] of the array resides at memory address i. Every memory request to
DDR2 SDRAM is made with the selection of a unique bank and row within the memory
followed by a burst request (of 4 or 8 words). Considering only a single bank within
the device, each memory address within that bank can be divided into three bit fields,
corresponding to SDRAM Row, Burst, and Byte Within Burst, as shown in Fig. 4.14. These
three fields can be represented as vectors in $\mathbb{Z}^3$ where the three dimensions denote the
Row, Burst and Byte Within Burst respectively.
char A[56];
for (x1 = 0; x1 <= 2; x1++) {
  for (x2 = 2 - x1; x2 <= 2; x2++) {
    for (x3 = x1; x3 <= x2; x3++) {
      .. = f(A[7*x1 + 8*x2 + 9*x3]);
} } }
Figure 4.13.: C source code for 3-level nested loop example.
Figure 4.14.: Memory address and associated fields.
When running the code in Figure 4.13 without transformation, a sequence of seven
memory requests is generated, as shown in order in Table 4.6. This sequence exhibits
several features. Firstly, the ordering of requests means that both Row 1 and Row 2
of the SDRAM are opened more than once. SDRAM timing constraints mean a significant
penalty is incurred when ‘activating’ and ‘precharging’ SDRAM rows, hence this is an
inefficient order in which to access the memory; a more efficient order would activate each row only
once. Secondly, in some rows, there are multiple accesses to the same burst; for example,
Row 2 Burst 0 is accessed both by (x1, x2, x3) = (0, 2, 2) and by (x1, x2, x3) = (1, 2, 1). If we
can store the data from the burst in on-chip memory and reuse it later in the computation,
we can reduce the number of external memory transactions. The final important feature
of this sequence is the presence of ‘holes’: not all bursts are accessed within each row;
indeed, burst number 1 is never accessed for any row. Careful attention to these holes is
important in ensuring code is correct (since spurious write operations corrupt data) and
efficient (since non-essential read operations reduce bandwidth efficiency).
Table 4.6.: Sequence of memory accesses generated by example in Figure 4.13.
Order  x1  x2  x3  Array Index  Row  Burst
1      0   2   0   16           1    0
2      0   2   1   25           1    2
3      0   2   2   34           2    0
4      1   1   1   24           1    2
5      1   2   1   32           2    0
6      1   2   2   41           2    2
7      2   2   2   48           3    0
Our aim, therefore, is to establish an automatic methodology for deriving an eﬃcient
memory subsystem capable of addressing these three features, by reordering external
memory accesses when appropriate, by storing reused data on-chip when possible, and
by ensuring only those memory locations accessed by the original code are accessed by
the derived memory subsystem.
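The three features of the sequence can be reproduced by enumerating the loop nest directly. The sketch below uses the loop bounds of the example, the indexing function f = [7 8 9] with h = 0, and the didactic 16-byte rows and 4-byte bursts; the type `access_t` and the function names are hypothetical, and a row is assumed to be reopened whenever consecutive requests target different rows.

```c
#include <assert.h>
#include <stddef.h>

#define ROW_BYTES   16 /* didactic row length from the example */
#define BURST_BYTES 4  /* didactic burst length from the example */

typedef struct {
    int index; /* array index 7*x1 + 8*x2 + 9*x3 */
    int row;   /* SDRAM row */
    int burst; /* burst within the row */
} access_t;

/* Enumerates the loop nest in program order; returns the number of accesses. */
size_t enumerate_accesses(access_t *out, size_t max)
{
    size_t n = 0;
    for (int x1 = 0; x1 <= 2; x1++)
        for (int x2 = 2 - x1; x2 <= 2; x2++)
            for (int x3 = x1; x3 <= x2; x3++) {
                int a = 7 * x1 + 8 * x2 + 9 * x3;
                assert(n < max);
                out[n].index = a;
                out[n].row = a / ROW_BYTES;
                out[n].burst = (a % ROW_BYTES) / BURST_BYTES;
                n++;
            }
    return n;
}

/* Row activations if a row is (re)opened whenever the target row changes. */
size_t row_activations(const access_t *acc, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || acc[i].row != acc[i - 1].row)
            count++;
    return count;
}

/* Number of distinct (row, burst) pairs touched by the sequence. */
size_t distinct_bursts(const access_t *acc, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++)
            if (acc[j].row == acc[i].row && acc[j].burst == acc[i].burst)
                seen = 1;
        if (!seen)
            count++;
    }
    return count;
}
```

In program order the seven requests open a row five times although only three distinct rows are touched, and they cover only five distinct bursts, which is the scope for reordering and reuse discussed above.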
4.5.2. Overview of Methodology
The steps in our compilation flow are described in the flowchart in Figure 4.15, with each
step enumerated below.
1. Parse the ‘C’ kernel code and construct a polytope representation.
2. For each memory reference, augment the polytope description with a variable rep-
resenting the SDRAM row (r) and burst (u) accessed.
3. For each memory reference, find a unimodular matrix and change of variables such
that the maximum number of variables can be eliminated from the polytope by the
suﬃcient conditions in [94].
4. Check the necessary conditions for elimination in [94] for the remaining variables
and use code generation tools to generate transformed code with a reordered loop
structure.
5. Generate pipelined hardware which implements the loop indexing function.
Clan / Rose Compiler Frontend → Augment polytope description with row and burst
variables → Find unimodular transformation for variable elimination using ILP →
Construct Cloog input to reorder remaining variables
Figure 4.15.: Flowchart showing steps in methodology.
In the sections which follow, we formulate an initial problem description and describe
each of these steps, demonstrating each transformation using our example code from
Figure 4.13.
4.5.3. Methodology
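For the loop in Figure 4.13, the polytope representing the specified loop bounds is given by
$$S_E = \{ x \in \mathbb{Z}^3 \mid Ax \le b \}, \quad
A = \begin{bmatrix} -1 & 0 & 0 \\ 1 & 0 & 0 \\ -1 & -1 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & -1 \\ 0 & -1 & 1 \end{bmatrix}, \quad
b = \begin{bmatrix} 0 \\ 2 \\ -2 \\ 2 \\ 0 \\ 0 \end{bmatrix}$$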
The loops contain memory references within the innermost loop, where the array
indexing functions are themselves aﬃne functions of the loop variables, i.e. of the form
A[fx + h] where f is an n-dimensional row vector and h is a scalar. For the example code,
f = [7 8 9], h = 0. We can describe the set SM of memory addresses accessed within a loop
nest as in (4.8). For our example code, if we were to enumerate the elements of this set,
we would obtain SM = {16, 24, 25, 32, 34, 41, 48}, as illustrated in Table 4.6. Crucially, for
each memory reference, the number of elements in the set SM is always less than or equal
to the number of iteration vectors in SE. This is because, while each memory reference
accesses only a single memory element, multiple iteration vectors can access the same
memory element. Exploitation of this is referred to as data reuse since elements in SM
could be stored in on-chip memory and reused on more than one iteration in SE.
SM = {fx + h | ∃x ∈ SE} (4.8)
Each of the memory accesses in SM corresponds to a specific row and aligned burst in
external SDRAM memory. Beyond data-reuse, we can achieve higher oﬀ-chip memory
bandwidth by reordering accesses so that accesses to the same row in external memory
are grouped together and thus the number of row-swaps (and associated ‘precharge’ and
‘activate’ commands) is minimized. We represent SDRAM rows and bursts explicitly in
the Polytope Model to help us reorder accesses for improved bandwidth efficiency.
Explicit Representation of SDRAM rows and bursts in the Polytope Model
The first step of our procedure is to explicitly represent the rows and bursts of SDRAM
access by introducing new variables into the polytope representing the iteration space.
If the size of each SDRAM row is R words, the row accessed by memory address fx + h
is given by r = (fx + h) div R = ⌊(fx + h)/R⌋, where ⌊·⌋ represents the floor function. If
the size of each SDRAM burst is B words, then the burst number is similarly given by
u = ⌊(fx + h − rR)/B⌋. Unfortunately, neither of these representations is amenable to linear
algebraic manipulation, due to the floor functions. However, because r and u are integers,
the floor functions are equivalent to the inequalities
$$\frac{fx + h}{R} - 1 < r \le \frac{fx + h}{R} \qquad (4.9)$$
$$\frac{fx + h - rR}{B} - 1 < u \le \frac{fx + h - rR}{B} \qquad (4.10)$$
which we can write as the linear inequalities below, without loss of information
$$fx + h - R + 1 \le Rr \le fx + h \qquad (4.11)$$
$$fx + h - rR - B + 1 \le Bu \le fx + h - rR \qquad (4.12)$$
We then add these 4 extra inequalities to those already present defining the loop bounds,
to form an augmented system of linear inequalities that completely describe not only the
iteration space, but the SDRAM rows and bursts accessed within the innermost loop:
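$$\begin{bmatrix} A & 0 & 0 \\ f & -R & 0 \\ -f & R & 0 \\ f & -R & -B \\ -f & R & B \end{bmatrix}
\begin{bmatrix} x \\ r \\ u \end{bmatrix} \le
\begin{bmatrix} b \\ R - 1 - h \\ h \\ B - 1 - h \\ h \end{bmatrix} \qquad (4.13)$$
The corresponding augmented system for our example in Figure 4.13 is shown below
$$\begin{bmatrix}
-1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
-1 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & -1 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 \\
7 & 8 & 9 & -16 & 0 \\
-7 & -8 & -9 & 16 & 0 \\
7 & 8 & 9 & -16 & -4 \\
-7 & -8 & -9 & 16 & 4
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ r \\ u \end{bmatrix} \le
\begin{bmatrix} 0 \\ 2 \\ -2 \\ 2 \\ 0 \\ 0 \\ 15 \\ 0 \\ 3 \\ 0 \end{bmatrix} \qquad (4.14)$$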
In principle, we can use this augmented definition of the polytope loop bounds to
rearrange the loops in our loop nest to move the variables r and u which iterate over
SDRAM rows and burst accesses respectively to the outer levels of the loop body. This
transformation gathers together the memory accesses to a specific SDRAM row, and
reduces the number of row swaps (‘activation’ and ‘precharging’ of rows) incurred.
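As a sanity check on the augmented description, the sketch below enumerates integer points in a small box and tests them against the ten inequalities of the running example (f = [7 8 9], h = 0, R = 16, B = 4). It confirms that each iteration vector admits exactly one (r, u) pair and that this pair equals the floor-function values. The array and function names are hypothetical, and the search box is chosen by hand to enclose the polytope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N_ROWS 10
#define N_COLS 5

/* Augmented system for the running example; columns are x1, x2, x3, r, u. */
static const int AUG_A[N_ROWS][N_COLS] = {
    {-1,  0,  0,   0,  0},
    { 1,  0,  0,   0,  0},
    {-1, -1,  0,   0,  0},
    { 0,  1,  0,   0,  0},
    { 1,  0, -1,   0,  0},
    { 0, -1,  1,   0,  0},
    { 7,  8,  9, -16,  0},
    {-7, -8, -9,  16,  0},
    { 7,  8,  9, -16, -4},
    {-7, -8, -9,  16,  4},
};
static const int AUG_B[N_ROWS] = {0, 2, -2, 2, 0, 0, 15, 0, 3, 0};

/* True if the point v satisfies every row of AUG_A * v <= AUG_B. */
bool in_augmented_polytope(const int v[N_COLS])
{
    for (int i = 0; i < N_ROWS; i++) {
        int lhs = 0;
        for (int j = 0; j < N_COLS; j++)
            lhs += AUG_A[i][j] * v[j];
        if (lhs > AUG_B[i])
            return false;
    }
    return true;
}

/* Counts integer points of the augmented polytope inside a small box and
 * records whether every point has r and u equal to the floor expressions. */
int count_augmented_points(bool *all_match_floor)
{
    int count = 0;
    assert(all_match_floor != NULL);
    *all_match_floor = true;
    for (int x1 = -1; x1 <= 3; x1++)
    for (int x2 = -1; x2 <= 3; x2++)
    for (int x3 = -1; x3 <= 3; x3++)
    for (int r = -1; r <= 4; r++)
    for (int u = -1; u <= 4; u++) {
        const int v[N_COLS] = {x1, x2, x3, r, u};
        if (!in_augmented_polytope(v))
            continue;
        int addr = 7 * x1 + 8 * x2 + 9 * x3; /* non-negative inside the polytope */
        if (r != addr / 16 || u != (addr % 16) / 4)
            *all_match_floor = false;
        count++;
    }
    return count;
}
```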
An example code where this transformation results in an optimal ordering of accesses
is shown in Fig. 4.16. The original code for this example is shown in Fig. 4.16(a). In-
terpreting the addition of explicit row and burst variables as the introduction of new
loop iteration variables, the augmented polytope description corresponds to Fig. 4.16(b),
where each of the two innermost loops iterates exactly once, by construction. By itself, this
transformation has only made the loop body more complex, however, it now allows us
to move the r and u variables to the outermost loops using standard loop transformation
techniques [95, 93], and add a buﬀer following [27] resulting in Fig. 4.16(c). Note now
that the x2 loop only iterates once and can thus be eliminated giving the end result shown
in Fig. 4.16(d). This code is far preferable, as it accesses each row only once, streaming
the data into a buﬀer, coalescing data reads into bursts where possible.
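The benefit of moving r and u outermost can be checked numerically for this 2-level example. The sketch below walks the address sequence of the original loop order and of the reordered loop order, counting how often a different row must be opened, assuming 16-byte rows and a single bank. The function names are hypothetical and the on-chip buffer itself is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ROW_BYTES 16
#define N_ADDR    256

/* Records one access and counts an activation if the open row changes. */
static void touch(int addr, bool *touched, int *open_row, int *activations)
{
    int row = addr / ROW_BYTES;
    if (row != *open_row) {
        (*activations)++;
        *open_row = row;
    }
    touched[addr] = true;
}

/* Original order: x1 outer, x2 inner, address x1 + 16*x2. */
int activations_original(bool touched[N_ADDR])
{
    int activations = 0, open_row = -1;
    assert(touched != NULL);
    memset(touched, 0, N_ADDR * sizeof touched[0]);
    for (int x1 = 0; x1 <= 15; x1++)
        for (int x2 = 0; x2 <= 15; x2++)
            touch(x1 + 16 * x2, touched, &open_row, &activations);
    return activations;
}

/* Reordered: row r and burst u outermost, x2 eliminated (x2 == r). */
int activations_reordered(bool touched[N_ADDR])
{
    int activations = 0, open_row = -1;
    assert(touched != NULL);
    memset(touched, 0, N_ADDR * sizeof touched[0]);
    for (int r = 0; r <= 15; r++)
        for (int u = 0; u <= 3; u++)
            for (int x1 = 4 * u; x1 <= 4 * u + 3; x1++)
                touch(x1 + 16 * r, touched, &open_row, &activations);
    return activations;
}
```

Both orders touch the same 256 addresses, but the original order changes row on every access while the reordered version opens each of the 16 rows once.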
In general, however, such a direct transformation may not be possible. As already
noted, the earlier example in Fig. 4.13 contains ‘holes’. Moving row and column accesses
to the outermost loop levels will, in this case, fill in the holes, introducing superfluous
reads/writes and/or requiring complex guard statements to skip the holes. Thus our
transformation engine aims to determine when such transformations can be safely ap-
plied, and manipulates the loop structure to allow their application. The first question to
address, therefore, is when a loop variable, e.g. x2 in Fig. 4.16, can be eliminated from the
augmented code without changing the set of memory locations accessed.
Variable Elimination
We may formalise the question: Is the set $\{fx + h \mid \exists x \in \mathbb{Z}^n, Ax \le b\}$ equal to another set
$\{f'y + h' \mid \exists y \in \mathbb{Z}^m, A'y \le b'\}$ for some choice of $f'$ (representing the new array indexing
function), $A'$ and $b'$ (representing the new loop bounds), with m < n? If the answer
is ‘yes’, this tells us that we may eliminate a variable, resulting in a lower complexity
addressing sequencer.
A related problem has been studied in the context of operational research by Williams [94],
who looked at the specific case $y = (x_1\ x_2\ \ldots\ x_{q-1}\ x_{q+1}\ \ldots\ x_n)^T$, i.e. the loop iterators are kept
the same, but one variable ($x_q$) is deleted (as in Fig. 4.16). Williams gives the following
sufficient conditions for this special case:
• The q th column in matrix A has at least one entry with the value +1 with corre-
sponding entries in all other rows being 0, negative or +1 or
• The q th column in matrix A has at least one entry with the value -1 with corre-
sponding entries in all other rows being 0, positive or -1
We generalise Williams’ result by trying to transform the loop body such that the above
conditions are satisfied. We draw on the theory of unimodular loop transformations [95, 96,
97] to write $\{fx + h \mid \exists x \in \mathbb{Z}^n, Ax \le b\} = \{fUz + h \mid \exists z \in \mathbb{Z}^n, AUz \le b\}$ for an arbitrary
unimodular matrix U, allowing us to apply Williams’ elimination procedure to the matrix
AU rather than to the original matrix A. We may therefore expose further opportunities
for variable elimination.
char A[256];
for (x1 = 0; x1 <= 15; x1++) {
  for (x2 = 0; x2 <= 15; x2++) {
    .. = f(A[x1 + 16*x2]);
} }
(a) Original Code
char A[256];
for (x1 = 0; x1 <= 15; x1++) {
  for (x2 = 0; x2 <= 15; x2++) {
    // Note: / is integer division. r and u loops have one iteration.
    for (r = (x1 + 16*x2)/16; r <= (x1 + 16*x2)/16; r++) {
      for (u = (x1 + 16*x2 - 16*r)/4; u <= (x1 + 16*x2 - 16*r)/4; u++) {
(b)
char A[256];
char buff[16][4][4];
for (r = 0; r <= 15; r++) {
  for (u = 0; u <= 3; u++) {
    buff[r][u][0..3] = burstread(r, u);
    for (x2 = r; x2 <= r; x2++) {
      for (x1 = 4*u; x1 <= 4*u + 3; x1++) {
(c)
char A[256];
char buff[16][4][4];
for (r = 0; r <= 15; r++) {
  for (u = 0; u <= 3; u++) {
    buff[r][u][0..3] = burstread(r, u);
    for (x1 = 4*u; x1 <= 4*u + 3; x1++) {
(d)
Figure 4.16.: C source code for 2-level nested loop example.
For our specific example from Fig. 4.13, we have added dimensions r and u to our
polytope description, alongside the original loop variables [x1, x2, x3]. Adding r and u
does not change the number of items in the set SM, since each memory reference addresses
data in exactly one row and one burst. Applying the unimodular transformation in (4.15),
we can transform our matrix describing loop bounds into that shown in (4.16).
$$U = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
-1 & 1 & 0 & 0 & 0 \\
0 & -1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}, \quad
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ r \\ u \end{bmatrix} = Uz, \quad AUz \le b \qquad (4.15)$$
$$AU = \begin{bmatrix}
-1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 \\
-1 & 1 & 0 & 0 & 0 \\
1 & 1 & -1 & 0 & 0 \\
1 & -2 & 1 & 0 & 0 \\
-1 & -1 & 9 & -16 & 0 \\
1 & 1 & -9 & 16 & 0 \\
-1 & -1 & 9 & -16 & -4 \\
1 & 1 & -9 & 16 & 4
\end{bmatrix} \qquad (4.16)$$
From (4.16), we can see that Williams’ sufficient conditions can be applied to eliminate
the z1 and z2 variables. The set of integer points enclosed by the polytope with z1
and z2 projected out has been reduced (from 7 to 5) by the transformation, but crucially,
the number of unique rows and bursts accessed is the same. The difference here is that
we have exploited the explicit representation of rows and bursts to enable data reuse.
Where a burst within a specific row was activated more than once in the original code, in
the transformed code it is only accessed once. The problem remaining is to find an
appropriate matrix U to enable variable elimination, which we address below.
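The claim about (4.16) can be checked mechanically. The sketch below multiplies the augmented matrix A of the running example by the matrix U of (4.15) and tests each column against Williams’ sufficient conditions as quoted above, reading ‘0, negative or +1’ as ‘at most +1’ and ‘0, positive or -1’ as ‘at least -1’. The function names are hypothetical and this reading of the conditions is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ROWS 10
#define VARS 5

/* Augmented constraint matrix of the running example (columns x1 x2 x3 r u). */
static int A[ROWS][VARS] = {
    {-1,  0,  0,   0,  0},
    { 1,  0,  0,   0,  0},
    {-1, -1,  0,   0,  0},
    { 0,  1,  0,   0,  0},
    { 1,  0, -1,   0,  0},
    { 0, -1,  1,   0,  0},
    { 7,  8,  9, -16,  0},
    {-7, -8, -9,  16,  0},
    { 7,  8,  9, -16, -4},
    {-7, -8, -9,  16,  4},
};

/* Unimodular matrix of (4.15): lower triangular with unit diagonal. */
static int U[VARS][VARS] = {
    { 1,  0, 0, 0, 0},
    {-1,  1, 0, 0, 0},
    { 0, -1, 1, 0, 0},
    { 0,  0, 0, 1, 0},
    { 0,  0, 0, 0, 1},
};

/* out = a * u */
void multiply(int a[ROWS][VARS], int u[VARS][VARS], int out[ROWS][VARS])
{
    assert(a != NULL && u != NULL && out != NULL);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < VARS; j++) {
            out[i][j] = 0;
            for (int k = 0; k < VARS; k++)
                out[i][j] += a[i][k] * u[k][j];
        }
}

/* Williams' sufficient conditions for eliminating the variable of column q. */
bool williams_eliminable(int m[ROWS][VARS], int q)
{
    bool has_plus = false, has_minus = false;
    bool all_le_1 = true, all_ge_m1 = true;
    for (int i = 0; i < ROWS; i++) {
        int e = m[i][q];
        if (e == 1)  has_plus = true;
        if (e == -1) has_minus = true;
        if (e > 1)   all_le_1 = false;
        if (e < -1)  all_ge_m1 = false;
    }
    return (has_plus && all_le_1) || (has_minus && all_ge_m1);
}
```

Under this reading, no column of the original matrix with coefficients ±7, ±8 or ±9 qualifies, while the first two columns of AU do, matching the elimination of z1 and z2.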
Integer Linear Program to Maximise Eliminable Variables
We wish to find a unimodular matrix U that enables a change of variable (x = Uz) such
that the maximum number of variables can be eliminated from a polytope by Williams’
conditions. To do this, we construct the formulation in Figure 4.17 and solve using
CPLEX[98]. The formulation expresses a matrix multiplication of the input matrix A
whose coeﬃcients are known, with the unknownmatrixUmade up of decision variables.
The elements of the resulting matrix AU are labelled Pi, j in our formulation. The decision
variables Di form the diagonal elements of the unimodular matrix U we are trying to
find, and Ni, j are the lower triangular elements. The upper triangular elements of the
unimodular matrix U are all zero. This ensures that the resulting matrix is unimodular
since all lower triangular matrices with diagonal elements of -1 or +1 are unimodular.
The formulation presented finds the optimal unimodular matrix for eliminating variables
by Williams’ conditions, subject to the constraint that each of the lower triangular
coefficients Ni, j is bounded in the range [-Sz, Sz]. This bound ensures that we can always
calculate a constant value Mi, j which is guaranteed to be larger than Pi, j, as required
by the constraints. The constraints containing Mi, j are trivially satisfied if the associated
binary variable (negi or posi) is zero. The variable negi is only allowed to become 1 if
all the elements in a column of P are less than or equal to 1, and posi only if all the
elements are greater than or equal to -1. If either negi or posi for a particular column is
non-zero, the corresponding sumi variable is allowed to become non-zero. Since the
optimization procedure maximizes the sum of the sumi variables, it effectively pulls up the
sumi variables, finding optimal values for the decision variables which form the unimodular
matrix: Di for the diagonal elements and Ni, j for the lower triangular elements. The binary
values sumi declare whether, under the change of variables x = Uz, the variable zi is eliminable.
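As a concrete illustration of what the formulation searches for, the following C sketch checks a given candidate matrix U and counts the columns of P = AU for which posi or negi could be set. It is only a checker for a fixed U, not the ILP itself; the function names, the row-major array layout and the small matrices in the usage note are assumptions made for illustration and do not come from the thesis.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the n x n row-major matrix U is lower triangular with
 * every diagonal element equal to +1 or -1 (and hence unimodular). */
int is_lower_unimodular(const int *U, int n)
{
    assert(U != NULL && n > 0);
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++)
            if (U[i * n + j] != 0)
                return 0;               /* non-zero above the diagonal */
        if (U[i * n + i] != 1 && U[i * n + i] != -1)
            return 0;                   /* diagonal must be +1 or -1 */
    }
    return 1;
}

/* Forms P = A * U (A is m x n, U is n x n, both row-major) one column at
 * a time and counts the columns whose entries are all <= 1 (neg could be
 * set) or all >= -1 (pos could be set), mirroring the indicator
 * constraints described in the text. */
int count_eliminable_columns(const int *A, int m, int n, const int *U)
{
    assert(A != NULL && U != NULL && m > 0 && n > 0);
    int count = 0;
    for (int j = 0; j < n; j++) {
        int all_le_one = 1;
        int all_ge_minus_one = 1;
        for (int k = 0; k < m; k++) {
            int p = 0;                  /* P[k][j] */
            for (int i = 0; i < n; i++)
                p += A[k * n + i] * U[i * n + j];
            if (p > 1)
                all_le_one = 0;
            if (p < -1)
                all_ge_minus_one = 0;
        }
        if (all_le_one || all_ge_minus_one)
            count++;
    }
    return count;
}
```

For the hypothetical matrix A = [[2, 1], [-3, 1]], the identity leaves the first column (2, -3) ineligible, so one column qualifies. With U = [[1, 0], [-2, 1]] the first column of AU becomes (0, -5) and both columns qualify, which is the kind of improvement the ILP is looking for.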
Integer Linear Program for finding a unimodular matrix to maximize variable elimination
by Williams’ conditions [94]. Ak,m is the input integer matrix, Sz is a bound

% restricts lower triangular elements to [-Sz, Sz]
1 ≤ i ≤ n, 1 ≤ j < i :      Ni, j ≤ Sz
1 ≤ i ≤ n, 1 ≤ j < i :      Ni, j ≥ −Sz
1 ≤ i ≤ n, i ≤ j ≤ n :      Ni, j = 0
% sumi is 0 if posi and negi are both zero
1 ≤ i ≤ n :                 posi + negi − sumi ≥ 0

% Mi, j is a precomputed value guaranteed to be larger than Pi, j
1 ≤ i ≤ n, 1 ≤ j ≤ n :      Pi, j + (Mi, j − 1) negi ≤ Mi, j
1 ≤ i ≤ n, 1 ≤ j ≤ n :      Pi, j − (Mi, j − 1) posi ≥ −Mi, j
1 ≤ i ≤ n :                 Di ∈ {−1, 1}
1 ≤ i ≤ n :                 sumi ∈ {0, 1}
1 ≤ i ≤ n :                 posi ∈ {0, 1}
1 ≤ i ≤ n :                 negi ∈ {0, 1}
1 ≤ i ≤ n, 1 ≤ j ≤ n :      Pi, j ∈ Z
1 ≤ i ≤ n, 1 ≤ j ≤ n :      Ni, j ∈ Z
Figure 4.17.: Finding a unimodular matrix which maximises the number of eliminable
columns.
Having found a unimodular matrix which gives a change of variables and allows
elimination of variables, we apply that unimodular transformation to the original polytope
and eliminate the appropriate variables by a Fourier-Motzkin projection [99]. In our
example code, the unimodular matrix in (4.15) is generated, which when multiplied by
the bounds in (4.14) gives (4.16), which allows for the elimination of the z1 and z2 indices
by the sufficient conditions in [94]. In our experiments, all the ILP formulations generated
were solved in under one second on a desktop PC.
After applying the optimal unimodular transformation, further necessary conditions
from [94] are checked to see if those variables not identified as eliminable in the ILP
formulation (which quickly checks for suﬃcient conditions) can be eliminated without
creating holes. The interested reader is referred to the final section of [94] for further
explanation of these conditions, with the note that the complexity of checking these
conditions is dependent on the coeﬃcients of the loop bounds, which increase with the
size of the data-set to be processed. Our ILP approach scales instead with the number of
nested loops which is independent of the size of the input data.
For our example code in Figure 4.13, checking these necessary conditions shows that
the remaining variable (z3) can be eliminated from the row dimension without creating
holes, but cannot be eliminated from the burst dimension without creating holes. This
is consistent with our sequence of memory accesses shown in Table 4.6. We use this
information to reorder the loops.
Since all the variables can be shown to be eliminable from the ‘r’ dimension, that
dimension is traversed at the outermost level of the generated loops, followed by the only
remaining existential variable shown not to be eliminable in our ILP formulation, z3. This
loop variable is nested inside the ‘r’ variable but outside the ‘u’ variable in the resulting
code, because it can be eliminated without causing holes in the ‘r’ dimension, but cannot
be eliminated without causing holes in the ‘u’ dimension. All the other dimensions can
be safely projected out. When this projection and ‘C’ code generation is performed using
Cloog [93], the data transfer code in Figure 4.18 is produced.
In this generated code, we observe that the sequence of rows accessed is monotonic,
bursts within each row are accessed only once and the set of rows and bursts accessed
is exactly the set in the original code description (i.e. the holes in the original set are
preserved). The access pattern of this transformed code contains two fewer memory
accesses and two fewer row activations. As a result, its usage of scarce external memory
bandwidth is more efficient than the code in the original example. The final stage of our
procedure is to generate an eﬃcient hardware address generation function to implement
this transformed loop structure.
char buff[56];
for (r = 1; r <= 3; r++) {
  for (z3 = ceil((16 * r + 2) / 9.0);
       z3 <= min(6, 2 * r + 1); z3++) {
    for (u = -4 * r + 2 * z3;
         u <= -4 * r + 2 * z3; u++) {
      /* ... burst access to burst u of row r ... */
    }
  }
}
Figure 4.18.: Transformed source code for memory accesses in example code from
Figure 4.13.
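The properties claimed for the transformed code can be checked by simply walking the loop bounds of Figure 4.18. The sketch below is a hedged illustration: the type and function names are hypothetical, and the ceiling is computed with an integer helper (valid here because the numerator is positive) rather than a floating-point ceil().

```c
#include <assert.h>

#define MAX_ACCESSES 16

typedef struct { int r; int u; } access_t;   /* (row, burst) pair */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Ceiling of a / b for a >= 0 and b > 0, using integer arithmetic only. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Walks the loop nest of Figure 4.18 and stores each (row, burst) pair
 * in out[]. Returns the number of pairs written. */
int enumerate_accesses(access_t *out, int capacity)
{
    int n = 0;
    for (int r = 1; r <= 3; r++) {
        int z3_lo = ceil_div(16 * r + 2, 9);
        int z3_hi = min_int(6, 2 * r + 1);
        for (int z3 = z3_lo; z3 <= z3_hi; z3++) {
            /* The u loop has equal lower and upper bounds, so it
             * executes exactly once per value of z3. */
            int u = -4 * r + 2 * z3;
            assert(n < capacity);
            out[n].r = r;
            out[n].u = u;
            n++;
        }
    }
    return n;
}
```

Under these bounds the walk visits five (row, burst) pairs, with the row index never decreasing, which matches the reduction from 7 to 5 points noted earlier in this section.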
Code Generation
The tool Cloog [93] is used to generate nested loop structures by traversing the integer
points within an input polytope in a specified order. Cloog generates an abstract syntax
tree which can be directly translated into ‘C’ code. For our example, the generated code
is given in Figure 4.18. We choose to work with this abstract syntax tree to produce
pipelined streaming hardware which implements the loop index generation.
Statements which may occur in the abstract syntax tree include ‘for’ statements, ‘assignment’
statements and ‘compound’ statements describing the serial composition of more
than one statement. The expressions within those statements include integer division,
multiplication by a scalar, reduction of a vector using min, max and summation functions
and modulus, floor and ceiling functions. We want to produce hardware which emits
each integer point in the optimized polytopes in a sequence. We follow Quilleré’s algorithm
[77] for code generation, which relies on recursive projection using Fourier-Motzkin
elimination onto loop indices to derive nested bounds functions.
The expressions contained within the upper and lower bounds of the ‘for’ statements and
within the right hand side of assignment statements can be quite complex and, without
pipelining, negatively impact the achievable clock frequency. However, because Cloog
derives nested loop structures in which the inner loop indices only depend on indices in
outer loops, we can arrange the logic as a feed-forward pipeline with distributed control,
adding arbitrary pipeline stages and using the auto-pipelining features of a logic synthesis
tool (Altera Quartus II [85]) to distribute them in a manner which minimizes the length
of the critical timing path. This ensures that our hardware implementation of address
generation is scalable to meet future requirements for high clock-speeds.
The synthesis and transformation of the Cloog AST format into hardware address
generators is done using another custom synthesis tool. The tool, which is approximately
15K source lines of C++ code, takes the abstract syntax tree emitted from Cloog and
transforms it into a hardware pipeline capable of producing a new iteration vector every
cycle. The tool is able to generate code for perfectly nested loops and loop nests which
contain sequential composition of statements (for instance, imperfectly nested loops).
This allows us to generate hardware from a union of polytopes, with any general aﬃne
schedule function specifying the sequencing order.
Verilog code representing the generated hardware pipeline is emitted for synthesis and
implementation in an FPGA. Results are reported in Section 4.5.4 showing the eﬃciency
of each of the generators produced using this tool.
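To make the idea of an index generator concrete, the following C sketch is a purely behavioural software model: each call to gen_next() yields the next iteration vector, much as the hardware is described as emitting one per cycle. It says nothing about pipelining or timing and is not the thesis tool's output. The structure, the function names and the choice of the Figure 4.18 bounds as the worked case are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 3

/* Bound of loop `level`, evaluated from the indices of enclosing loops. */
typedef int (*bound_fn)(int level, const int *x);

typedef struct {
    int x[DEPTH];          /* current iteration vector */
    int started;
    int done;
    bound_fn lower, upper;
} index_gen;

void gen_init(index_gen *g, bound_fn lower, bound_fn upper)
{
    assert(g != NULL && lower != NULL && upper != NULL);
    for (int i = 0; i < DEPTH; i++)
        g->x[i] = 0;
    g->started = 0;
    g->done = 0;
    g->lower = lower;
    g->upper = upper;
}

/* Advances to the next integer point in lexicographic order. Returns 1
 * with the point in g->x, or 0 once the loop nest is exhausted. */
int gen_next(index_gen *g)
{
    int level;
    if (g->done)
        return 0;
    if (!g->started) {
        g->started = 1;
        level = 0;
        g->x[0] = g->lower(0, g->x);
    } else {
        level = DEPTH - 1;
        g->x[level]++;
    }
    for (;;) {
        if (g->x[level] > g->upper(level, g->x)) {
            /* Range exhausted (or empty): carry into the enclosing loop. */
            if (level == 0) {
                g->done = 1;
                return 0;
            }
            level--;
            g->x[level]++;
        } else if (level == DEPTH - 1) {
            return 1;                  /* valid innermost point */
        } else {
            level++;                   /* descend and reset inner index */
            g->x[level] = g->lower(level, g->x);
        }
    }
}

/* Bounds of the Figure 4.18 nest, with x = (r, z3, u). */
int fig418_lower(int level, const int *x)
{
    if (level == 0) return 1;
    if (level == 1) return (16 * x[0] + 2 + 8) / 9;   /* integer ceiling */
    return -4 * x[0] + 2 * x[1];
}

int fig418_upper(int level, const int *x)
{
    if (level == 0) return 3;
    if (level == 1) return (2 * x[0] + 1 < 6) ? 2 * x[0] + 1 : 6;
    return -4 * x[0] + 2 * x[1];
}
```

Because each level's bounds read only the indices of outer levels, the same dependence structure that permits a feed-forward arrangement in hardware is visible here in the signatures of the bound functions.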
4.5.4. Results
In the results in Table 4.7, we can see the total cycles required to fetch data in each
benchmark decreases as we decrease the parameterisation level (t). This is in part due to
data reuse. When data reuse buﬀers are inserted at the outermost levels of the loop (t = 1),
all accessed data is preloaded into on-chip memory at the start of execution, and fetched
from on-chip memory during execution. We would expect to see a significant reduction
in the number of ‘read’ / ‘write’ cycles on the external interface as the parameterisation
level is reduced and more data is buﬀered on-chip. If we compare the original code
(t = n+1) with the (t = 1) parameterisation in Table 4.7, we see results consistent with this
expectation, with a 400×, 94× and 33× reduction in the number of ‘read’ / ‘write’ cycles
in each respective benchmark.
Alongside this evidence of data reuse, Table 4.7 also shows a breakdown of the total
benchmark time into the cycles in which the interface performs ‘reads’ and ‘writes’, the
cycles in which it is idle due to bus turnaround time (transition from read-to-write and
vice-versa) and the ‘precharge’ / ‘activate’ and ‘refresh’ cycles lumped together with their
respective delay cycles. This information is also presented visually in Figure 4.19. From
this we can see that the reordering of memory transactions through our loop transforma-
tions increases the eﬃciency of the memory interface usage. In the original code in each of
the three benchmarks, approximately 75% of memory interface cycles are used for the control overhead
of changing SDRAM rows and bus turnaround cycles. Since our static analysis approach
groups together the memory requests which occur in rows and bursts, it significantly
increases the eﬃciency of the external memory interface. If we compare the original
code with the parameterisations which have reuse buﬀers inserted outside the innermost
loop level (t = 3 for MMM, t = 4 for SOB and t = 2 for GBS), we see a reduction in the
proportion of memory cycles used for ‘precharge’ and ‘activate’ commands from 73.24%
to 21.27%, from 70.80% to 59.35% and from 57.12% to 23.45% in the MMM, SOB and GBS
benchmarks respectively. The gains in efficiency (the proportion of interface cycles spent
on ‘read’ / ‘write’ operations, relative to the original code) vary from 3.6× to 4× across the
benchmarks and their associated parameterisations.
Figure 4.19.: SDRAM Memory Interface Utilization: Breakdown by Command Type.
In order to show that the eﬃciency gains arising from this reordering are achievable
at a reasonable cost, we report synthesis and Fmax results from the slow 1100mV corner
of the static timing analysis tool in Quartus 10.1 with physical synthesis and register retiming
options enabled. Registers were inserted by our code-generation flow to ensure the
address generator met a 166MHz clock frequency. This corresponds to half the command
frequency of our external DDR2 memory since the minimum burst size of DDR2 memory
(4 words) means two clock cycle periods are needed to process consecutive back-to-back
memory requests. It should be noted that since our address sequence generators can be
pipelined to an arbitrary depth, they are scalable to future memory speeds at the cost of
increased register count and initial latency.
The post place-and-route maximum frequency of our address generator designs and
their resource requirements (ALUTs and registers) are reported in Table 4.8. We show the
number of on-chip memory words needed to implement our three benchmarks, inserting
reuse buﬀers at diﬀerent levels in the loop nest. This is reported in words rather than
an absolute number of bytes to reflect the fact that external SDRAM interfaces typically
bundle together multiple parallel data-channels with commands issued by a single set
of control signals (which allows scaling of our benchmark runs from 8-bit to 64-bit data
types). The synthesis results show that our backend generation tool will scale to useful
clock-frequencies. The logic resource utilization of the address generators at diﬀerent
parameterisation levels ranges from 0.5× to 1.4× when
compared to the original code in the three benchmarks. It should be noted however that
even the largest address generator presented uses less than 4% of the smallest available
Stratix III device (EP3SL50). From the synthesis results reported, we can conclude that
our address generators achieve their reordering at a very reasonable logic cost, and will
scale to useful clock frequencies in modern devices.
Table 4.8.: Synthesis results for benchmark codes.
Benchmark  Level  Req. on-chip  ALUTs  Regs  Frequency
                  mem. words                 (MHz)
MMM        t=1    60000         575    764   296
MMM        t=2    20800         1050   1666  174
MMM        t=3    808           1346   2098  179
MMM        Orig.  0             1003   2740  184
SOB        t=1    24276         592    717   300
SOB        t=2    650           1551   2251  182
SOB        t=3    38            1355   1907  144
SOB        t=4    14            1200   2566  153
SOB        Orig.  0             1107   3607  148
GBS        t=1    21312         833    1156  242
GBS        t=2    588           952    1366  211
GBS        Orig.  0             804    2263  186
To explore the trade-off between on-chip memory resources and external memory interface
performance, Figure 4.20 shows how the overall number of memory access cycles
scales with the amount of on-chip memory dedicated to buﬀering data for each of the
benchmarks. From this we can see that if all the data in the MMM benchmark can be
stored on-chip, 1500× fewer memory access cycles are needed to transfer data from ex-
ternal memory. What is more significant about this plot however, is that it shows that
one can, using an automatic tool, explore more reasonable trade-oﬀs such as the t = 3
parameterisation for the SOB benchmark, which achieves a 6.6× reduction in memory
access cycles at a cost of 20 additional words of on-chip memory. In each benchmark,
our Pareto-optimal fronts show multiple feasible points for on-chip memory usage and
automatic generation of address generators using our methodology allows evaluation of
the performance trade-oﬀ each embodies early in the design cycle.
While other work, such as [27], demonstrates similar trade-oﬀs between on-chip mem-
ory usage and performance due to data reuse, our explicit representation of SDRAM rows
and bursts and static reordering of memory transactions achieves additional performance
gains from the eﬃcient utilization of the memory interface; the SOB t = 3 parameterisa-
tion achieves 6.6× better performance than the original code, despite having only a 4.7×
reduction in ‘read’ / ‘write’ cycles due to data reuse. This is because by reordering transac-
tions we have reduced the proportion of ‘precharge’ / ‘activate’ cycles in that benchmark
from 70.8% of the total interface cycles to 52.5%. Hence we can conclude that both the
data reuse uncovered using our methodology and the transaction reordering achieved
Figure 4.20.: Pareto-optimal fronts showing designs parameterised at diﬀerent levels.
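The relationship between the two figures quoted for the SOB benchmark at t = 3 can be made explicit with a short worked equation. The notation is introduced here and is not the thesis's: T denotes total interface cycles, C_rw the number of ‘read’ / ‘write’ cycles, and η = C_rw / T the interface efficiency. The sketch assumes the 6.6× figure refers to total interface cycles.

```latex
\[
\frac{T_{\mathrm{orig}}}{T_{t=3}}
  = \frac{C_{rw,\mathrm{orig}}}{C_{rw,t=3}}
    \times \frac{\eta_{t=3}}{\eta_{\mathrm{orig}}}
\quad\Longrightarrow\quad
\frac{\eta_{t=3}}{\eta_{\mathrm{orig}}}
  \approx \frac{6.6}{4.7} \approx 1.4
\]
```

Under that assumption, roughly a 1.4× factor of the overall gain would be attributable to improved interface utilisation rather than to data reuse alone.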
In Table 4.9, we report the tool runtime. The runtimes are aggregated across all param-
eterisations of the benchmarks. We report separately the time spent checking suﬃcient
conditions for variable elimination using our ILP formulation and the time spent exhaus-
tively checking the necessary conditions for elimination of a variable in the event that
the suﬃcient conditions are not met. From Table 4.9, we note that for each benchmark,
the mean time for checking the suﬃcient conditions for variable elimination using our
ILP formulation is less than a second, with a narrow standard deviation. In comparison,
the mean time taken to check the necessary conditions for variable elimination (for those
variables which cannot be eliminated by the suﬃcient conditions) is much greater, and
varies much more significantly between the diﬀerent parameterisations of each bench-
mark. This is because the runtime of our ILP formulation depends on the number of loop
variables (n) present in the source code while checking the necessary conditions takes
time proportional to the number of points in the iteration space SE. In practice this means
the time taken to check the necessary conditions for variable elimination scales poorly with
large benchmarks, but runtimes are greatly improved if we first eliminate the variables
which meet the sufficient conditions using our ILP formulation.
Table 4.9.: Tool Runtime.
Benchmark  Time Taken / s (Suf. Cond.)   Time Taken / s (Nec. Cond.)
           µ (mean)   σ (std. dev.)      µ (mean)   σ (std. dev.)
MMM        0.24       0.03               25.62      30.36
SOB        0.44       0.20               518.63     1009.05
GBS        0.26       0.01               0.41       0.57
Together these results show that our ILP formulation, and the checking of sufficient
conditions for variable elimination using Williams’ results [94] allow us to produce safe
performance enhancing loop transformations with reasonable compile-time. The
methodology and tool built around these insights allows the automatic production of
address generators whose logic cost is reasonable in modern devices and whose frequency
scales to useful clock speeds. Parameterisation allows the trade-off of on-chip memory
resources for performance by both a reduction in the amount of data transferred on the
external memory interface and an improvement in the efficiency of that interface through
reduction in the proportion of interface cycles required for ‘precharge’ and ‘activation’.
4.6. Comparison Between Proposed Memory Scheduling
Techniques
The two memory scheduling methodologies introduced in this chapter both develop
application-specific memory scheduling hardware extracted from a polytope model rep-
resentation of statements within a loop nest. Using both techniques, we have demon-
strated a reduction in the number of memory transfers through the exploitation of data
reuse from on-chip memory and an increase in memory interface eﬃciency through the
reordering of commands and a reduction in the number of ‘precharge’ and ‘activate’
commands due to row-swaps. In this section we consider the relative merits of each
technique.
In the first of our proposed methodologies, we restricted the memory accesses to a
strictly monotonic schedule. This is an unnecessarily strict restriction on the ordering
of the memory operations. The key problem with this method is the complex logic
generated to ensure monotonicity. In some cases, the automatically-generated binary
decision tree expresses simple memory sequences in a less-than-obvious manner. As
an example, the generated code from our example shown in Figure 4.7(b) generates a
sequence of memory addresses containing only addresses that are divisible-by-two. An
obvious manually-derived design would implement this as a counter with an output
left-shifted by one place. However the solution given by the parametric integer linear
programming tool generates a binary tree of conditions which must choose between three
different recurrence relations to determine the next state. This is a non-obvious solution.
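For comparison, the simple hand-derived design mentioned above can be modelled in a few lines of C. This is a hedged sketch: the names are hypothetical, the sequence is assumed to start at address zero, and the shift is taken to be by one bit position towards the most-significant end so that the least-significant address bit is always zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State of the hand-derived generator: a plain binary counter. */
typedef struct { uint32_t count; } even_addr_gen;

void even_gen_reset(even_addr_gen *g)
{
    assert(g != NULL);
    g->count = 0;
}

/* Returns the next address in the sequence 0, 2, 4, ... by shifting the
 * counter value by one bit, then advances the counter. */
uint32_t even_gen_next(even_addr_gen *g)
{
    assert(g != NULL);
    uint32_t addr = g->count << 1;
    g->count++;
    return addr;
}
```

The contrast with a binary decision tree choosing between several recurrence relations is the point being made: both produce the same address stream, but this form has a single state update and no conditions.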
This complexity arises because the method explicitly calculates a loop variable x for
each associated memory-address in the sequence such that fx+h = R￿. In some use-cases,
this may be a very useful feature, for instance, if one wishes to implement the method
in [27] which requires the original loop variables x from which to calculate an eﬃcient
mapping for on-chip memory storage. However, if the loop variables are not used, the
technique in Section 4.4 will generally produce hardware solutions which require more
logic to implement than the hardware sequencers produced in Section 4.5.
Throughout this chapter, we assumed that memory access and execution did not occur
concurrently. If this restriction is removed and communication operations are overlapped
with concurrent datapath execution operations (as we do in Chapter 5), it becomes more
important that the sequence of data fetched from memory and that consumed by the
data path are matched, to ensure that the datapath is not forced to stall waiting for
necessary data. In this case, the hardware sequencers produced by variable elimination
in Section 4.5 have more flexibility to match the schedules of memory access thread and
datapath threads, which gives more scope for overlapping operations in the two threads.
The biggest difference in the implementation of the two scheduling methodologies is in
the achievable clock frequency. When a monotonic schedule is derived, the delay through
the generated recurrence relationship limits the throughput of the implementation. In
the elimination approach, pipelining can be used to generate hardware with high clock
frequencies which will scale to meet the challenge of saturating the available memory
bandwidth in future memory technologies. In the hardware sequencers designed using
parametric integer linear programming (our first approach), we see an increase in the
achievable clock frequency as the parameterisation level is increased, because the depth of
the critical path in the recurrence relationship is reduced. In contrast, where hardware
sequencers are developed using the variable elimination technique in Section 4.5, we see the
opposite relationship: as we increase the parameterisation level, a lower clock frequency
is achieved. This difference arises because the hardware implementations used for the
monotonic scheduling are limited by the delay through their datapath, which grows with
the number of Gomory cuts required to ensure monotonicity (which is related to the
number of loop levels processed by the inner loop hardware). In the variable elimination
approach, the frequency is limited by control hardware managing the pipeline operations,
which grows linearly with the number of loop levels (i.e. clock speed is now related
to the number of outer loop levels). A similar observation can be made regarding the
register utilization in the proposed designs. Where variable elimination creates deeply
pipelined address sequencers, the number of registers grows with the depth of the loop
nest implemented. A significant number of registers is required to maintain the pipeline
state as the diﬀerent memory references are interleaved. Hence the number of registers
grows with the level of parameterization. In comparison, the number of registers in
the strictly monotonic scheduling variants grows much more slowly because the simple affine
conditions which form the upper and lower loop bounds of the outer loop do not form
part of the design critical path.
Beyond these quantitative measures, one key advantage of the methodology presented
in Section 4.5 is that, after variable elimination, the output is a polytope. This means
that techniques for analysing polytopes (for instance, the integer counting techniques
presented in Chapter 3) can be used to examine the properties of the output, and aid in
the integration of hardware sequencers into a complete design.
4.7. Summary
In this chapter we have presented two novel methodologies for synthesizing application-
specific address generators which exploit data reuse and reorder data accesses. Both
approaches exploit memory reuse to reduce the number of data access cycles and com-
mand reordering to increase the eﬃciency of data access. In selected benchmarks, data
reuse allows a 50× reduction in the number of memory accesses.
The constraints which ensure the sequence of memory addresses requested by each ac-
cess function is strictly increasing ensure an efficient use of SDRAM memory bandwidth.
When exploited together, these two aspects provide up to 165× reduction in overall wall
clock time in memory intensive benchmarks.
In parameterized designs, we have demonstrated a procedure capable of trading on-
chip memory and logic resource for memory performance. This is significant because
process-scaling has historically delivered exponential growth in FPGA logic resource and
on-chip memory, while memory-bandwidth and package pin-count have increased at
a much slower rate. In Chapter 5, we use the techniques from Section 4.5 to exploit
data reuse from on-chip memory and reordering of memory transactions, whilst also
overlapping concurrent operations in datapath and memory access threads to reduce
overall execution time.
5. Predictable Memory Access Scheduling
using Integer Point Counting Techniques
5.1. Introduction
In the previous chapter, we demonstrated a method for creating an application-specific
memory address sequencer specialized to provide optimal SDRAM interface bandwidth
for the memory accesses within a targeted loop-nest. The method extracted a high-level
computational model from source code, and produced two parallel threads, one dedicated
to memory access and another performing datapath operations. These threads communicate
through on-chip memory, which enables scheduling freedom in the memory access
thread which can be exploited to reorder memory accesses and improve memory interface
bandwidth efficiency. We demonstrated a trade-off between the amount of on-chip
memory required for buﬀering data and the overall eﬃciency of the external memory
interface.
Furthermore, we described how an eﬃcient hardware memory address sequencer can
be automatically constructed from the transformed memory access thread and demon-
strated the impact that reuse and reordering can have on memory access time in a real
system. Novel methods for realising the derived memory address sequencers using dig-
ital hardware allowed us to show bandwidth eﬃciency improvements of between 3.6×
and 4× across three representative benchmarks.
Both methodologies presented in Chapter 4 described memory reordering transforma-
tions which changed the order in which memory accesses were made in the memory
access thread to improve memory interface performance. In the datapath thread, we
preserved the original ordering of operations from the original source code to ensure
data dependencies are correctly observed. Throughout, we assumed that the memory
access and datapath threads may not issue concurrent operations. In this chapter we ex-
plore techniques which allow scheduling of concurrent operations in the memory access
and datapath threads. We illustrate our general techniques using a small representative
example code, using it both to provide the context needed to understand the proposed
technique and to demonstrate the performance improvement that can be realised when
memory access and datapath operations run concurrently.
To ensure that correct behaviour is observed, there are constraints on the relative ordering
of operations between threads. The most fundamental requirements are:
• At any specific point in time where data is consumed by the datapath thread, the
data must have been read from external memory in the memory access thread at
some earlier point in time and be ready in the on-chip memory.
• At any specific point in time when data is produced by the datapath thread, the data
is written to on-chip memory and must be written back to external memory at some
later point in time by the memory access thread.
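The two requirements above can be stated as a simple check on a candidate schedule. The sketch below is an illustrative checker only and is not the constraint derivation developed later in this chapter. The record layout, the function names and the reading of "earlier" and "later" as strict inequalities on integer cycle stamps are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* One datum: the cycle at which it crosses the external memory interface
 * (mem_time) and the cycle at which the datapath touches it (dp_time). */
typedef struct { int mem_time; int dp_time; } transfer_t;

/* Every datum must be read from external memory strictly before the
 * datapath consumes it. Returns 1 if this holds for all n records. */
int reads_are_safe(const transfer_t *reads, size_t n)
{
    assert(reads != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (reads[i].mem_time >= reads[i].dp_time)
            return 0;
    return 1;
}

/* Every datum must be written back to external memory strictly after
 * the datapath produces it. Returns 1 if this holds for all n records. */
int writes_are_safe(const transfer_t *writes, size_t n)
{
    assert(writes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (writes[i].mem_time <= writes[i].dp_time)
            return 0;
    return 1;
}
```

A schedule that serialises all reads, then all datapath operations, then all writes trivially passes both checks; the interest in this chapter lies in overlapped schedules that still pass.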
In the previous chapter, we restricted our scheduling of operations in the datapath and
memory access threads in two ways to meet these requirements:
1. We specified that operations could not occur concurrently in both memory access
and datapath threads.
2. We partitioned the operations in the datapath thread using a parameter t and en-
forced the requirement that all the read operations required for a specific outer loop
tiling are completed before any datapath operations on that tile are performed and
all write operations to external memory occur after all datapath operations have
completed.
While these conditions are suﬃcient to ensure correct execution, more eﬃcient oper-
ation can be achieved when operations in the memory access and datapath threads are
allowed to occur concurrently. Overlapping the memory access and datapath operations
allows us to reduce the total time needed to complete execution of a nested loop kernel.
In this chapter, we develop a methodology which allows us to derive the constraints
which must be satisfied to ensure correct concurrent operations in the memory access and
datapath threads. The methodology uses the integer point counting theory introduced in
Chapter 3. The scheduling constraints between the memory access and datapath threads
take the form of piecewise quasi-polynomials which must be positive to ensure correct
program semantics. A bounding procedure based on Bernstein polynomial decomposi-
tion finds parameters under which this condition is met. Using these parameters we can
automatically derive an implementation which overlaps operations safely. The results
from the application of our proposed procedure on an example code show that concur-
rent operations in memory access and datapath threads can reduce overall runtime by up
to 33%.
A second key outcome which naturally arises through the methodology in this chapter
is the ability to determine, at compile time, the exact execution time of a computational
task. We demonstrate how this can be achieved in Section 5.7, incorporating both the
memory access delays, and the delays required to ensure safe concurrent operations in
memory access and datapath threads. This technique is likely to be very valuable in early-
stage design-space exploration and especially, in the context of a high-level synthesis flow
for real-time systems, for ensuring that candidate designs achieve suﬃcient performance
to meet periodic task completion deadlines.
A discussion of the technique follows. We begin in Section 5.2 with an example to show
the benefits of concurrent operations in the memory access and datapath threads. This is
followed by a formal description of the problem in Section 5.3, and the development of
scheduling constraints in Section 5.4 to express the conditions under which a concurrent
schedule correctly preserves the original program semantics. In Section 5.6, we describe
a method for finding an optimal solution satisfying the scheduling constraints and show
that this can be used to automatically construct output code.
We show results for an implementation of our technique in Section 5.8 and in Section 5.9
provide a discussion of those results, some shortcomings and possible solutions to those
problems and how future work might develop these scheduling techniques as an effective
optimization framework for memory subsystems.
5.2. Motivational Example
In Figure 5.1, we show source code for a three-level nested loop. This is an extended
version of the example used in Section 4.5.1 of the previous chapter, which only demon-
strated ‘read’ operations. For completeness, the extended version includes both ‘read’ and
‘write’ operations. Within the innermost loop, the array RA[] is read using the addressing
function f1(x) = 7 ∗ x1 + 8 ∗ x2 + 9 ∗ x3 and the array RB[] is written using the addressing
function f2(x) = 3 ∗ x1 + 2 ∗ x2 − 4. The call to the function ‘func()’ shown in the innermost
loop could be any datapath function.
char RA[56];
char RB[6];
for (x1 = 0; x1 <= 2; x1++) {
    for (x2 = 2 - x1; x2 <= 2; x2++) {
        for (x3 = x1; x3 <= x2; x3++) {
            RB[3 * x1 + 2 * x2 - 4] = func(RA[7 * x1 + 8 * x2 + 9 * x3]);
        }
    }
}
Figure 5.1.: C source code for 3-level nested loop example.
Here we apply the technique first introduced in Chapter 4 and specifically Section 4.5.
To recap, this means:
• Forming an augmented polytope for each memory operation, which includes additional
dimensions r and u which indicate the row and burst accessed at each polytope
iteration.
• Finding unimodular matrices which allow the elimination of dimensions from each
augmented polytope description where Williams’ conditions [94] for safe elimination
of integer variables are met.
• Applying the selected unimodular matrices and eliminating the appropriate dimensions
from each polytope description formed.
• Applying a loop transformation to ‘hoist’ the row dimension (r) in each memory
access polytope to the outermost loop level inside an outer loop tile¹.
If we apply these steps, using a t = 1 parameterisation, and assume as before that rows
are 16 bytes in length and bursts contain 4 bytes, we obtain the memory access thread
code shown in Figure 5.2. Unimodular matrices U1 and U2 are used to transform the read
accesses to array RA[] and write accesses to array RB[] respectively.
The derived code for the memory access thread shown in Figure 5.2 contains ‘burst read’
operations formed by the application of the unimodular matrix U1 such that U1z = x
and the elimination of variables (z1) and (z2). The elimination of variables in this way will
not create ‘holes’ (i.e. iterations which have no integer image in the original source code)
because the transformed loop bounds matrix (A1U1z ≤ b1) meets Williams’ conditions [94]
given in Section 4.5.3 of Chapter 4 for the safe elimination of integer variables. The ‘burst
write’ operations shown in the derived code for the memory access thread in Figure 5.2
are enclosed by a loop nest formed by the application of U2 such that U2z = x and the
elimination of variable (z3).
¹ The outer loop tiles are defined by the parameter t.
/* note: floord(x, y) = x / y rounded to -infinity */
for (r = 1; r <= 3; r++) {
    row_delay();
    for (z3 = max(-6, -2 * r - 1); z3 <= floord(-16 * r + 6, 7); z3++) {
        u = -4 * r - 2 * z3;
        burst_read(r, u);
    }
}
/* Turnaround delay allows memory bus transition from
   read to write operations */
turnaround_delay(Trw);
/* Row delay allows time for SDRAM row to be opened */
row_delay();
for (z1 = -2; z1 <= 0; z1++) {
    for (z2 = -z1 - 2; z2 <= min(0, -2 * z1 - 2); z2++) {
        u = floord(-5 * z1 - 2 * z2 + 4, 4);
        burst_write(3, u);
    }
}
Figure 5.2.: C source code for transformed memory access thread.
The derived memory access thread code in Figure 5.2 also shows ‘row delay’ statements.
These model the delay required to issue ‘precharge’ and ‘activate’ commands and
change the active SDRAM row. The specific memory operations generated by the memory
access thread with these parameters are summarized in Table 5.1. Assuming a row-swap
penalty of Tr = 10 cycles, and assuming ‘burst read’ and ‘burst write’ operations take
Tu = 2 cycles, it will take 40 cycles to complete the read operations to external memory
and 18 cycles to complete the write operations.
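The start cycles listed in Table 5.1 can be reproduced by replaying the loop nest of Figure 5.2 while accumulating time. The following is a minimal sketch under stated assumptions: the names `burst_op`, `simulate_memory_thread`, `T_ROW` and `T_BURST` are hypothetical helpers introduced here, each ‘row delay’ is assumed to cost Tr = 10 cycles and each burst Tu = 2 cycles, and operations are assumed to issue back to back.

```c
#include <assert.h>

#define T_ROW   10  /* Tr: row-swap penalty in cycles (value from the text) */
#define T_BURST  2  /* Tu: burst duration in cycles (value from the text)   */
#define MAX_OPS 16  /* capacity of the caller's array (hypothetical)        */

/* One burst issued by the memory access thread. */
struct burst_op {
    int start;     /* cycle in which the burst begins      */
    int row;       /* SDRAM row r                          */
    int burst;     /* burst index u within the row         */
    int is_write;  /* 0 = burst read, 1 = burst write      */
};

/* Integer division rounded towards minus infinity (y must be positive). */
static int floord(int x, int y)
{
    assert(y > 0);
    int q = x / y;
    if (x % y != 0 && x < 0)
        q--;
    return q;
}

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Replays the loop nest of Figure 5.2 and records the start cycle of each
 * burst. Returns the number of bursts stored in ops[]. */
static int simulate_memory_thread(struct burst_op ops[MAX_OPS], int t_rw)
{
    int n = 0;
    int now = 0;

    for (int r = 1; r <= 3; r++) {
        now += T_ROW;                               /* row_delay() */
        for (int z3 = max_int(-6, -2 * r - 1);
             z3 <= floord(-16 * r + 6, 7); z3++) {
            assert(n < MAX_OPS);
            ops[n++] = (struct burst_op){ now, r, -4 * r - 2 * z3, 0 };
            now += T_BURST;
        }
    }
    now += t_rw;                                    /* turnaround_delay(Trw) */
    now += T_ROW;                                   /* row_delay() */
    for (int z1 = -2; z1 <= 0; z1++) {
        for (int z2 = -z1 - 2; z2 <= min_int(0, -2 * z1 - 2); z2++) {
            assert(n < MAX_OPS);
            ops[n++] = (struct burst_op){
                now, 3, floord(-5 * z1 - 2 * z2 + 4, 4), 1 };
            now += T_BURST;
        }
    }
    return n;
}
```

Calling `simulate_memory_thread(ops, 0)` fills `ops` with nine bursts whose `start` fields can be compared against the Cycle column of Table 5.1.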
In Figure 5.3, we show the derived datapath code for the example code. Within the
/* Execution delay allows enough data to be fetched in the
   memory thread to safely overlap subsequent execution with
   memory fetch operations */
execution_delay(Te);
for (x1 = 0; x1 <= 2; x1++) {
    for (x2 = 2 - x1; x2 <= 2; x2++) {
        for (x3 = x1; x3 <= x2; x3++) {
            r = floord(7 * x1 + 8 * x2 + 9 * x3, 16);
            ur = 2 * x1 + 2 * x2 + 2 * x3 - 4 * r;
            uw = floord(3 * x1 + 2 * x2 + 4, 4);
            /* Datapath read and write operations access local memory */
            par {
                datapath_read(r, ur);
                datapath_write(3, uw);
            }
        }
    }
}
Figure 5.3.: C source code for transformed datapath thread.
inner loop of the datapath code, a ‘datapath read’ operation fetches data from on-chip
memory and a ‘datapath write’ operation stores data in on-chip memory. Table 5.2 shows
the datapath operations and the cycle in which they are scheduled.
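The row and burst coordinates used by the datapath thread can be checked by enumerating the iteration domain directly. The sketch below is illustrative only: `dp_access` and `enumerate_datapath` are hypothetical names, and the index expressions are copied from Figure 5.3 under the stated assumption of 16-byte rows and 4-byte bursts.

```c
#include <assert.h>

/* One datapath iteration with the local-buffer coordinates it touches. */
struct dp_access {
    int x1, x2, x3;  /* loop iterators                      */
    int r;           /* row read from                       */
    int ur;          /* burst read within row r             */
    int uw;          /* burst written within row 3          */
};

/* Integer division rounded towards minus infinity (y must be positive). */
static int floord(int x, int y)
{
    assert(y > 0);
    int q = x / y;
    if (x % y != 0 && x < 0)
        q--;
    return q;
}

/* Walks the loop nest in its original order and evaluates the index
 * expressions of Figure 5.3. Returns the number of iterations visited. */
static int enumerate_datapath(struct dp_access out[], int capacity)
{
    int n = 0;

    for (int x1 = 0; x1 <= 2; x1++) {
        for (int x2 = 2 - x1; x2 <= 2; x2++) {
            for (int x3 = x1; x3 <= x2; x3++) {
                int r  = floord(7 * x1 + 8 * x2 + 9 * x3, 16);
                int ur = 2 * x1 + 2 * x2 + 2 * x3 - 4 * r;
                int uw = floord(3 * x1 + 2 * x2 + 4, 4);
                assert(n < capacity);
                out[n++] = (struct dp_access){ x1, x2, x3, r, ur, uw };
            }
        }
    }
    return n;
}
```

With a buffer of at least seven entries, the returned records follow the iteration order of Table 5.2, one record per pair of ‘datapath read’ and ‘datapath write’ rows.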
The statements ‘turnaround delay’ and ‘execution delay’ have been introduced into
the memory access thread and datapath threads respectively. Together, these statements
implicitly synchronize the operations in the memory access thread with the datapath
thread. The parameter Te determines the length of the delay inserted into the datapath
thread to ensure that all data required in the datapath is fetched by the memory access
thread from external memory in advance of the cycle in which it is consumed. The
parameter Trw indicates the length of the delay inserted to ensure that write-operations
to external memory are delayed until after data is produced by the datapath thread.
Table 5.1.: Sequence of memory accesses generated by memory access thread in Figure 5.2.
Order  Cycle  z1  z2  z3  Row (r)  Burst (u)  Command
1      10     -   -   -3  1        2          ‘burst read’
2      12     -   -   -2  1        0          ‘burst read’
3      24     -   -   -5  2        2          ‘burst read’
4      26     -   -   -4  2        0          ‘burst read’
5      38     -   -   -6  3        0          ‘burst read’
6      50     -2  0   -   3        3          ‘burst write’
7      52     -1  -1  -   3        2          ‘burst write’
8      54     -1  0   -   3        2          ‘burst write’
9      56     0   -2  -   3        2          ‘burst write’
An optimal legal schedule for simultaneous operations in the datapath and memory
access threads inserts a 35 cycle delay in the datapath schedule before execution begins
(i.e. Te = 35) and requires no ‘turnaround delay’ statement in the memory access thread
(i.e. Trw = 0).
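One way to sanity-check a candidate pair (Te, Trw) is to replay both threads and test every dependency directly. The sketch below is a brute-force check, not the analytical method developed in this chapter. Its conventions are assumptions made here for illustration: the datapath executes one iteration per cycle starting at cycle Te, a read burst must have completed strictly before the cycle in which its data is consumed, and a write burst may only start strictly after the last datapath write into that burst. The name `schedule_is_legal` is hypothetical.

```c
#include <assert.h>

#define T_ROW    10   /* Tr, from the text                           */
#define T_BURST   2   /* Tu, from the text                           */
#define ROWS      4   /* rows 0..3 are enough for the example        */
#define BURSTS    4   /* a 16-byte row holds four 4-byte bursts      */
#define NEVER   (-1)  /* marks a burst that was never transferred    */

/* Integer division rounded towards minus infinity (y must be positive). */
static int floord(int x, int y)
{
    assert(y > 0);
    int q = x / y;
    if (x % y != 0 && x < 0)
        q--;
    return q;
}

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Returns 1 when the delays (te, trw) satisfy every read and write
 * dependency of the example under the assumed conventions, else 0. */
static int schedule_is_legal(int te, int trw)
{
    int read_done[ROWS][BURSTS];  /* completion cycle of each read burst  */
    int last_store[BURSTS];       /* last datapath write to row 3, burst u */
    int now = 0;

    for (int r = 0; r < ROWS; r++)
        for (int u = 0; u < BURSTS; u++)
            read_done[r][u] = NEVER;
    for (int u = 0; u < BURSTS; u++)
        last_store[u] = NEVER;

    /* Read phase of the memory access thread (Figure 5.2). */
    for (int r = 1; r <= 3; r++) {
        now += T_ROW;
        for (int z3 = max_int(-6, -2 * r - 1);
             z3 <= floord(-16 * r + 6, 7); z3++) {
            now += T_BURST;
            read_done[r][-4 * r - 2 * z3] = now;
        }
    }

    /* Datapath thread (Figure 5.3), one iteration per cycle from te. */
    int cycle = te;
    for (int x1 = 0; x1 <= 2; x1++)
        for (int x2 = 2 - x1; x2 <= 2; x2++)
            for (int x3 = x1; x3 <= x2; x3++) {
                int r  = floord(7 * x1 + 8 * x2 + 9 * x3, 16);
                int ur = 2 * x1 + 2 * x2 + 2 * x3 - 4 * r;
                int uw = floord(3 * x1 + 2 * x2 + 4, 4);
                /* The burst must have completed before it is consumed. */
                if (read_done[r][ur] == NEVER || read_done[r][ur] >= cycle)
                    return 0;
                last_store[uw] = cycle;
                cycle++;
            }

    /* Write phase of the memory access thread. */
    now += trw + T_ROW;
    for (int z1 = -2; z1 <= 0; z1++)
        for (int z2 = -z1 - 2; z2 <= min_int(0, -2 * z1 - 2); z2++) {
            int u = floord(-5 * z1 - 2 * z2 + 4, 4);
            /* The data must already have been produced by the datapath. */
            if (last_store[u] == NEVER || last_store[u] >= now)
                return 0;
            now += T_BURST;
        }
    return 1;
}
```

Under these conventions `schedule_is_legal(35, 0)` accepts the delays quoted above. Because the check enumerates every operation, it only scales to small examples such as this one.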
Figure 5.4 shows the legal concurrent schedule for the operations in the two threads. In
this figure, time is depicted vertically down the page, with memory access ‘burst read’
operations shown in the left column, the datapath operations in the middle column and
memory access ‘burst write’ operations shown in the right column. Arrows indicate the
dependencies between the reading of data from external memory and its consumption
by the datapath thread. Arrows also indicate the dependencies formed between the
datapath operation in which data is written to on-chip memory and the memory access
thread operation in which it is written back to external memory. The figure shows that
the selected parameters Te = 35 and Trw = 0 correctly ensure all scheduling constraints
between the two threads are met.
The optimal scheduling of this code kernel completes in 58 cycles. An alternative
scheduling with no overlapping takes 65 cycles. In the context of this example, this is
a small relative diﬀerence in performance of only 10.8%, but this reflects the
challenging conditions in the memory access thread, where the command schedule is
dominated by the ‘row delay’ operations which model the necessary ‘precharge’ and
‘activate’ commands issued on the SDRAM interface.
Table 5.2.: Sequence of memory accesses generated by datapath thread in Figure 5.3.
Order  Cycle  x1  x2  x3  Row (r)  Burst (u)  Command
1      35     0   2   0   1        0          ‘datapath read’
2      35     0   2   0   3        2          ‘datapath write’
3      36     0   2   1   1        2          ‘datapath read’
4      36     0   2   1   3        2          ‘datapath write’
5      37     0   2   2   2        0          ‘datapath read’
6      37     0   2   2   3        2          ‘datapath write’
7      38     1   1   1   1        2          ‘datapath read’
8      38     1   1   1   3        2          ‘datapath write’
9      39     1   2   1   2        0          ‘datapath read’
10     39     1   2   1   3        2          ‘datapath write’
11     40     1   2   2   2        2          ‘datapath read’
12     40     1   2   2   3        2          ‘datapath write’
13     41     2   2   2   3        0          ‘datapath read’
14     41     2   2   2   3        3          ‘datapath write’
In this chapter, we describe a methodology for determining optimal values for Te and
Trw. We begin in Section 5.3 with a definition of new notation needed to formulate the
problem. Every datapath iteration x is associated with at least one memory ‘read’ or
‘write’ operation fj(x). In Section 5.4, we show how, using a description of the operations
in the datapath thread and a total ordering over those operations, we can form a
parameterised set (in parameter x) which only contains elements which occur before the memory
operation associated with x in the execution order.
Section 5.5 then shows that this parametric set can be transformed into a quasi-
polynomial function, whose domain is x and range is an integer value which is the
exact number of iterations that precede x in the ordering. An aﬃne combination of these
quasi-polynomials with constants describing the duration of burst operations (Tu) and
row-swap operations (Tr), can be used to determine the time-slack between an item’s
production and consumption in the two threads. In Section 5.6, we show how polyno-
mial bounding techniques can be used to find the minimum slack necessary for correct
operation, and therefore determine the constants Te and Trw. Having described our
analytical methodology, we derive results from a fully automatic CAD tool which implements
it. Section 5.8 shows these results for diﬀerent parameterisations of our example
code, demonstrating the performance benefit delivered by overlapping communication
operations with datapath execution.
Figure 5.4.: Concurrent operations in the memory access and datapath threads.
5.3. Problem Description
In this section, we give a description of the initial steps needed to represent the operations
in the datapath and memory access threads. The operations within each thread are
represented as a union of polytopes, each representing a single statement (j), of a specific type
of operation and fixed constant duration. Each polytope is associated with a function,
referred to as a ‘scattering function’, which defines a total ordering over the polytope
elements. The functions σ{comm, j}(z) define total orderings over polytopes defining
operations in the memory access thread and the functions σ{exec, j}(x) define total orderings for
polytopes defining operations in the datapath thread.
The scattering functions are designed such that when they are applied to their respective
polytopes, their ranges define a total ordering for all the operations in each thread. Over
the course of Section 5.3, we build towards the definition of a function t{comm, j}(x) which
defines the exact cycle in which operations in the memory access thread are scheduled.
We use these functions, along with equivalent functions t{exec, j}(x) defining the scheduling
of operations in the datapath thread, to define inequality constraints which represent the
dataflow between the memory access and datapath threads. As we build towards the
definition of t{comm, j}(x), we introduce a significant number of symbols. In Table 5.3, we
describe some common symbols and their intended meaning.
Our methodology in this section builds upon the methodology in Section 4.5 in Chapter 4
for exploiting data reuse and reordering of data-transfer operations in the memory
access thread. We begin in Section 5.3.1 with a recap of the essential steps from Section 4.5.
5.3.1. Initial Transformation
The memory access thread is derived from an original source code description of a
perfectly nested loop. Several distinct memory references may be accessed within the
innermost loop of the target loop nest. In Chapter 4, memory references were
characterised as aﬃne functions (fx + h) of the loop variable vector x. In this chapter we add
subscripts to the memory references to help distinguish each reference. We denote the
total number of memory reference expressions in the original program as m, with d
references reading data from external memory and w references writing data back to memory.
Hence each memory reference takes the form fj x + hj with 1 ≤ j ≤ m. The subscript
indices j are assigned to the memory references ensuring that the read operations take
indices 1 ≤ j ≤ d and write operations are assigned indices such that d + 1 ≤ j ≤ m.
Assuming the same parameterisation t from Section 4.2 in the preceding chapter, we
adopt the methodology in Section 4.5 of Chapter 4 to derive a memory access thread which
exploits data-reuse and command-reordering. To recap, this means that we augment
the original source polytope description with dimensions representing SDRAM rows (r)
and bursts (u) as specified by a particular memory reference fj x + hj. This gives a set
whose elements represent the SDRAM rows (r) and bursts (u) accessed in the original
computation. In this chapter, for consistency, such sets carry a subscript j, i.e.
the set Sj = { x | x ∈ Z^(n+2), Ajx ≤ bj } denotes the augmented polytope, whose dimensions
Table 5.3.: Table of Symbols used in this Chapter.
Symbol          Description
n               Number of loop levels in original source code.
m               Number of memory references in original source code.
d               Number of ‘Read’ references in original source code.
w               Number of ‘Write’ references in original source code.
x               Datapath thread iteration vector.
z               Memory access thread iteration vector.
Aj              Augmented matrix describing bounds of original polytope and
                row and column accesses for memory reference j.
bj              Constant oﬀset of augmented iteration space (Ajx ≤ bj) for
                specific memory reference j.
fj              Vector describing variable part of memory addressing function j.
hj              Constant address oﬀset associated with memory reference j.
Uj              Unimodular matrix enabling variable elimination for memory
                reference j.
α{n, j}         Row n of unimodular matrix Uj.
Ej              Set of polytope dimensions eliminated to form S{comm, j}.
S{comm, j}      Set of memory operations in memory access thread associated
                with memory reference j (could be ‘read burst’, ‘write burst’ or
                ‘row delay’ operations).
S{exec, j}      Set of memory operations to on-chip memory in datapath thread
                associated with memory reference j (could be ‘datapath read’ or
                ‘datapath write’ operations).
t{exec, j}(x)   Time (in cycles relative to beginning of outer loop iteration) at
                which data for memory reference j for datapath iteration x is
                produced or consumed in ‘datapath’ thread.
t{comm, j}(x)   Time (in cycles relative to beginning of outer loop iteration) at
                which data for memory reference j for datapath iteration x is
                produced or consumed in ‘memory access’ thread.
σ{comm, j}      Scattering relation giving partial ordering over operations in
                ‘memory access’ thread for memory reference j.
σ{exec, j}      Scattering relation giving partial ordering over operations in
                ‘datapath’ thread for memory reference j.
represent both original loop iterators and the SDRAM rows (r) and bursts (u) read or
written by a specific memory reference j from the original code.
As in Chapter 4, we apply a unimodular matrix (Uj) to transform the set Sj as in (5.1).
This allows some subset (Ej) of the dimensions in z to be eliminated using Fourier-Motzkin
Elimination [46]. The conditions in [94] and method in Section 4.5.3 ensure that no holes
are created when the dimensions in Ej are projected out from the set Sj.
\[ U_j z = x, \quad A_j U_j z \le b_j \tag{5.1} \]
The set of memory operations resulting from Fourier-Motzkin Elimination
of the dimensions in Ej from the set Sj after applying Uj can be represented as the set
S{comm, j} as in (5.2).
\[ S_{\{comm,j\}} = \{\, z \mid D_j z \le b'_j \,\} \tag{5.2} \]
The Fourier-Motzkin projection of the variables in Ej leaves the set S{comm, j} unbounded
in some dimensions. For ease of exposition later in this chapter, we intersect the
unbounded set of memory operations with planes constraining the eliminated dimensions
(ei ∈ Ej) to zero.
The application of this procedure to each memory reference j gives polytopes whose
union is the set of operations which must be scheduled in the memory access thread. In
Section 5.3.2, we formalize the scattering functions σ{comm, j}(z) which define the order in
which those memory operations are scheduled.
5.3.2. Scattering Functions for Sequential Operation Ordering
In the previous chapter, we derived an ordering which ensured eﬃcient use of the external
SDRAM interface. The timing of commands issued on the external memory interface was
managed by the SDRAM controller and was determined deterministically by the order in which
commands were issued to the controller. In the previous chapter, we relied on an SDRAM
controller to issue the necessary ‘precharge’ and ‘activate’ commands in our optimized
memory schedule. In this section we wish to explicitly model the delays inserted by
the SDRAM controller to enforce the timing constraints associated with ‘precharge’ and
‘activate’ commands.
To do this, for each memory reference j, we introduce a set of additional ‘row delay’
operations. These are represented by the integer points enclosed within the polytope
S{comm, j+m}. The polytope S{comm, j+m} is derived from S{comm, j}. To create the set S{comm, j+m},
we need a more formal description of the order in which the statements of S{comm, j} are
scheduled to determine when the row variable r is incremented. The scattering function
σ{comm, j}(z) provides this.
The method in Section 4.5 of the preceding chapter eliminated the set of dimensions
indexed by Ej to form S{comm, j}. The scattering function σ{comm, j}(z) determines the ordering
of the remaining uneliminated variables. As in Section 4.5, scattering functions in the
memory access thread are used to represent a loop transformation hoisting the r variable
to the outermost loop level. This memory access reordering transformation improves
performance by grouping together accesses to the same SDRAM row, thus reducing the
number of ‘precharge’ and ‘activate’ commands which must be issued to change the active
SDRAM row.
Application of the scattering functions σ{comm, j}(z) to their respective polytopes gives a
total ordering over all operations in the memory access thread and can be used to auto-
matically reconstruct a loop nest which enumerates memory operations in the required
order using mature tools, for example, Cloog [93].
The scattering function is a linear function σ{comm, j}(z) : Z^(n+2) → Z^s mapping each
vector z to an s-dimensional scattering vector. A lexicographic ordering over the s scattering
dimensions then determines the overall operation sequence. For each memory reference j
in the memory access thread, the scattering functions can be expressed as in (5.3), where
P{comm, j} is an (n + 2) × (n + 2) permutation matrix, C is an s × (n + 2) matrix and c{comm,j} is
a constant column vector.
\[ \sigma_{\{comm,j\}}(z) = C P_{\{comm,j\}} z + c_{\{comm,j\}} \tag{5.3} \]
The scattering function in the datapath thread does not reorder the loops from the original
program, and therefore can be given as in (5.4) where I is the (n + 2) × (n + 2) identity matrix.
\[ \sigma_{\{exec,j\}}(x) = C I x + c_{\{exec,j\}} \tag{5.4} \]
The combination of the matrix C, which may contain empty columns, and the column
vectors c{comm,j} or c{exec,j}, which are specific to each memory reference j, means that
some dimensions of the s-dimensional scattering vector can be set to constant values.
In practical terms, this means we can define an unambiguous total order over
all the operations in the polytopes S{exec, j} associated with the datapath thread and an
unambiguous total order over the operations in the polytopes S{comm, j} associated
with the memory access thread.
A predicate precedesj(i, i′) can be defined to show whether, after permutation (which
represents loop-reordering), a loop dimension is hoisted to a higher level of the loop nest.
If ≺ defines a lexicographic comparison between two vectors and yi is defined as a column
vector with all elements zero except the ith row, which is one, the relationship is defined
as in (5.5).
\[ \mathrm{precedes}_j(i, i') \iff P_j y_i \prec P_j y_{i'} \tag{5.5} \]
This predicate is used in the later sections of this chapter to help simplify notation. In
the next section, we show how we can use the scattering functions σ{comm, j}(z), along
with S{comm, j} which represents a set of memory operations, to describe a set S{comm, j+m}
whose elements represent the ‘row delay’ operations which model the ‘precharge’ and ‘activate’
commands.
5.3.3. Representing Row Activation Delays as Additional Statements
In this section, we define a set S{comm, j+m} for each memory reference j, which represents
the ‘row delay’ operations associated with that reference.
The new polytope S{comm, j+m} contains integer points representing each new row
opened in the memory access thread. The row variable r is the (n + 1)th dimension
in the polytope S{comm, j+m}. We obtain the polytope describing the ‘row delay’ operations
for the jth memory reference, S{comm, j+m}, by Fourier-Motzkin elimination of all dimensions
i from S{comm, j} for which the predicate precedesj(i, n + 1) is false.
For our specific example with a read operation ( j = 1) and a write operation ( j = 2),
we give the specific steps for the transformation below. The application of the method
described in Chapter 4 to our specific example in Section 5.3.1 extracted a polytope
description from original source code and augmented it with dimensions representing
the rows r and bursts u accessed by each memory reference j.
For the j = 1 reference in the original source code in Figure 5.1, which represents reads from the
array RA[], we apply the unimodular matrix U1 given in (5.6) to transform the original
source code, allowing subsequent safe elimination of the z1 and z2 dimensions without
creating ‘holes’. This gives the polytope S{comm,1} = { z | D1z ≤ b′1 } which represents the
‘burst read’ operations in the memory access thread (j = 1). These ‘burst read’ operations
are enumerated in Table 5.1.
\[ U_1 = \begin{pmatrix}
 1 &  0 &  0 & 0 & 0 \\
-1 &  1 &  0 & 0 & 0 \\
 0 & -1 & -1 & 0 & 0 \\
 0 &  0 &  0 & 1 & 0 \\
 0 &  0 &  0 & 0 & 1
\end{pmatrix} \tag{5.6} \]
The application of the unimodular matrix U2 given in (5.7) to the j = 2 write access in the
original source code, which represents data written to the array RB[], and subsequent
elimination of the z3 dimension gives S{comm,2} = { z | D2z ≤ b′2 } representing the ‘burst write’
operations in the memory access thread (j = 2). These ‘burst write’ operations are also
enumerated in Table 5.1.
\[ U_2 = \begin{pmatrix}
-1 &  0 & 0 & 0 & 0 \\
 1 & -1 & 0 & 0 & 0 \\
-8 &  8 & 1 & 0 & 0 \\
 0 &  0 & 0 & 1 & 0 \\
 0 &  0 & 0 & 0 & 1
\end{pmatrix} \tag{5.7} \]
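A matrix with integer entries is unimodular when its determinant is +1 or −1, so the matrices printed in (5.6) and (5.7) can be checked mechanically. The following sketch uses a plain Laplace expansion, which is adequate for 5 × 5 matrices but is not how a polyhedral library would do it; the names `det`, `is_unimodular`, `U1` and `U2` are introduced here for illustration, and the entries are copied from the two equations as printed.

```c
#include <assert.h>

#define N 5  /* n + 2 dimensions for the n = 3 example */

/* Entries copied from (5.6) and (5.7). */
static long U1[N][N] = {
    {  1,  0,  0, 0, 0 },
    { -1,  1,  0, 0, 0 },
    {  0, -1, -1, 0, 0 },
    {  0,  0,  0, 1, 0 },
    {  0,  0,  0, 0, 1 },
};
static long U2[N][N] = {
    { -1,  0, 0, 0, 0 },
    {  1, -1, 0, 0, 0 },
    { -8,  8, 1, 0, 0 },
    {  0,  0, 0, 1, 0 },
    {  0,  0, 0, 0, 1 },
};

/* Determinant of the leading n x n block of m, by Laplace expansion
 * along the first row. Exponential cost, so only for tiny matrices. */
static long det(int n, long m[N][N])
{
    assert(n >= 1 && n <= N);
    if (n == 1)
        return m[0][0];

    long total = 0;
    long sign = 1;
    for (int col = 0; col < n; col++) {
        long minor[N][N] = {{ 0 }};
        for (int i = 1; i < n; i++) {
            int c = 0;
            for (int j = 0; j < n; j++)
                if (j != col)
                    minor[i - 1][c++] = m[i][j];
        }
        total += sign * m[0][col] * det(n - 1, minor);
        sign = -sign;
    }
    return total;
}

/* An integer matrix is unimodular when its determinant is +1 or -1. */
static int is_unimodular(long m[N][N])
{
    long d = det(N, m);
    return d == 1 || d == -1;
}
```

Both `is_unimodular(U1)` and `is_unimodular(U2)` should hold for the matrices as printed, which means each transformation maps integer points to integer points in both directions.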
The permutation matrices P1 and P2 in (5.8) are used to hoist the r row dimension to
the outermost loop level.
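\[ P_1 = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0
\end{pmatrix}, \quad
P_2 = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0
\end{pmatrix} \tag{5.8} \]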
This means that using the C matrix in (5.9) and the vector c{comm,j} = (2j + 1, 0, 0, 0, 0, 0)^T
for j = 1 and j = 2, we can define the scattering functions σ{comm, j}(z) and obtain, using
Cloog [93], the code shown in Figure 5.5. An appropriate constant in c{comm,j} and
c{comm,j+m} ensures that the row activation statement is always scheduled before data is
used from that row.
\[ C = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix} \tag{5.9} \]
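The effect of (5.3) with the matrices above can be sketched in a few lines: because C is a zero row stacked on an identity matrix, the scattering vector is the statement constant followed by the permuted iteration vector, and operations are then ordered lexicographically. The code below is an illustrative sketch rather than the code generator's implementation; `scatter`, `lex_compare` and `P2_PERM` are hypothetical names, and `P2_PERM` encodes P2 from (5.8) as the column index of the 1 in each row.

```c
#include <assert.h>

#define DIMS    5   /* n + 2 with n = 3              */
#define SCATTER 6   /* s, the scattering dimension   */

/* Row k of P2 in (5.8) has its single 1 in column P2_PERM[k] (0-based). */
static const int P2_PERM[DIMS] = { 3, 0, 1, 4, 2 };

/* Evaluates sigma(z) = C P z + c for C as in (5.9) and a constant vector
 * c = (c0, 0, 0, 0, 0, 0)^T. */
static void scatter(const int perm[DIMS], int c0,
                    const int z[DIMS], int out[SCATTER])
{
    out[0] = c0;                  /* zero row of C plus the constant   */
    for (int k = 0; k < DIMS; k++)
        out[k + 1] = z[perm[k]];  /* identity rows of C applied to P z */
}

/* Lexicographic comparison: negative if a precedes b, positive if b
 * precedes a, zero if the vectors are equal. */
static int lex_compare(const int a[SCATTER], const int b[SCATTER])
{
    for (int k = 0; k < SCATTER; k++) {
        if (a[k] < b[k])
            return -1;
        if (a[k] > b[k])
            return 1;
    }
    return 0;
}
```

For example, taking z = (z1, z2, z3, r, u), the write operations (−2, 0, 0, 3, 3) and (−1, −1, 0, 3, 2) with constant 2j + 1 = 5 compare in the same order as rows 6 and 7 of Table 5.1, and any operation carrying the smaller constant 3 is ordered before both.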
The permutation matrices mean that the predicate precedes1(i, n + 1) is only true when
i = 4, which means we can form statement S{comm, m+j} from S{comm, j} through
Fourier-Motzkin elimination of all dimensions in z except z4, giving (5.10).
for (r = 1; r <= 3; r++) {
    for (z3 = max(-6, -2 * r - 1); z3 <= floord(-16 * r + 6, 7); z3++) {
        u = -4 * r - 2 * z3;
        burst_read(r, u);
    }
}
for (z1 = -2; z1 <= 0; z1++) {
    for (z2 = -z1 - 2; z2 <= min(0, -2 * z1 - 2); z2++) {
        u = floord(-5 * z1 - 2 * z2 + 4, 4);
        burst_write(3, u);
    }
}








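\[ \begin{pmatrix}
 1 &  0 &  0 &  0 & 0 \\
-1 &  0 &  0 &  0 & 0 \\
 0 &  1 &  0 &  0 & 0 \\
 0 & -1 &  0 &  0 & 0 \\
 0 &  0 &  1 &  0 & 0 \\
 0 &  0 & -1 &  0 & 0 \\
 0 &  0 &  0 &  1 & 0 \\
 0 &  0 &  0 & -1 & 0 \\
 0 &  0 &  0 &  0 & 1
\end{pmatrix} \]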
When the predicate precedes2(i,n + 1) is evaluated, we find we can eliminate all dimen-
sions in z except z4 giving (5.11).
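\[ \begin{pmatrix}
 1 &  0 &  0 &  0 & 0 \\
-1 &  0 &  0 &  0 & 0 \\
 0 &  1 &  0 &  0 & 0 \\
 0 & -1 &  0 &  0 & 0 \\
 0 &  0 &  1 &  0 & 0 \\
 0 &  0 & -1 &  0 & 0 \\
 0 &  0 &  0 &  1 & 0 \\
 0 &  0 &  0 & -1 & 0 \\
 0 &  0 &  0 &  0 & 1
\end{pmatrix} \]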
for (r = 1; r <= 3; r++) {
    row_delay();
    for (z3 = max(-6, -2 * r - 1); z3 <= floord(-16 * r + 6, 7); z3++) {
        u = -4 * r - 2 * z3;
        burst_read(r, u);
    }
}
row_delay();
for (z1 = -2; z1 <= 0; z1++) {
    for (z2 = -z1 - 2; z2 <= min(0, -2 * z1 - 2); z2++) {
        u = floord(-5 * z1 - 2 * z2 + 4, 4);
        burst_write(3, u);
    }
}
Figure 5.6.: C source code for transformed memory access thread after ‘row delay’ state-
ments are inserted.
If we select permutation matrices P3 = P1 and P4 = P2 and the constant vectors
c{comm,j} = (j, 0, 0, 0, 0, 0)^T for j = 2 and j = 3, code generation using the tool Cloog [93]
gives Figure 5.6 with ‘row delay’ statements inserted in the correct places.
In Section 5.4 that follows, we give notation leading to an exact integer representation of
the cycle in which each ‘burst read’ and ‘burst write’ operation is scheduled. We give the
corresponding notation for scheduling each ‘datapath read’ and ‘datapath write’ operation
in the datapath thread. In Section 5.6.3, we apply the integer-point counting techniques
outlined in Chapter 3; the resulting expressions take the form of piece-wise quasi-polynomials.
These can be used to form constraints which must be met to ensure correct dataflow.
5.4. Defining Parametric Subsets of Memory Operations
In this section we provide a formal description of the conditions which must be met
to ensure timely delivery of data into the on-chip memory buﬀer before it is required
by the datapath thread. We refer to these as ‘Fetch Conditions’. These conditions are
complemented by a set of constraints which ensure operations which write-back data
to oﬀ-chip SDRAM from the on-chip buﬀer are only scheduled after the data has been
produced by operations in the datapath thread. These will be referred to as ‘Store
Conditions’. The result is a set of non-linear quasi-polynomial inequality constraints which represent
the ‘Fetch Conditions’ and ‘Store Conditions’ and define correct data-flow between the
two concurrent threads. The constraints are subsequently used in Section 5.6.3 to derive
optimal delay constants which guarantee safe scheduling of concurrent operations in the
communication and datapath threads.
In this section we build up a series of definitions describing diﬀerent parametric sets.
The end goal of this is to arrive at three diﬀerent set descriptions highlighted in Table 5.4.
Ω{comm, j}(x) is referred to as the Full Statement Enumeration. The parameter t from
Chapter 4 specifies the granularity at which memory operations are reordered. Recall
that each iteration x is made up of outer loop dimensions (x1, x2, …, xt−1) and inner loop
dimensions (xt, xt+1, …, xn). Ω{comm, j}(x) is a parametric set whose elements are all
memory operations (which might be ‘burst read’, ‘burst write’ or ‘row delay’ operations) of
memory reference j which are accessed during the current outer loop, specified as the
dimensions (x1, x2, …, xt−1) of parameter x.
Table 5.4.: Useful memory operation subsets.
Symbol           Name
Ω{comm, j}(x)    Full Statement Enumeration
ω<{comm, j}(x)   Lexicographic Partial Statement Enumeration for Memory Access Thread
ω<{exec, j}(x)   Lexicographic Partial Statement Enumeration for Datapath Thread
ω<{comm, j}(x) is the Lexicographic Partial Statement Enumeration. It is a subset of
Ω{comm, j}(x) which only contains memory operations which, after scheduling all oper-
ations with σ{comm, j}(x), occur before data is fetched for datapath iteration x.
In this section we define how the Full Statement Enumeration and Lexicographic Partial
Statement Enumeration can be built out of the union and intersection of other parametric
sets. The Full Statement Enumeration and Lexicographic Partial Statement Enumeration
sets are then used in Section 5.5 to find non-linear functions which specify the exact clock
cycle in which a memory operation is scheduled.







We define θ{comm, j}(x, p) as a parametric subset of Z^n whose elements are defined according
to equation (5.13).
\[ \theta_{\{comm,j\}}(x, p) = \{\, z \mid z \in \mathbb{Z}^n,\ \alpha_{p,j}\, x = z_p \,\} \tag{5.13} \]
This equation defines a parametric plane through Z^n. Similar relations θ<{comm, j}(x, p)
and θ≤{comm, j}(x, p) can be defined as in equations (5.14) and (5.15) respectively, which define
parametric half-plane subspaces of Z^n.
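\[ \theta^{<}_{\{comm,j\}}(x, p) = \{\, z \mid z \in \mathbb{Z}^n,\ \alpha_{p,j}\, x < z_p \,\} \tag{5.14} \]
\[ \theta^{\le}_{\{comm,j\}}(x, p) = \{\, z \mid z \in \mathbb{Z}^n,\ \alpha_{p,j}\, x \le z_p \,\} \tag{5.15} \]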
The Full Statement Enumeration can be formed from a combination of these parametric
sets. It is defined as an intersection of parametric plane inequalities as in (5.16). We refer to





\[ \Omega_{\{comm,j\}}(x) = \Bigl( \bigcap_{p=1}^{t-1} \theta_{\{comm,j\}}(x, p) \Bigr) \cap S_{\{comm,j\}} \tag{5.16} \]
We can use the integer point counting techniques in Chapter 3 to count the elements
in the Full Statement Enumeration Ω{comm, j}(x), obtaining #{comm, j}(x), a piece-wise quasi-
polynomial function whose domain is x and whose range gives the number of operations
within a specific outer loop iteration x. Depending on the value of j, these may be
‘burst read’, ‘burst write’ or ‘row delay’ operations.
ω<{comm, j}(x, i) is formed from an intersection of planes and half-planes and is used as a
building block to form the lexicographic enumeration relation ω<{comm, j}(x).
We define the set of loop dimensions, v ∈ Vj, as those dimensions t ≤ v ≤ n which are
not eliminated in Section 5.3.1 or Section 5.3.3 and define v′ ∈ V{i, j} as the subset of Vj
for which the predicate precedes{comm, j}(v, i) is true.
We define the partial enumeration, ω<{comm, j}(x, i) as in (5.17).








ω<{comm, j}(x) is the lexicographic partial enumeration of the loop reference j under param-
eterisation t. It is formed from a union of partial enumerations as in (5.18). It contains all
elements of S{comm, j} that are scheduled before the burst operation in which the data for
datapath iteration x is read from oﬀ-chip memory.
\[ \omega^{<}_{\{comm,j\}}(x) = \bigcup_{v \in V} \theta^{<}_{\{comm,j\}}(x, v) \cap \Omega_{\{comm,j\}}(x) \tag{5.18} \]
Applying the integer point counting techniques in Chapter 3 to count the elements in
ω<{comm, j}(x) gives #<{comm, j}(x), a piece-wise quasi-polynomial function whose domain is x
and whose range gives the number of burst or delay operations which occur in an outer
loop iteration before data is loaded for use in the datapath iteration x.
An identical relationship for the datapath thread can be derived. ω<{exec, j}(x) contains
the iterations which are scheduled lexicographically before iteration x.
In Section 5.5, we use the counting functions #{comm, j}(x) and #<{comm, j}(x) to determine
the exact cycle in which each data transfer in the memory access thread takes place.
5.5. Representing Exact Cycle Scheduling using Parametric Sets
In the preceding section, we described how we can build two parametric sets from a memory
reference description: the full statement enumeration and the partial lexicographic
enumeration set. Using integer counting techniques, a function giving the number of
elements in each set can be constructed. In this section we show how we can combine the
counting functions with constant parameters representing the duration of each statement.
In doing so, we derive a scheduling function for each memory reference, which gives the
exact clock cycle in which the operation begins (relative to the start of the current outer
loop tile).
Constants define the duration of each statement (in clock cycles). We may represent
the length of each statement as a constant, D{comm, j}. In our example from Section 5.1, we
selected 2 clock cycles as the representative length of the ‘read burst’ and ‘write burst’
operations (D{comm,0} = 2 and D{comm,1} = 2) and 10 clock cycles as the length of the ‘row delay’
operations (D{comm,2} = 10 and D{comm,3} = 10).
We may denote the total time period needed to transfer all the data for a specific read
or write reference (1 ≤ j ≤ m) and outer loop iteration (x) as in (5.19). This includes
the ‘row delay’ penalty for changing row and the time taken for each ‘read burst’ or
‘write burst’ scheduled in this outer loop.
\[
D_{\mathrm{comm},j}(x) =
\overbrace{D_{\mathrm{comm},j}}^{\text{Duration of `burst'}}\,
\underbrace{\#_{\mathrm{comm},j}(x)}_{\text{Number of `bursts'}}
+ \overbrace{D_{\mathrm{comm},m+j}}^{\text{Duration of `row delay'}}\,
\underbrace{\#_{\mathrm{comm},m+j}(x)}_{\text{Number of `row delays'}}
\qquad (5.19)
\]
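As a small worked illustration of (5.19), the sketch below multiplies operation counts by their durations. The default durations are the 2-cycle burst and 10-cycle row delay of the running example; the function name and the counts passed to it are hypothetical and are not taken from the thesis tool.

```python
def transfer_duration(n_bursts, n_row_delays, d_burst=2, d_row_delay=10):
    """Cycles spent on one memory reference in one outer loop iteration, following (5.19)."""
    return d_burst * n_bursts + d_row_delay * n_row_delays


# Hypothetical counts for one outer loop iteration: 4 bursts and 1 row change.
cycles = transfer_duration(n_bursts=4, n_row_delays=1)
print(cycles)  # 2 * 4 + 10 * 1 = 18
```

In the method described here the two counts are not fixed numbers but piece-wise quasi-polynomials in x, so the same expression is evaluated symbolically rather than per iteration.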
The specific time offset t{comm, j}(x) for some iteration x in a specific memory reference j
can be calculated as in (5.20).
\[
t_{\mathrm{comm},j}(x) =
\underbrace{D_{\mathrm{comm},j}\,\#^{<}_{\mathrm{comm},j}(x)}_{\text{Prior Bursts in } j}
+ D_{\mathrm{comm},j+m}\,\#^{<}_{\mathrm{comm},j+m}(x)
\]






The specific time offset of an iteration in the datapath thread may be derived as in
(5.21).
\[
t_{\mathrm{exec},j}(x) = \underbrace{D_{\mathrm{exec}}\,\#^{<}_{\mathrm{exec},j}(x)}_{\text{Prior Iterations}} \qquad (5.21)
\]
Together these timing functions allow us to construct inequalities which describe correct
dataflow between the communication and datapath threads. We can ensure that the
inequalities will hold by adding constants Te and Trw. If we wish to ensure correct execution
semantics, we form the read constraints as in (5.22)
\[
t_{\mathrm{comm},j}(x) < t_{\mathrm{exec},j}(x) + T_e \quad \forall x \in S_{\mathrm{exec},j},\; j = 1 \ldots d \qquad (5.22)
\]
and write constraints as in (5.23).
\[
t_{\mathrm{comm},j}(x) + T_{rw} > t_{\mathrm{exec},j}(x) \quad \forall x \in S_{\mathrm{exec},j},\; j = d + 1 \ldots d + w \qquad (5.23)
\]
If we can form these equations and prove that they hold for all values of x in the
datapath thread, then it demonstrates a valid overlapping of datapath and execution
operations.
The minimum value of the constants Te and Trw provides optimal overlapping of
datapath and communication threads. In the section that follows, we describe how we
can use Bounding Techniques to determine optimal values.
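To make the role of the two constants concrete, the following sketch checks the read constraint (5.22) and the write constraint (5.23) point by point for one read and one write reference. The timing functions are toy stand-ins (assumptions for illustration only), not the quasi-polynomials derived in this chapter, and all names are hypothetical.

```python
def dataflow_is_legal(points, t_comm_read, t_comm_write, t_exec, T_e, T_rw):
    """Return True if (5.22) and (5.23) hold at every iteration point."""
    for x in points:
        if not (t_comm_read(x) < t_exec(x) + T_e):    # read constraint (5.22)
            return False
        if not (t_comm_write(x) + T_rw > t_exec(x)):  # write constraint (5.23)
            return False
    return True


# Toy model (assumed): a 2-cycle burst per iteration in the memory access
# thread and a 1-cycle datapath iteration, over eight iterations.
def t_read(x):
    return 2 * x


def t_write(x):
    return 2 * x


def t_exec(x):
    return x


iterations = range(8)
print(dataflow_is_legal(iterations, t_read, t_write, t_exec, T_e=8, T_rw=1))  # True
print(dataflow_is_legal(iterations, t_read, t_write, t_exec, T_e=7, T_rw=1))  # False
```

A point-by-point check like this costs time proportional to the number of iterations, which is the cost the bounding approach of the next section is intended to avoid.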
5.6. Finding Bounds for Quasi-Polynomial Expressions
In the preceding section (Section 5.5), we formulated piece-wise quasi-polynomial
inequalities for each of the j memory references within a loop nest. These represent the
‘Fetch Constraints’ and ‘Store Constraints’ which must be met to ensure correct concurrent
execution in a memory access thread and a datapath thread.
These inequalities must hold for all integer values of x ∈ S{exec, j} to ensure every
datapath iteration reads the correct data from external memory, and correctly stores
calculated values back into memory. The inequalities representing ‘Fetch Constraints’
and ‘Store Constraints’ take the form of piece-wise quasi-polynomial functions; that is,
quasi-polynomials defined on disjoint partitions of S{exec, j}.
Finding global bounds for the range of polynomial functions is an area of active
contemporary research, and new methods in the field of algebraic geometry have in recent years
substantially improved the scalability of what is known to be an NP-hard non-convex
optimization problem. Popular approaches include Interval Methods such as Skelboe-Moore [100]
and methods based on Sum-of-Squares decomposition [101, 102]. In this
section, we select a third possible method, Bernstein polynomial decomposition [1, 103].
In this section, we first provide a short introduction to how Bernstein Decomposition
can be used to find lower bounds on a polynomial over an interval, and demonstrate some
useful properties of the method over other bounding approaches. In Section 5.6.2, we
present a method from [104] which enables us to find polynomial bounds over a convex
polytope rather than an interval. For an explanation of how the technique can be further
extended, through approximations, to bound Quasi-Polynomial functions, we direct the
reader to [84].
5.6.1. Bernstein Decomposition over an Interval
Bernstein decomposition takes as input a d-order polynomial p(x) and finds a
representation of that polynomial as a linear combination of d + 1 Bernstein basis polynomials with



















The Bernstein basis polynomials have two interesting properties [1]:
1. The sum of the Bernstein basis polynomials is 1.
2. On the interval [0,1], 0 ≤ B^d_k(x) ≤ 1.
These properties together mean that the range of a polynomial p(x) is bounded by the
values of the coefficients bi in its Bernstein decomposition. This is illustrated in Figure 5.7
from [1] which shows a fifth-degree (d = 5) polynomial and the convex hull formed by
its Bernstein coefficients (bi). From this figure, two observations can be made. Firstly,
the minimum Bernstein coefficient (in this case b2) is a lower bound on the range of
the polynomial p(x) over the interval x ∈ [0, 1]. Secondly, the 0-th and d-th Bernstein
coefficients are values which lie in the range of the polynomial p(x) over x ∈ [0, 1]. This
second point means that if the smallest Bernstein coefficient is b0 or bd then the lower
bound it represents is known to be tight.
This is a useful property in evaluating the quality of our overall solution, since non-tight
bounds lead to memory access threads with unnecessary idle cycles.
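The lower-bound property can be tried out on a small univariate case. The sketch below assumes the standard Bernstein basis on [0, 1], B^d_k(x) = C(d,k) x^k (1 − x)^(d−k), and the standard conversion from power-basis coefficients a_i to Bernstein coefficients, b_k = Σ_{i≤k} C(k,i)/C(d,i) · a_i. These are textbook formulas rather than text recovered from this section, and the function names and the example polynomial are hypothetical.

```python
from fractions import Fraction
from math import comb


def bernstein_coefficients(a):
    """Convert power-basis coefficients a[0..d] of p(x) = sum a[i] * x**i
    into Bernstein coefficients b[0..d] on the interval [0, 1]."""
    d = len(a) - 1
    return [
        sum(Fraction(comb(k, i), comb(d, i)) * a[i] for i in range(k + 1))
        for k in range(d + 1)
    ]


def evaluate(a, x):
    """Evaluate the power-basis polynomial at x."""
    return sum(coeff * x**i for i, coeff in enumerate(a))


# Hypothetical example: p(x) = x^2 - x + 1/2, whose true minimum on [0, 1] is 1/4.
a = [Fraction(1, 2), -1, 1]
b = bernstein_coefficients(a)
print(b)       # Bernstein coefficients 1/2, 0, 1/2
print(min(b))  # 0, a valid but non-tight lower bound on p over [0, 1]
```

Here b0 and b2 equal p(0) and p(1), while the smallest coefficient is the interior one, so the bound of 0 is valid but lies below the true minimum of 1/4.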
Figure 5.7.: Fifth Degree Bernstein Polynomial from [1] showing convex hull property.
The computation of Bernstein coefficients for multivariate polynomials can be
performed using a variant of the ‘De Casteljau’ algorithm [105] which has been shown to




[106] where p is the polynomial order and
n is the number of different variables, in our case, the number of nested loops in our
original source code. While this would suggest very poor scaling of the procedure over
general problems, the number of loop levels in most practical loop-nests is small (< 8)
and in practical experiments, the polynomial order has also been shown to be small.
Bernstein polynomial decomposition provides two possible methods for trading off bounding
accuracy for scalability, either through subdivision of the space over which a polynomial
is expanded (which increases the tightness of bounds) or through degree-elevation
(increasing the degree of the Bernstein polynomial basis set increases the tightness of
bounds).
Having demonstrated the basic bounding approach over an interval, we describe a
method from [104] which extends the technique to find lower bounds over a convex
polytope. This means that we can evaluate the lower bound of each piece in our piecewise
quasi-polynomial representation of the ‘Fetch Conditions’ and ‘Store Conditions’.
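Degree elevation can be illustrated with the standard univariate rule b'_k = (k/(d+1)) · b_{k−1} + (1 − k/(d+1)) · b_k, which rewrites a degree-d Bernstein form in the degree-(d+1) basis without changing the polynomial. The sketch below is an assumption-labelled example of that textbook rule, not the multivariate implementation used in this work. Its starting coefficients 1/2, 0, 1/2 are the Bernstein coefficients on [0, 1] of the hypothetical polynomial p(x) = x² − x + 1/2, whose true minimum is 1/4.

```python
from fractions import Fraction


def elevate(b):
    """Raise the degree of a Bernstein form on [0, 1] by one.

    Each new coefficient is a convex combination of two neighbouring old
    coefficients, so the minimum coefficient can only move upwards.
    """
    n = len(b)  # new degree d + 1, where d = len(b) - 1
    out = []
    for k in range(n + 1):
        left = b[k - 1] if k >= 1 else 0
        right = b[k] if k <= n - 1 else 0
        out.append(Fraction(k, n) * left + Fraction(n - k, n) * right)
    return out


b2 = [Fraction(1, 2), Fraction(0), Fraction(1, 2)]  # degree 2, lower bound 0
b3 = elevate(b2)                                    # degree 3
print(min(b2), min(b3))  # the bound rises from 0 to 1/6, still below 1/4
```

The price of the tighter bound is one more coefficient per elevation in the univariate case, and correspondingly more in several variables.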
5.6.2. Bernstein Decomposition over a Convex Polytope
We can generalize the Bernstein decomposition over an interval to expansion over a
convex hull. An example of this expansion over triangles is referred to as the Bernstein-Bezier
form [105] and is widely used in computer graphics. Here we demonstrate a more
general form from [104] using expansion over a convex polytope.
We exploit the property that a convex polytope is the convex hull of its i vertices. This
property is shown in (5.25).
P =







If we wish to compute bounds on a multivariate polynomial p(x) over a polytope (x ∈ P),
we can substitute the expression in (5.26) into our polynomial and expand the resulting





We can then compute coefficients $b^d_k$ for $k = (k_1, k_2, \ldots, k_i)$, $0 \le k_i$,
$\sum k_i = d$, which form an
affine combination with generalized Bernstein base polynomials $B^d_k$. These terms are the
terms in the expansion
\[
1 = (\alpha_1 + \alpha_2 + \ldots + \alpha_n)^d. \qquad (5.27)
\]
This allows us to determine lower bounds for the polynomial p(x) over a convex polytope
rather than an interval. We can use this method to find a lower bound for our ‘Fetch
Conditions’ and ‘Store Conditions’. In our implementation, we use the ‘Barvinok’
library [84] and the Bernstein Bounding functions within it to find these lower bounds.
In Section 5.6.3 we use these lower bounds to construct communication and datapath
threads which allow safe concurrent operations.
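For a degree-2 polynomial the expansion over a polytope can be written out by hand, which gives a compact illustration. Assuming x = Σ α_i v_i with α_i ≥ 0 and Σ α_i = 1 (the usual convex-hull parameterisation over vertices v_i), a quadratic p(x) = xᵀQx + c·x + c0 becomes Σ_i M_ii α_i² + Σ_{i<j} M_ij (2 α_i α_j), where M_ij = (v_iᵀQv_j + v_jᵀQv_i)/2 + (c·v_i + c·v_j)/2 + c0. The terms α_i² and 2 α_i α_j are the degree-2 terms of (α_1 + … + α_n)², so the values M_ij play the role of the generalized Bernstein coefficients and their minimum is a lower bound on p over the polytope. The code is a sketch of this special case only; it does not use the Barvinok library, and its names and example are hypothetical.

```python
def quadratic_bernstein_bound(Q, c, c0, vertices):
    """Lower bound on p(x) = x^T Q x + c.x + c0 over the convex hull of
    `vertices`, taken as the minimum degree-2 generalized Bernstein coefficient."""

    def linear(v):
        return sum(ci * vi for ci, vi in zip(c, v))

    def bilinear(u, v):
        return sum(u[r] * Q[r][s] * v[s]
                   for r in range(len(u)) for s in range(len(v)))

    coefficients = []
    for i in range(len(vertices)):
        for j in range(i, len(vertices)):
            u, v = vertices[i], vertices[j]
            quad = (bilinear(u, v) + bilinear(v, u)) / 2  # symmetrised quadratic part
            coefficients.append(quad + (linear(u) + linear(v)) / 2 + c0)
    return min(coefficients)


# Hypothetical example: p(x, y) = x^2 + y^2 - x - y over the unit square.
Q = [[1, 0], [0, 1]]
c = [-1, -1]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(quadratic_bernstein_bound(Q, c, 0, square))  # -1.0; the true minimum is -0.5
```

The bound of −1 is valid but loose here because the smallest coefficients belong to the diagonal vertex pairs rather than to a single vertex.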
5.6.3. Overlapping Data Transfer and Computation
In Section 5.5, we showed a representation of the ‘Fetch Conditions’ and ‘Store Conditions’
necessary for concurrent operations in a memory access thread and datapath thread as
Quasi-Polynomial inequalities. In Section 5.6, we then showed a method for finding
lower bounds on those inequalities.
A ‘Fetch Condition’ requires that memory items are loaded in the memory access thread
before they are consumed by the datapath thread. For each read request (1 ≤ j ≤ d), a lower
bound determined in Section 5.6 ensures the ‘Fetch Condition’ holds for all x ∈ S{comm, j}.
In this case, we can insert a delay in the datapath thread to ensure the ‘Fetch Conditions’
are met for all read references 1 ≤ j ≤ d and all x ∈ S{exec, j}. The duration of the delay
is determined as in (5.28), where lbj is the lower bound on the j’th quasi-polynomial.
\[
T_e = -\min_{j} v \quad \text{s.t.} \quad v = lb_j,\; j = 1 \ldots d \qquad (5.28)
\]
A ‘Store Condition’ requires that memory items are stored in the memory access thread
after they are produced by the datapath thread. For each write request (d + 1 ≤ j ≤ d + w),
a lower bound determined in Section 5.6 ensures the ‘Store Condition’ holds for all x ∈
S{comm, j}.
In this case, we can insert a delay in the memory access thread to ensure the ‘Store
Conditions’ are met for all write references d + 1 ≤ j ≤ d + w and all x ∈ S{exec, j}. The duration
of the delay is determined as in (5.29).
\[
T_{rw} = -\min_{j} v \quad \text{s.t.} \quad v = lb_j,\; j = d + 1 \ldots d + w \qquad (5.29)
\]
The insertion of these delays into the program ensures that concurrent operations can
occur in the datapath and memory access thread whilst preserving correct semantic intent
in the original program.
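Once the lower bounds are available, (5.28) and (5.29) reduce to taking a minimum over the read references and over the write references. The sketch below shows that step only; the dictionary of lower bounds is hypothetical data, and the function name is not part of the thesis tool.

```python
def delays_from_lower_bounds(lb, d, w):
    """Apply (5.28) and (5.29).

    lb maps a 1-based reference index j to the lower bound lb_j of its
    quasi-polynomial; references 1..d are reads and d+1..d+w are writes.
    """
    T_e = -min(lb[j] for j in range(1, d + 1))
    T_rw = -min(lb[j] for j in range(d + 1, d + w + 1))
    return T_e, T_rw


# Hypothetical lower bounds for two read references and one write reference.
lower_bounds = {1: -12, 2: -7, 3: -3}
print(delays_from_lower_bounds(lower_bounds, d=2, w=1))  # (12, 3)
```

Because a single constant is chosen per thread, the reference with the most negative lower bound dictates the delay applied to all of them.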
When these delays are inserted into our example source code from 5.1, our automatic
procedure derives the code shown in Figure 5.3 for the datapath thread, with an execution
delay of Te = 35 and the code shown in Figure 5.2 with a turnaround delay of Trw = 0.
The procedure described in this chapter is generalizable and can be used to optimize any
perfectly nested loop nest which can be expressed in the Polytope Model. The loop nest
can contain any number of memory read and write references and any tiling
parameterisation t. The timing parameters Ta representing the duration of each datapath iteration,
Tu representing the number of cycles required for each ‘burst read’ or ‘burst write’
command and Tr representing the number of cycles required for a ‘row delay’ statement are
inputs to the procedure which allow exploration of a range of different implementation
architectures.
In Section 5.7, we show how the constant Trw can be used to determine the complete
execution time of the task. This is followed in Section 5.8, where we apply the technique
developed in this chapter to our example program from Figure 5.1, choosing different
timing parameters to demonstrate the impact that overlapping datapath and communication
can have on overall runtime.
5.7. Compile Time Evaluation of Overlapped Task Execution Time
When the ‘Fetch Conditions’ and ‘Store Conditions’ are met, there is a guarantee that
data-flow between the memory access thread and datapath thread is correct. The use of
Bernstein decomposition based bounding techniques, as demonstrated in Section 5.6.3
enables us to determine constants Trw and Te which guarantee the ‘Fetch Conditions’ and
‘Store Conditions’ are met throughout the program execution. We can also use these
constants to derive at compile time, the exact completion time of the computational task.
Crucially, we can incorporate the memory delays associated with ‘precharge’ and ‘ac-
tivate’ commands into our analysis, because these have been modelled by ‘row delay’
statements inserted into our memory access thread. The Trw delay calculated in Sec-
tion 5.6.3 enables us to model delays caused by the interaction between the datapath
thread and memory access thread, which means we can determine the overall execution
time of a computational task where memory transfer and execution operations occur
concurrently. To the best of our knowledge, this is the first work to provide an analytical method
for evaluating task runtime incorporating both memory access delays and overlapped
memory access and datapath operations.
We measure the complete execution time of a task by considering the time which
elapses between the first and last operations which are scheduled in the memory access
thread. Before the first operation in the memory access thread, no data is present
in on-chip memory, and after the last operation, all data calculated during the task will
have been committed to off-chip memory. The execution time of the complete task can be
determined solely with reference to the memory access thread, since all interaction with
the datapath thread is accounted for in the Trw ‘turnaround delay’ operations, inserted to
legalise overlapped execution.
Each operation in the memory access thread has a constant duration. We can therefore
determine overall execution time by counting the number of operations of each type
(‘burst read’, ‘burst write’, ‘turnaround delay’ and ‘row delay’) and multiplying by the
respective duration of that type.
Using the notation developed earlier in the chapter, we define the ‘turnaround delay’
statement index as j = 2m + 1. We define the set S{comm,2m+1} as the polytope whose enclosed
integer points represent each ‘turnaround delay’ statement. This set is equivalent to the
original source code iteration set {x | Ax ≤ b} with the dimensions xt . . . xn projected out.
If we apply non-parametric integer counting techniques to this set, we obtain an integer
value which is the number of ‘turnaround delay’ operations in the complete task. We
call this integer #{comm,2m+1}. The application of non-parametric counting techniques to the
sets representing ‘read burst’, ‘write burst’ and ‘row delay’ operations gives equivalent
integer values #{comm, j} representing the number of each type of operation. These can be
combined as in (5.30) to give the total runtime of the task.
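\[
\text{Total runtime} =
\underbrace{\sum_{j=1}^{d} \#_{\mathrm{comm},j} * T_u}_{\text{Time for `read burst' operations}}
+ \underbrace{\sum_{j=d+1}^{d+w} \#_{\mathrm{comm},j} * T_u}_{\text{Time for `write burst' operations}}
+ \underbrace{\sum_{j=m+1}^{2m} \#_{\mathrm{comm},j} * T_r}_{\text{Time for `row delay' operations}}
+ \underbrace{\#_{\mathrm{comm},2m+1} * T_{rw}}_{\text{Time for `turnaround delay' operations}}
\qquad (5.30)
\]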
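The runtime calculation is a weighted sum of operation counts, which the following sketch spells out. The per-reference counts and the timing values are hypothetical placeholders for the integers that non-parametric counting would return; they are not results from the thesis.

```python
def total_runtime(read_bursts, write_bursts, row_delays, turnarounds, T_u, T_r, T_rw):
    """Total cycles in the memory access thread: each operation count
    multiplied by the constant duration of that operation type."""
    burst_cycles = (sum(read_bursts) + sum(write_bursts)) * T_u
    row_cycles = sum(row_delays) * T_r
    turnaround_cycles = turnarounds * T_rw
    return burst_cycles + row_cycles + turnaround_cycles


# Hypothetical counts: one read and one write reference, 8 bursts each,
# 2 row changes per reference and 4 turnaround delays of 6 cycles.
cycles = total_runtime(
    read_bursts=[8], write_bursts=[8], row_delays=[2, 2],
    turnarounds=4, T_u=2, T_r=10, T_rw=6,
)
print(cycles)  # 16 * 2 + 4 * 10 + 4 * 6 = 96
```

Since every input is an integer known at compile time, the result is an exact cycle count rather than an estimate.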
5.8. Results
We demonstrate our approach on different parameterisations of the code in Figure 5.1.
In Table 5.5, we show the constants selected by the tool for each different reuse level.
Alongside this we show the total time for execution with and without overlapped accesses.
For the t = 1 and t = 4 parameterisations, we have zero turnaround delay, which means all
operations in the datapath thread overlap with operations in the memory access thread.
In those parameterisations, we see a performance improvement when memory operations
are overlapped with operations in the datapath, with improvements of 10.77% and 4%
respectively in the total time needed to complete all datapath
operations and write data back to memory. For the t = 2 and t = 3 parameterisations, we
see a drop in performance when we overlap execution with datapath operations. This
happens because the bounding procedure finds a constant ‘turnaround delay’ offset to
be applied in each outer loop iteration. For examples such as this one, where the number
of inner loop iterations varies depending on the outer loop iteration value, this constant
offset may lead to inefficient performance in some outer loops.
In the t = 4 parameterisation, where no reuse or reordering of memory operations is
exploited, and there is a single read and write in the innermost loop, the performance
of the program is improved because execution can be overlapped with a ‘row delay’
operation in the memory access thread.

Table 5.5.: Results showing impact of overlapping in datapath and execution thread (Each

t    Trw   Te    Time with overlap   Time without overlap   Improvement
1    0     35    58                  65                     +10.77%
2    6     23    120                 109                    -10.09%
3    5     25    132                 119                    -10.92%
4    0     13    168                 175                    +4.00%

Table 5.6.: Results showing impact of overlapping in datapath and execution thread (Each

t    Trw   Te    Time with overlap   Time without overlap   Improvement
1    0     21    58                  86                     +32.56%
2    6     23    120                 130                    +7.69%
3    4     19    128                 140                    +8.57%
4    0     13    168                 196                    +14.29%
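The improvement figures in Table 5.5 can be reproduced from the two total-time columns. The sketch below assumes that the fourth column is the time with overlapping and the fifth the time without, and that improvement is measured relative to the non-overlapped time; this reading is inferred from the printed percentages rather than stated in the recovered table header.

```python
def improvement(overlapped, non_overlapped):
    """Percentage reduction in total time obtained by overlapping."""
    return 100.0 * (non_overlapped - overlapped) / non_overlapped


# (t, time with overlap, time without overlap), copied from Table 5.5.
table_5_5 = [(1, 58, 65), (2, 120, 109), (3, 132, 119), (4, 168, 175)]

for t, overlapped, non_overlapped in table_5_5:
    print(t, f"{improvement(overlapped, non_overlapped):+.2f}%")
# 1 +10.77%
# 2 -10.09%
# 3 -10.92%
# 4 +4.00%
```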
More favourable outcomes emerge if the duration of each datapath cycle is increased.
In Table 5.6, each datapath iteration takes four cycles to complete. Here we see a more
significant performance increase because proportionally more datapath iteration cycles
overlap with memory operations. In Table 5.7, where each datapath iteration takes eight cycles
to complete, the performance improvement over non-overlapped code increases even
further, reflecting the increasing ratio between the execution time and memory access
time.
More significantly, in all parameterisations of Table 5.6 except t = 1, the total time for
execution with overlapping is less than or equal to the time for the completion of the equivalent
parameterisation from Table 5.5. This means that we can exploit multicycle operations in
the datapath to reduce area or save power and still end up with an implementation which
completes in fewer clock cycles. This interesting behaviour arises because the memory
operations are the performance bottleneck in this example code.
Table 5.7.: Results showing impact of overlapping in memory access and datapath thread

t    Trw   Te    Time with overlap   Time without overlap   Improvement
1    22    15    80                  114                    +29.82%
2    8     19    126                 158                    +20.25%
3    8     15    144                 168                    +14.29%
4    0     13    168                 224                    +25.00%

When we tried applying the same bounding technique to the example benchmarks
given in Chapter 4, the procedure did not finish within two hours of runtime. We had
expected that the bounding procedure time would scale exponentially with the number
of variables in our quasi-polynomials. We had also expected that the procedure would
scale exponentially with the number of vertices in each parametric chamber over which
we decomposed the quasi-polynomials. However, both the number of vertices of each
chamber and the number of variables in each quasi-polynomial are related to the number
of loop bounds, which is small.
In some exploratory work to try and discover why the quasi-polynomial bounding
has such poor scaling, we investigated the number of different disjoint pieces in each
quasi-polynomial for the MMM benchmark and our example from Section 5.2 (labelled
EX5). These are shown in Table 5.8. From this table, it is obvious that there are many
more polynomial pieces in the quasi-polynomials for the MMM example than in EX5.
We tried bounding each quasi-polynomial piece individually and from this we formed the
hypothesis that our poor runtime scaling is attributable to the number of ‘fractional’
expressions in each quasi-polynomial. The most probable explanation for this is that each
fractional expression, whose range is [ 0 : 1 ), is represented in the bounding procedure
by a new variable and therefore adds an additional bounded dimension to the Bernstein
decomposition. Since the Bernstein bounding procedure scales exponentially with the
number of variables, we should expect this to explain the unacceptably long runtimes
observed.
Table 5.8.

Benchmark   t    Number of Pieces in Memory Condition
MMM         1    70    46    48    48
MMM         2    35    16    16    16
MMM         3    9     8     2     2
MMM         4    1     1     1     1
EX5         1    6     6     -     -
EX5         2    6     6     -     -
EX5         3    2     2     -     -
EX5         4    1     1     -     -

We note that because we are trying to bound the ‘Fetch Conditions’ and ‘Store Conditions’
over discrete integer points enclosed by a polytope, it is possible to enumerate each point
and evaluate the conditions to find a minimum bound. However, doing so would imply
a runtime for the procedure which scales with the number of iterations within our source
polytope. In Chapter 6, we discuss some promising ideas for improving the runtime of
the polynomial bounding procedure.
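The enumeration alternative mentioned above can be sketched in a few lines. The code scans a bounding box, keeps the integer points that satisfy Ax ≤ b, and takes the exact minimum of a condition over them. The triangular iteration domain and the condition function are hypothetical, and the bounding box is an extra input assumed here for simplicity.

```python
from itertools import product


def integer_points(A, b, box):
    """Yield the integer points x inside `box` that satisfy A x <= b."""
    for x in product(*(range(lo, hi + 1) for lo, hi in box)):
        if all(sum(a * xi for a, xi in zip(row, x)) <= bi
               for row, bi in zip(A, b)):
            yield x


def exact_minimum(f, A, b, box):
    """Exact minimum of f over the integer points of the polytope."""
    return min(f(x) for x in integer_points(A, b, box))


# Hypothetical triangular loop nest: 0 <= j <= i <= 3.
A = [[-1, 0], [1, 0], [0, -1], [-1, 1]]
b = [0, 3, 0, 0]
box = [(0, 3), (0, 3)]


def condition(x):
    """Hypothetical stand-in for the slack in a 'Fetch Condition'."""
    i, j = x
    return i - 2 * j


print(len(list(integer_points(A, b, box))))   # 10 iteration points
print(exact_minimum(condition, A, b, box))    # -3, attained at i = j = 3
```

The result is exact, but the work grows with the number of iteration points, which is the scaling behaviour the analytical bounding approach is intended to avoid.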
5.9. Conclusion
In this section, we have derived conditions which allow for the overlapping of concurrent
operations in memory access thread and datapath thread. We applied a known
polynomial bounding technique to demonstrate the effectiveness of the technique on a small
example code. The results show up to 32% improvement in performance and, in some
cases, we are able to overlap operations such that all the datapath cycles are executed
concurrently with operations in the memory access thread.
One obvious benefit of the technique is the reduction in overall runtime demonstrated;
however, a secondary benefit is that we can show the impact of reducing the datapath
throughput at compile-time. Reduced throughput requirements mean that area-efficient
datapath implementations using techniques such as bit-serial arithmetic, or resource
sharing, become feasible.
Furthermore, the technique allows the user to derive exact analytical bounds on the
completion time of the loop-nest. This is useful compile time information which allows
bounds on memory access time to be determined without recourse to expensive simulation
techniques. This feature allows the use of external memory in systems where
real-time periodic deadlines must be met and is a step towards the vision outlined in
Chapter 1, in which high level synthesis tools can map logical memories to external
memory and tightly-couple external memory controllers into datapath design.
When polynomial bounding techniques are used to determine parameters for legalizing
the overlapping of memory access and datapath threads, the time-complexity of the
program analysis should grow with the dimension of the loop-nest, not the absolute
number of elements in the program, which makes our technique favourable to simulation
when large kernels must be analysed. In practical experiments, it has not been possible
to verify this scaling trend because the constant time complexity scaling factor of the
bounding procedure is large. In Chapter 6, we propose further research directions which
can trade off the quality of the bounds to achieve faster analysis runtime and make
analytical bounding of quasi-polynomials practical for realistic problems.
While the practical problems of bounding the polynomials initially limit the applicabil-
ity of this work, we emphasise that the formulation of analytical expressions describing
‘Fetch Constraints’ and ‘Store Constraints’ is an important and novel contribution, since
manual inspection of these expressions gives insight into how the schedules in the mem-
ory access and datapath threads might be improved that isn’t readily apparent from
simulation.
In Chapter 6, which forms the final chapter of this thesis, we summarise the key thesis
achievements and propose future work to broaden the applicability of the work.
6. Conclusion
This thesis has proposed methods for designing application specific memory controllers
to maximise efficient use of off-chip memory bandwidth. In this chapter, we first provide
a summary of our key achievements. We follow this with some suggestions for future
research directions and make some final concluding remarks.
6.1. Summary of Key Thesis Achievements
Throughout the preceding chapters, we have developed techniques for the systematic
design of DRAM memory controllers using a static compile time analysis of the memory
accesses within nested loop code kernels. We have motivated our research by considering
historical and projected scaling trends which show a widening gap over time between
achievable off-chip memory bandwidth and the exponentially increasing silicon area in
which on-chip memory and logic resources may be implemented. These trends imply
that it is a valuable activity to design sophisticated memory controllers. By modelling
the physical ‘row’ structures and modelling the ‘burst’ behaviour of dynamic memories
as described in Section 4.5 of Chapter 4, we enable loop reordering transformations to
be applied, which ensure efficient use of the DRAM interface. In experiments with code
kernels from real-world applications, data-reuse reduced the number of off-chip memory
accesses and command reordering was shown to improve bandwidth efficiency by up to
4×.
The second, and related, contribution in Chapter 4 is the practical realisation of memory
controllers exploiting data-reuse and command reordering. We demonstrate two methods
in Section 4.4 and Section 4.5 in which hardware can be automatically derived to generate
an efficient sequence of memory accesses. This is a step towards a long-term vision of
integrating DRAM command scheduling and datapath design, rather than decoupling
the two steps completely with the use of DRAM controller IP with non-deterministic
handshaking interfaces.
The third novel contribution is a characterisation of the essential constraints which
define a coupling between logic dedicated to off-chip data-transfer and logic implementing
an application-specific datapath. In Chapter 5, we form quasi-polynomial inequality
constraints which characterise the necessary coupling between these two activities. The
definition of this coupling as an algebraic relationship allows greater insight into how
performance might be improved than might be gained from a purely simulation-based
approach. We show in Section 5.6 that an optimal solution can be found using
Bernstein polynomial decomposition which enables us to reduce overall execution time by
overlapping memory access and datapath operations.
A fourth notable contribution of this thesis is that it provides a method of determining,
at compile time, the total execution time of a code kernel including realistic modelling of
memory access time. Because our approach to this problem in Section 5.7 allows execution
time to be determined without simulation, it is ideally suited to evaluating application
performance in the early stages of a design project. Furthermore, since the time-complexity of
the technique varies with structural parameters of the targeted loop nest, rather than the
number of iterations within it, it is likely to scale better than a simulation based approach
when the number of loop iterations is large.
To put these achievements in context, we highlight some future research directions
which overcome difficulties faced within our research and expand the scope of the work.
6.2. Suggested Future Research Directions
In this section, we suggest future research directions to overcome limitations presented
within the thesis, possible complementary methods for improving DRAM performance




6.2.1. Exploration of Scalable Bounding Methods for Quasi-Polynomials
In Chapter 5, we discussed methods for forming quasi-polynomial inequality constraints
representing the ‘Fetch Constraints’ and ‘Store Constraints’ which capture the interaction
between two communicating threads. We used Bernstein Polynomial decomposition to
find bounds for those quasi-polynomial functions in order to determine optimum values
for Trw and Te : parameters which legalised dataflow between the two threads. The time
taken to bound these polynomials should vary with the problem description (i.e. the
number of nested loops and the number of parametric chambers in their chamber decom-
position). For large problems, this approach should be faster than methods dependent
on a simulation-based approach to determining legal parameters for Trw and Te.
However, in practice, bounding these polynomials using our existing tools is unacceptably
slow. Much could be done to accelerate the procedure. As one example, polynomial
bounds can be found independently for each piecewise section of the quasi-polynomial.
This means a parallel speed-up can be achieved by making use of multiple cores in a
workstation, or multiple nodes in a computer cluster.
Another approach would be to trade off the quality of the bounds achieved for runtime.
The Bernstein bounding implementation we have used (from Barvinok [84]) expands each
quasi-polynomial over the chamber in which it is valid, with the number of dimensions
in the expansion defined by the number of vertices in the chamber. A more scalable
approach might use bounding box relaxations to reduce the number of variables in the
Bernstein expansion and improve runtime (at the expense of looser bounds upon the
target polynomial). There is likely to be some value in studying the specific structure of
quasi-polynomials produced from the enumeration of polytopes, as this may give insight
into specialized ways of bounding them which are not applicable in finding lower bounds
for general polynomials. A study of different polynomial bounding techniques framed in




6.2.2. Exploration of Deterministic DRAM Refresh Options
During this thesis, we considered the timing behaviour of DRAM with respect
to the cost of opening and closing memory rows but have made little mention of the
other important function of an SDRAM controller: the need to manage the issuing of
‘refresh’ commands to the DRAM device. Our key reasoning behind this omission is
that the bandwidth cost of ‘refresh’ commands is small ( 2%). The most significant
impact of ‘refresh’ commands is in increasing the average and worst case latency of
memory systems. The increase in average latency can hurt performance in cache memory
systems within a general purpose processor because the datapath must stall waiting for
data. The large increase in ‘worst-case’ latency caused by refresh commands is also very
significant in real-time systems with periodic tasks, since they must often consider the
worst-case memory latency when determining an achievable task period. However, in
loop nests with a known sequence of operations, commands can be issued in a pipelined
fashion to the external DRAM device (since their sequence is predetermined) and so we
have concentrated our efforts on optimizing bandwidth rather than latency.
However, there is certainly the opportunity to schedule refresh operations to minimize
the number of ‘precharge’ operations they require and minimize their impact on overall
memory bandwidth. There is also the opportunity to hide the latency of SDRAM refresh
operations by exploiting access to multiple ‘ranks’ of DRAM devices (memory devices
sharing a common address and data bus, but with individual chip select lines). This
would allow ‘refresh’ operations in one rank to take place while simultaneous read or
write accesses take place in another rank.
Finally, an exciting opportunity when variable life-times are well known (for instance,
in code expressed in the Polytope Model) is selective refresh of data based on its live-out
property. Data which we know will not be accessed again need not be refreshed by the
SDRAM controller. An intelligent SDRAM controller could use this property to reduce
the bandwidth cost of DRAM ‘refresh’ operations.
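The idea of selective refresh can be sketched as follows. The model below is hypothetical: it assumes each row holds useful data only between a first write and a last read (as a lifetime analysis might report), that every row is nominally refreshed at fixed multiples of a period, and it ignores retention-time details. The type and function names are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical live interval of the data held in one DRAM row: the row
 * holds useful data from first_write until last_read. */
struct row_lifetime {
    long first_write;
    long last_read;
};

/* Count the refresh commands still needed when a row is refreshed only
 * while its data is live. Each row is nominally refreshed at times
 * period, 2*period, ... strictly below horizon. */
long needed_refreshes(const struct row_lifetime *rows, size_t n_rows,
                      long period, long horizon)
{
    long needed = 0;
    for (size_t r = 0; r < n_rows; r++) {
        for (long t = period; t < horizon; t += period) {
            if (t > rows[r].first_write && t < rows[r].last_read)
                needed++;
        }
    }
    return needed;
}
```

With three rows live over (0, 25), (30, 35) and (5, 100), a period of 10 and a horizon of 100, only 11 of the 27 nominal refresh commands remain necessary under this model.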
6.2.3. Derive SDRAM Bank Partitioning Scheme from Memory Access Schedule
This PhD thesis has concentrated on demonstrating opportunities for reducing the cost
of memory access within a single DRAM bank. By grouping together accesses made to
the same row, we have demonstrated in Chapter 4 that we can reduce the number of
‘precharge’ and ‘activate’ commands and their associated impact on memory bandwidth.
A further optimization, enabled by the grouping together of accesses to the same row, is the
mapping of consecutively accessed rows to different banks. The explicit representation of
DRAM row accesses in the Polytope Model introduced in Chapter 4 provides a mechanism
through which we can express the order of SDRAM row accesses. We can express each
pair of consecutively accessed rows as a tuple, and the set of tuples can be derived by set
operations on the generating function representation demonstrated in Chapter 3. Future
work might explore how a suitable mapping can be derived from this representation to
ensure that those consecutively accessed rows are mapped to different SDRAM banks.
If such a mapping can be efficiently found, the likely performance impact is significant,
since all ‘precharge’ and ‘activate’ commands could be hidden.
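The property a bank mapping must satisfy can be sketched with an explicit row sequence. The code below is only an illustration under stated assumptions: it enumerates the row order directly rather than deriving the tuples symbolically, and the function names and the four-bank interleaving are hypothetical choices, not the scheme proposed here.

```c
#include <stddef.h>

/* Count pairs of consecutively accessed, distinct rows that a candidate
 * mapping places in the same bank. row_seq lists rows in access order;
 * bank_of is a hypothetical mapping from row index to bank index. */
size_t bank_conflicts(const unsigned *row_seq, size_t n,
                      unsigned (*bank_of)(unsigned row))
{
    size_t conflicts = 0;
    for (size_t i = 1; i < n; i++) {
        unsigned prev = row_seq[i - 1];
        unsigned next = row_seq[i];
        if (prev != next && bank_of(prev) == bank_of(next))
            conflicts++;
    }
    return conflicts;
}

/* Example candidate: interleave rows across four banks (an assumption). */
unsigned interleave4(unsigned row)
{
    return row % 4u;
}
```

A mapping is suitable for a given schedule when `bank_conflicts` returns zero. The row order 0, 1, 2, 3, 4 is conflict-free under `interleave4`, whereas the order 0, 4, 8 produces a conflict at every step.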
While eliminating ‘precharge’ and ‘activate’ commands might seem to obviate some
of the work presented in the thesis, the Tfaw four-activation window and Trrd ‘activate-
activate’ timing constraints from Table 2.3 of Chapter 2 mean that efficient access to
multiple banks is only practical when multiple bursts of data are transferred from each
activated SDRAM row. It is therefore likely that techniques from this thesis, which
increase the number of consecutive accesses to each row, will play a significant role in
enabling future research into practical bank partitioning schemes.
6.2.4. Integrating NAND Flash Memory into Design Flow
NAND Flash Memory is a non-volatile memory technology used in storage applications.
Flash memory is competitive with hard disk storage in markets where reliability and
speed, especially random access speed, are valued. Flash memory arrays are arranged as
a hierarchy of pages, blocks and planes, similar to a DRAM structure. Unlike a DRAM
interface, NAND flash memory interfaces multiplex data and addresses on the same
pins. After issuing a ‘read’ command to the memory along with five bytes of address
data on consecutive cycles, a delay (typically 25µs) is necessary to transfer a ‘page’ of data
from the array of floating-gate transistors into a data register. Data can be either read
sequentially (with one byte transferred every 30ns) or from random addresses within the
page, typically requiring a 120ns cycle delay to issue each new memory address. The
overall result is that sequential read accesses are more efficient than random read accesses
within a page and there is a very large penalty whenever a new page is accessed (relative
to the cost of sequential accesses within a page). Thus the timing behaviour of NAND
flash memory has strong similarities to the behaviour of DRAM memory studied within
this thesis.
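The read timing described above can be captured in a small cost model. This is a sketch using only the typical figures quoted in the text; it ignores the command and address cycles, and it assumes that each random access pays the 120ns address delay in addition to the 30ns byte transfer. The names are hypothetical.

```c
/* Typical timing figures quoted in the text, in nanoseconds. */
enum { PAGE_LOAD_NS = 25000, BYTE_NS = 30, RANDOM_ADDR_NS = 120 };

/* Time to read n_bytes sequentially from one newly accessed page. */
long sequential_read_ns(long n_bytes)
{
    return PAGE_LOAD_NS + n_bytes * BYTE_NS;
}

/* Time to read n_bytes from random offsets within one newly accessed
 * page, assuming every byte needs a new address (a simplification). */
long random_read_ns(long n_bytes)
{
    return PAGE_LOAD_NS + n_bytes * (RANDOM_ADDR_NS + BYTE_NS);
}
```

Under these assumptions, reading 100 bytes costs 28000ns sequentially and 40000ns at random offsets; in both cases the 25µs page load dominates, which is the analogue of a DRAM row activation.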
An interesting property that is not seen in DRAM memory is the asymmetric cost
of read and write operations. NAND Flash memory writes are split into two phases:
first, a Block Erase phase which must erase all the contents of a block (typically 64 pages)
and then further operations to program each page within a block individually. Block
erase operations typically take 2ms and each page programming operation may take a
further 300µs. An eﬀective NAND flash controller must gather write accesses together
to minimize the number of block erase and page program operations, both to achieve
acceptable performance and to avoid excessive ‘wear’ of the NAND Flash device, which
eventually leads to device failure.
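The benefit of gathering writes can be illustrated with the typical figures above. The sketch below is deliberately pessimistic about the ungathered case: it assumes every separately handled write pays its own block erase and ignores the cost of preserving other live pages in the block. The names are hypothetical.

```c
/* Typical figures quoted in the text, in microseconds. */
enum { ERASE_US = 2000, PROGRAM_US = 300 };

/* Cost of writing n_pages pages of one block behind a single erase. */
long gathered_write_us(long n_pages)
{
    return ERASE_US + n_pages * PROGRAM_US;
}

/* Cost when each page write is handled separately and pays its own
 * block erase (a simplifying, pessimistic assumption). */
long ungathered_write_us(long n_pages)
{
    return n_pages * (ERASE_US + PROGRAM_US);
}
```

For a full block of 64 pages this model gives 21200µs when the writes are gathered and 147200µs when they are not, and the gathered schedule also performs 63 fewer erase operations.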
All these features mean intelligent scheduling of Flash memory commands is key to
achieving good performance. The techniques developed in Chapter 4 and Chapter 5 for
use with SDRAM memory could easily be adapted for use with Flash memory or with a
hybrid system combining SDRAM and Flash memory. The development of an intelligent
HLS tool which can map logical memories to on-chip SRAM memory and both off-chip
Flash and DRAM devices, and also manage the automatic transfer of data between them, is
an exciting future goal, since such a tool would be able to manage data-persistence, enormous




Through each of the research outcomes highlighted in Section 6.1, we have developed
techniques both to characterize and optimize memory performance within nested loop
code kernels. In many programs, these make up a significant proportion of program
run-time. Using our techniques, a two-fold benefit of improved performance and a large
reduction in the compile-time uncertainty over the execution time of programs
can therefore be realised.
It is our hope that this research sparks further interest in developing new static com-
pilation techniques which exploit compile time knowledge of the sequence of accessed
memory addresses to develop sophisticated memory controllers. The tight integration
of such controllers with custom datapaths provides an exciting route to realising the full
potential of future FPGA platforms.
Bibliography
[1] J. Garloﬀ, “The Bernstein Expansion and its Applications,” Journal of the American
Romanian Academy, vol. 25-27, pp. 80–85, 2003.
[2] B. L. Jacob, S. W. Ng, and D. T. Wang, Memory Systems: Cache, DRAM, Disk. Morgan
Kaufmann, 2008.
[3] Micron, “DDR2 Datasheet.” http://download.micron.com/pdf/datasheets/
dram/ddr2/256MbDDR2.pdf. Accessed 11th Sept. 2012.
[4] “External Memory Interface Handbook : Optimizing The Controller,” Tech. Rep.
EMI-DG-013-3.0, Altera, June 2012.
[5] “International Technology Roadmap for Semiconductors : Assembly and Packag-
ing.” http://www.itrs.net/Links/2010ITRS/2010Update/ToPost/2010Tables_
AssemblyAndPackaging_FOCUS_E2_ITRS.xls, 2010. Accessed 27th Sept. 2011.
[6] ITRS, “International Technology Roadmap for Semiconductors : Process In-
tegration, Devices and Structures.” http://www.itrs.net/Links/2009ITRS/
2009Chapters_2009Tables/2009_PIDS.pdf, 2009. Accessed 11th Sept. 2012.
[7] G. E. Moore, “Cramming more Components onto Integrated Circuits,” Proceedings
of the IEEE, vol. 86, no. 1, pp. 82–85, 1998.
[8] J. Hennessy and D. Patterson, Computer Architecture : A Quantitative Approach.
Morgan Kaufmann, 6th ed., 2006.
[9] P. S. Henry Stracovsky, “Using a Timing-Lookup-Table and Page Timers to Deter-
mine the time Between Two Consecutive Memory Accesses.” US Patent, May 2002.
US 6385708.
[10] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory Access
Scheduling,” in ISCA ’00 : Proceedings of the 27th Annual International Symposium on
Computer Architecture, vol. 28, (Vancouver, BC, Canada), pp. 128–138, May 2000.
[11] W.-f. Lin, “Reducing DRAM Latencies with an Integrated Memory Hierarchy Design,”
in HPCA’01 : Proceedings of the 7th International Symposium on High-Performance
Computer Architecture, (Washington, DC, USA), pp. 301–, IEEE, 2001.
[12] J. Shao and B. T. Davis, “A Burst Scheduling Access Reordering Mechanism,” in
HPCA’07 : Proceedings of the 14th IEEE International Conference on High-Performance
Computer Architecture, (Phoenix, AZ, USA), pp. 285–294, IEEE, 2007.
[13] D. Chavarría-Miranda and J. Mellor-Crummey, “Effective Communication Coalescing
for Data-Parallel Applications,” in PPoPP’05 : Proceedings of the 10th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, (New York,
NY, USA), pp. 14–25, ACM, 2005.
[14] K. Skadron and D. W. Clark, “Design Issues and Tradeoﬀs for Write Buﬀers,” in
HPCA’97 : Proceedings of the 3rd IEEE Symposium on High-Performance Computer
Architecture, (San Antonio, TX, USA), pp. 144–155, IEEE, 1997.
[15] “External Memory Interface Handbook : Functional Description (HPC II),” Tech.
Rep. EMI-RM-004-2.0, Altera, June 2012.
[16] “Using External Memory Interfaces to Achieve Eﬃcient High-Speed Memory So-
lutions,” Tech. Rep. WP-01169-1.0, Altera, November 2011.
[17] Stephen T. Novak, Scott Waldron and John C. Peck Jr, “Queue Based Memory
Controller.” US Patent, Dec 2002. US 6496906.
[18] L. Johnson, “Improving DDR SDRAM Eﬃciency with a Reordering Controller,”
Xcell Journal, vol. 3, pp. 38–41, 2009.
[19] “Intel Xeon Processor E5 Product Families Datasheet - Volume One.”
http://www.intel.com/content/dam/www/public/us/en/documents/
datasheets/xeon-e5-1600-2600-vol-1-datasheet.pdf, May 2012. Accessed
27th Sept. 2011.
[20] B. Akesson, K. Goossens, and M. Ringhofer, “Predator : A Predictable SDRAM
Memory Controller,” in CODES+ISSS ’07 : Proceedings of the 5th IEEE/ACM Inter-
national Conference on Hardware/Software Codesign and System Synthesis, (Salzburg,
Austria), pp. 251–256, 2007.
[21] B. Akesson, W. Hayes, and K. Goossens, “Automatic Generation of Eﬃcient Pre-
dictable Memory Patterns,” in RTCSA’11 : Proceedings of the 17th IEEE International
Conference on Embedded and Real-Time Computing Systems and Applications, (Toyama,
Japan), pp. 177 –184, Aug. 2011.
[22] I. Hur and C. Lin, “Adaptive History-Based Memory Schedulers,” in MICRO’04 :
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture,
(Portland, OR, USA), pp. 343–354, IEEE, 2004.
[23] C. Ma and S. Chen, “A DRAM Precharge Policy Based on Address Analysis,”
in DSD ’07 : Proceedings of the 10th Euromicro Conference on Digital System Design
Architectures, Methods and Tools, (Luebeck, Germany), pp. 244–248, IEEE, 2007.
[24] Y. Xu, A. S. Agarwal, and B. T. Davis, “Prediction in Dynamic SDRAM Controller
Policies,” in SAMOS ’09 : Proceedings of the 9th International Workshop on Embedded
Computer Systems: Architectures, Modeling, and Simulation, (Samos, Greece), pp. 128–
138, Springer-Verlag, 2009.
[25] E. Ipek, O. Mutlu, J. F. Martinez, and R. Caruana, “Self-Optimizing Memory Con-
trollers: A Reinforcement Learning Approach,” in ISCA’08 : Proceedings of the 35th
IEEE International Symposium on Computer Architecture, (Los Alamitos, CA, USA),
pp. 39–50, IEEE Computer Society, 2008.
[26] A. Darte, R. Schreiber, and G. Villard, “Lattice-Based Memory Allocation,” IEEE
Transactions on Computers, vol. 54, no. 10, pp. 1242–1257, 2005.
[27] Q. Liu, G. A. Constantinides, K. Masselos, and P. Y. K. Cheung, “Automatic On-chip
Memory Minimization for Data Reuse,” in FCCM ’07 : Proceedings of the 15th Annual
IEEE Symposium on Field-Programmable Custom Computing Machines, (Napa Valley,
CA, USA), pp. 251–260, 2007.
[28] E. S. Chung, J. C. Hoe, and K. Mai, “CoRAM: An In-Fabric Memory Architecture for
FPGA-based computing,” in FPGA ’11 : Proceedings of the 19th Annual International
Symposium on Field Programmable Gate Arrays (J. Wawrzynek and K. Compton, eds.),
(Monterey, CA, USA), pp. 97–106, ACM, 2011.
[29] H. S. Kim, N. Vijaykrishnan, M. Kandemir, E. Brockmeyer, F. Catthoor, and M. J.
Irwin, “Estimating Influence of Data Layout Optimizations on SDRAM Energy
Consumption,” in ISLPED ’03 : Proceedings of the 2003 International Symposium on
Low Power Electronics and Design, (Seoul, South Korea), pp. 40–43, 2003.
[30] M. Presburger, “Über die Vollständigkeit eines gewissen Systems der Arithmetik
ganzer Zahlen, in welchem die Addition als einzige Operation hervortritt,” in
Sprawozdanie z I Kongresu matematyków słowiańskich, Warszawa 1929, pp. 92–101, 395,
Warsaw, 1930. Annotated English version also available [55].
[31] H.-K. Chang and Y.-L. Lin, “Array Allocation Taking into Account SDRAM Char-
acteristics,” in ASP-DAC ’00: Proceedings of the 2000 Asia and South Pacific Design
Automation Conference, (New York, NY, USA), pp. 497–502, ACM, 2000.
[32] S. Goossens, T. Kouters, B. Akesson, and K. Goossens, “Memory-Map Selection for
Firm Real-Time SDRAM controllers,” in Design, Automation Test in Europe Conference
Exhibition (DATE), 2012, pp. 828–831, March 2012.
[33] A. Khare, P. R. Panda, N. D. Dutt, and A. Nicolau, “High-Level Synthesis with
SDRAMs and RAMBUS DRAMs,” IEICE Transactions on Fundamentals of Electronics,
Communications, and Computer Sciences, vol. E82A, no. 11, pp. 2347–2355, 1999.
[34] C. Alias, A. Darte, and A. Plesco, “Optimizing DDR-SDRAM communications at
C-level for automatically-generated hardware accelerators an experience with the
Altera C2H HLS tool,” in Application-specific Systems Architectures and Processors
(ASAP), 2010 21st IEEE International Conference on, pp. 329–332, July 2010.
[35] C. Alias, A. Darte, and A. Plesco, “Optimizing Remote Accesses for Oﬄoaded
Kernels: Application to High-level Synthesis for FPGA,” in Proceedings of the 17th
ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, PPoPP
’12, (New York, NY, USA), pp. 285–286, ACM, 2012.
[36] “Automated Generation of Hardware Accelerators With Direct Memory Access
From ANSI/ISO Standard C Functions,” Tech. Rep. WP-AGHRDWR-1.0, Altera,
May 2006.
[37] C. Lengauer, “Loop Parallelization in the Polytope Model,” in CONCUR ’93 :
Proceedings of the 4th International Conference on Concurrency Theory, pp. 398–417,
Springer, Aug 1993.
[38] W. Kelly and W. Pugh, “A Framework for Unifying Reordering Transformations,”
Tech. Rep. UMIACS-TR-92-126.1, University of Maryland, College Park, MD, USA,
1993.
[39] A. Barvinok, “A Polynomial Time Algorithm for Counting Integral Points in Poly-
hedra when the Dimension is Fixed,” in FOCS’93 : Proceedings of the 34th IEEE
Annual Symposium on the Foundations of Computer Science, pp. 566–572, IEEE Com-
puter Society, 1993.
[40] A. Barvinok and J. E. Pommersheim, New Perspectives in Algebraic Combinatorics,
vol. 38 of Mathematical Sciences Research Institute Publications, ch. An Algorithmic
Theory of Lattice Points in Polyhedra, pp. 91–147. Cambridge University Press,
1999.
[41] S. Verdoolaege, R. Seghir, K. Beyls, V. Loechner, and M. Bruynooghe, “Counting
Integer Points in Parametric Polytopes using Barvinok’s Rational Functions,”
Algorithmica, vol. 48, pp. 37–66, Mar 2007.
[42] R. M. Karp, R. E. Miller, and S. Winograd, “The Organization of Computations for
Uniform Recurrence Equations,” Journal of the ACM, vol. 14, pp. 563–590, July 1967.
[43] C. Bastoul, Improving Data Locality in Static Control Programs. PhD thesis, University
Paris 6, Pierre et Marie Curie, France, Dec 2004.
[44] L.-N. Pouchet, Iterative Optimization in the Polyhedral Model. PhD thesis, University
of Paris-Sud 11, Orsay, France, Jan 2010.
[45] C. Bastoul and P. Feautrier, “More Legal Transformations for Locality,” in Euro-
Par’04 : Proceedings of the 10th International Euro-Par conference, LNCS 3149, pp. 272–
283, Springer-Verlag, Aug 2004.
[46] G. B. Dantzig and B. C. Eaves, “Fourier-Motzkin Elimination and Its Dual,” Journal
of Combinatorial Theory, Series A, vol. 14, no. 3, pp. 288–297, 1973.
[47] A. Schrijver, Theory of Linear and Integer Programming. London, UK: Wiley, June
1998.
[48] P. Feautrier, “Parametric Integer Programming,” RAIRO Recherche Operationnelle,
vol. 22, 1988.
[49] M. Palkovic, Enhanced Applicability of Loop Transformations. PhD thesis, Technische
Universiteit Eindhoven, 2007.
[50] M.-W. Benabderrahmane, L.-N. Pouchet, A. Cohen, and C. Bastoul, “The Polyhedral
Model Is More Widely Applicable Than You Think,” in CC’10 : Proceedings of
2010 International Conference on Compiler Construction, (Paphos, Cyprus), pp. 283–
303, Springer Verlag, March 2010.
[51] D. E. Maydan, J. L. Hennessy, and M. S. Lam, “Eﬃcient and Exact Data Depen-
dence Analysis,” in PLDI ’91 : Proceedings of the ACM SIGPLAN 1991 conference on
Programming language design and implementation, PLDI ’91, (New York, NY, USA),
pp. 1–14, ACM, 1991.
[52] U. Banerjee, “A Theory of Loop Permutations,” in LCPC’90 : Proceedings of the
2nd Workshop on Languages and Compilers for Parallel Computing, pp. 54–74, Pitman
Publishing, 1990.
[53] M. Berry, D. Chen, P. Koss, D. Kuck, S. Lo, Y. Pang, L. Pointer, R. Roloﬀ, A. Sameh,
E. Clementi, S. Chin, D. Schneider, G. Fox, P. Messina, D. Walker, C. Hsiung,
J. Schwarzmeier, K. Lue, S. Orszag, F. Seidl, O. Johnson, and R. Goodrum, “The
PERFECT Club Benchmarks: Eﬀective Performance Evaluation of Supercomput-
ers,” International Journal of Supercomputer Applications, vol. 3, pp. 5–40, 1988.
[54] W. Pugh, “The Omega Test: A Fast and Practical Integer Programming Algorithm
for Dependence Analysis,” Communications of the ACM, vol. 8, pp. 4–13, 1992.
[55] R. Stansifer, “Presburger’s article on Integer Arithmetic: Remarks and translation,”
Tech. Rep. TR84-639, Cornell University, Computer Science Department, Sep. 1984.
[56] P. Quinton and V. V. Dongen, “The Mapping of Linear Recurrence Equations on
Regular Arrays,” Journal of VLSI Signal Processing, vol. 1, no. 2, pp. 95–113, 1989.
[57] A. G. Soufiane Baghdadi and A. Cohen, “Putting Automatic Polyhedral Compilation
for GPGPU to Work,” in CPC’10 : Proceedings of 15th Workshop on Compilers
for Parallel Computers, July 2010.
[58] L. Lamport, “The Parallel Execution of DO Loops,” ACM Communications, vol. 17,
pp. 83–93, Feb 1974.
[59] M. Wolfe, “Loop Skewing: The Wavefront Method Revisited,” International Journal
of Parallel Programming, vol. 15, pp. 279–293, 1986. 10.1007/BF01407876.
[60] P. Boulet, A. Darte, G.-A. Silber, and F. Vivien, “Loop Parallelization Algorithms:
From Parallelism Extraction to Code Generation,” Parallel Computing, vol. 24,
pp. 421–444, May 1998.
[61] J. M. Anderson, S. P. Amarasinghe, and M. S. Lam, “Data and Computation
Transformations for Multiprocessors,” in PPoPP’95 : Proceedings of the 5th ACM SIGPLAN
symposium on Principles and Practice of Parallel Programming, (New York, NY, USA),
pp. 166–178, ACM, 1995.
[62] M. Griebl, Automatic Parallelization of Loop Programs for Distributed Memory
Architectures. PhD thesis, Fakultät für Mathematik und Informatik, Universität Passau, 2004.
[63] R. Allen and K. Kennedy, “Automatic Translation of FORTRAN programs to Vector
Form,” ACM Transactions on Programming Languages and Systems, vol. 9, pp. 491–542,
Oct. 1987.
[64] J. R. Allen and K. Kennedy, “Automatic Loop Interchange,” in CC’84 : Proceedings of
the 1984 ACM SIGPLAN Symposium on Compiler Construction, (New York, NY, USA),
pp. 233–246, ACM, 1984.
[65] M. E. Wolf and M. S. Lam, “A Data Locality Optimizing Algorithm,” in PLDI
’91 : Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language
Design and Implementation, (New York, NY, USA), pp. 30–44, ACM, 1991.
[66] M. Wolfe, “More Iteration Space Tiling,” in SC ’89 : Proceedings of the 1989 ACM/IEEE
conference on Supercomputing, (New York, NY, USA), pp. 655–664, ACM, 1989.
[67] J. Xue, “On Tiling as a Loop Transformation,” Parallel Processing Letters, vol. 7, no. 4,
pp. 409–424, 1997.
[68] M. Griebl, P. Feautrier, and C. Lengauer, “Index Set Splitting,” International Journal
of Parallel Programming, vol. 28, pp. 607–631, Dec. 2000.
[69] N. Ahmed, N. Mateev, and K. Pingali, “Tiling imperfectly-nested loop nests,” in
SC ’00 : Proceedings of the 2000 ACM/IEEE conference on Supercomputing, (Washington,
DC, USA), IEEE Computer Society, 2000.
[70] K. Beyls, Software Methods to Improve Data Locality and Cache Behavior. PhD thesis,
Ghent University, 2004.
[71] M. Kandemir, J. Ramanujam, and A. Choudhary, “A Compiler Algorithm for Op-
timizing Locality in Loop Nests,” in SC ’97 : Proceedings of the 11th International
Conference on Supercomputing, (New York, NY, USA), pp. 269–276, ACM, 1997.
[72] A. W. Lim, Improving Parallelism and Data Locality with Affine Partitioning. PhD thesis,
Stanford University, 2001.
[73] U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and
P. Sadayappan, “Automatic Transformations for Communication-Minimized
Parallelization and Locality Optimization in the Polyhedral Model,” in CC ’08 :
Proceedings of the 17th International Conference on Compiler Construction, (Berlin, Heidelberg),
pp. 132–146, Springer-Verlag, 2008.
[74] U. Banerjee, “Unimodular Transformations of Double Loops,” Advances in Lan-
guages and Compilers for Parallel Processing, pp. 192–219, 1991.
[75] A. Darte, R. Schreiber, and G. Villard, “Lattice-based memory allocation,” IEEE
Transactions on Computing, vol. 54, pp. 1242–1257, Oct. 2005.
[76] C. Ancourt and F. Irigoin, “Scanning Polyhedra with DO loops,” in PPoPP ’91 : Pro-
ceedings of the 3rd ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, pp. 39–50, ACM, 1991.
[77] F. Quillere, S. Rajopadhye, and D. Wilde, “Generation of Eﬃcient Nested Loops
from Polyhedra,” International Journal of Parallel Programming, vol. 28, pp. 469–498,
2000.
[78] E. Ehrhart, Polynômes Arithmétiques et Méthode des Polyèdres en Combinatoire, vol. 35
of International Series of Numerical Mathematics. Birkhäuser Verlag, 1977.
[79] J. A. D. Loera, R. Hemmecke, J. Tauzer, and R. Yoshida, “Eﬀective Lattice Point
Counting in Rational Convex Polytopes,” Journal of Symbolic Computation, vol. 38,
no. 4, pp. 1273 – 1302, 2004.
[80] M. Brion, “Points Entiers dans Les Polyèdres Convexes,” Ann. Sci. École Norm. Sup.,
vol. 21, no. 4, pp. 653–663, 1988.
[81] J. Lawrence, “Polytope Volume Computation,” Mathematics of Computation, vol. 57,
pp. 259–271, 1991.
[82] M. Beck, C. Haase, and F. Sottile, “Formulas of Brion, Lawrence, and Varchenko
on Rational Generating Functions for Cones,” The Mathematical Intelligencer, vol. 31,
pp. 9–17, 2009.
[83] S. Robins and M. Beck, Computing the Continuous Discretely. New York, NY, USA:
Springer, 2007.
[84] S. Verdoolaege, “Barvinok : User Guide.”
http://www.kotnet.org/~skimo/barvinok/barvinok.pdf, December 2011.
[85] Altera, “Altera Quartus II Handbook.” http://www.altera.com/literature/
lit-qts.jsp, June 2012.
[86] W. Luk, “Pipelining and Transposing Heterogenous Array Designs,” in ASAP
’91 : Proceedings of the International Conference on Application Specific Array Proces-
sors, 1991., pp. 263–277, Sept 1991.
[87] R. E. Gomory, “Outline of an Algorithm for Integer Solutions to Linear Program,”
Bulletin of the American Mathematical Society, vol. 64, pp. 275–278, September 1958.
[88] T. Drane and G. Constantinides, “Correctly rounded constant integer division via
multiply-add,” in ISCAS ’12 : 2012 IEEE International Symposium on Circuits and
Systems, IEEE, May 2012.
[89] F. de Dinechin, “Multiplication by Rational Constants,” Circuits and Systems II:
Express Briefs, IEEE Transactions on, vol. 59, pp. 98–102, Feb. 2012.
[90] C. Bastoul, “Extracting Polyhedral Representation From High Level Languages,”
tech. rep., Paris-Sud University, 2008.
[91] Altera, “DDR2 and DDR3 SDRAM Controller with UniPHY User Guide.” http:
//www.altera.com/literature/hb/external-memory/emi_ddr3up_ug.pdf, June
2011.
[92] Altera, “Altera Stratix III Device Handbook.” http://www.altera.com/
literature/hb/stx3/stratix3_handbook.pdf, March 2011.
[93] C. Bastoul, “Code Generation in the Polyhedral Model is Easier Than You Think,”
in PACT ’13 : IEEE International Conference on Parallel Architecture and Compilation
Techniques, (Juan-les-Pins, France), pp. 7–16, 2004.
[94] H. P. Williams, “The Elimination of Integer Variables,” The Journal of the Operational
Research Society, vol. 43, no. 5, pp. 387–393, 1992.
[95] U. K. Banerjee, Loop Transformations for Restructuring Compilers: The Foundations.
Norwell, MA, USA: Kluwer Academic Publishers, 1993.
[96] L.-N. Pouchet, C. Bastoul, A. Cohen, and N. Vasilache, “Iterative Optimization in the
Polyhedral Model: Part I, One-Dimensional Time,” in Proceedings of the International
Symposium on Code Generation and Optimization, CGO ’07, (Washington, DC, USA),
pp. 144–156, IEEE Computer Society, 2007.
[97] M. E. Wolf and M. S. Lam, “A Data Locality Optimizing Algorithm,” in Proceedings
of the ACM SIGPLAN 1991 conference on Programming language design and implemen-
tation, PLDI ’91, (New York, NY, USA), pp. 30–44, ACM, 1991.
[98] IBM, “Introduction to CPLEX Optimization Studio.” http://www-01.ibm.com/
software/integration/optimization/cplex-optimizer/, June 2010.
[99] C. Ancourt and F. Irigoin, “Scanning polyhedra with DO loops,” in PPOPP ’91 :
Proceedings of the 3rd ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, (Williamsburg, United States), pp. 39–50, 1991.
[100] S. Skelboe, “Computation of Rational Interval Functions,” BIT Numerical Mathemat-
ics, vol. 14, pp. 87–95, 1974. 10.1007/BF01933121.
[101] P. A. Parrilo, “Semidefinite programming relaxations for semialgebraic problems,”
Math. Program., vol. 96, no. 2, pp. 293–320, 2003.
[102] P. A. Parrilo and B. Sturmfels, “Minimizing polynomial functions,” in Algorith-
mic and Quantitative Aspects of Real Algebraic Geometry in Mathematics and Computer
Science (S. Basu and L. González-Vega, eds.), pp. 83–100, American Mathematical
Society, 2001.
[103] J. Garloﬀ, C. Jansson, and A. P. Smith, “Lower Bound Functions for Polynomials,”
Journal of Computational and Applied Mathematics, vol. 157, no. 1, pp. 207 – 225, 2003.
[104] P. Clauss and I. Tchoupaeva, “A Symbolic Approach to Bernstein Expansion for Pro-
gram Analysis and Optimization,” in CC’04 : In proceedings on the 13th International
Conference on Compiler Construction, pp. 120–133, Springer, 2004.
[105] G. Farin, Curves and Surfaces for CAGD: a practical guide. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 5th ed., 2002.
[106] S. K. Lodha and R. Goldman, “A Unified Approach to Evaluation Algorithms for
Multivariate Polynomials,” Math. Comput., vol. 66, pp. 1521–1553, Oct. 1997.
A. Code Listings for Benchmark Examples
A.1. Matrix-Matrix Multiply
int A[50][50];
int B[50][50];
int C[50][50];
#pragma scop
int x1, x2, x3;
for (x1 = 0; x1 <= 49; x1++) {
  for (x2 = 0; x2 <= 49; x2++) {
    for (x3 = 0; x3 <= 49; x3++) {





A.2. Sobel Edge Detection
int Out[5985];
int In[6144];
int Kernel[3][3];
#pragma scop
int x1, x2, x3, x4;
for (x1 = 0; x1 < 94; x1++) {
  for (x2 = 0; x2 < 62; x2++) {
    for (x3 = 0; x3 <= 2; x3++) {
      for (x4 = 0; x4 <= 2; x4++) {
        Out[x1 - 1][x2 - 1] = Out[x1 - 1][x2 - 1] + \







int X[72];
int B[72];
int A[5184]; // An upper triangular matrix
int x1, x2;
#pragma scop
for (x1 = 0; x1 <= 72; x1++) {
  for (x2 = 0; x2 <= 71 - x1; x2++) {
    X[-x1 + 72] = B[-x1 + 71] / A[-73 * x1 + 5183];





A.4. Blocked Gaussian Back-Substitution
int X[72];
int B[72];
int A[5184]; // An upper triangular matrix
int x1, x2, x3;
#pragma scop
for (x1 = 0; x1 <= 3; x1++) {
  for (x2 = 0; x2 < 18; x2++) {
    for (x3 = 0; x3 < 71 - 18 * x1 - x2; x3++) {
      X[-18 * x1 - x2 + 71] = B[-18 * x1 - x2 + 71] / \
        A[-73 * 18 * x1 + -73 * x2 + 5183];
      B[x3] = B[x3] - A[72 * x3 - 18 * x1 - x2 + 71] * \





B. Evaluating Generating Functions
Here we provide an example of how to evaluate a generating function of the form shown in (B.1).














Figure B.1.: Example polyhedron showing seven integer points.
We consider the polytope shown in Figure B.1, which has three vertices, $v_1 = (0, 0)$, $v_2 = (3, 3)$
and $v_3 = (3, 1)$. The polytope has the generating function shown in (B.2).





































$(1 - z_1^{-1} z_2^{-1})(1 - z_2^{-1})$
where $K_2$ is the tangent cone at $v_2$. $K_{11}$, $K_{12}$ and $K_{13}$ are the unimodular cones obtained by the
decomposition of the tangent cone at $v_1$ using Barvinok’s Decomposition [39]. $K_{31}$, $K_{32}$ and $K_{33}$
are the unimodular cones obtained by the decomposition of the tangent cone at $v_3$ using Barvinok’s
Decomposition. We want to evaluate at the point $z = [1, 1]$, but there is a pole in the denominator at
that point.
Instead, we select $\lambda = (\lambda_1, \ldots, \lambda_d)$ and substitute $z_i = t^{\lambda_i}$ to form a univariate rational
function. We select $\lambda$ such that its dot product with each of the cone generator vectors is non-zero.
In our case, we choose $\lambda = (1, 1)$ and substitute $z_1 = t$ and $z_2 = t$ into (B.2), which gives (B.3).
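\[
\begin{aligned}
\#(P; t) ={}& \frac{1}{(1 - t^{2})(1 - t)} - \frac{1}{(1 - t^{3})(1 - t)} + \frac{1}{(1 - t^{3})(1 - t^{4})} \\
&+ \frac{t^{4}}{(1 - t^{-2})(1 - t)} - \frac{t^{4}}{(1 - t^{-2})(1 - t^{-1})} \\
&+ \frac{t^{4}}{(1 - t^{-1})(1 - t^{-4})} + \frac{t^{6}}{(1 - t^{-2})(1 - t^{-1})}
\end{aligned}
\tag{B.3}
\]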
Multiplying each term in (B.3), top and bottom, by a power of t to remove negative exponents from
the denominators gives (B.4).
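\[
\begin{aligned}
\#(P; t) ={}& \frac{1}{(1 - t^{2})(1 - t)} - \frac{1}{(1 - t^{3})(1 - t)} + \frac{1}{(1 - t^{3})(1 - t^{4})} \\
&- \frac{t^{6}}{(1 - t^{2})(1 - t)} - \frac{t^{7}}{(1 - t^{2})(1 - t)} \\
&+ \frac{t^{9}}{(1 - t)(1 - t^{4})} + \frac{t^{9}}{(1 - t^{2})(1 - t)}
\end{aligned}
\tag{B.4}
\]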
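As a sanity check on (B.4), the sum of its seven rational terms can be compared numerically with the direct sum of $t^{x+y}$ over the integer points of the polytope, away from the pole at t = 1. The sketch below assumes the seven points are those listed in the comment (obtained by enumerating the triangle with the three vertices given above); the function names are hypothetical.

```c
/* Sum of the seven rational terms of (B.4). Valid wherever no
 * denominator vanishes, in particular for t != 1. */
double gf_rational(double t)
{
    double t2 = t * t, t3 = t2 * t, t4 = t3 * t;
    double t6 = t3 * t3, t7 = t6 * t, t9 = t7 * t2;
    double a = 1.0 - t, b = 1.0 - t2, c = 1.0 - t3, d = 1.0 - t4;

    return 1.0 / (b * a) - 1.0 / (c * a) + 1.0 / (c * d)
         - t6 / (b * a) - t7 / (b * a)
         + t9 / (a * d) + t9 / (b * a);
}

/* Direct sum of t^(x+y) over the integer points
 * (0,0), (1,1), (2,1), (2,2), (3,1), (3,2), (3,3). */
double gf_direct(double t)
{
    double t2 = t * t, t3 = t2 * t, t4 = t3 * t;
    return 1.0 + t2 + t3 + 2.0 * t4 + t4 * t + t3 * t3;
}
```

The two functions agree at any sample point other than t = 1, for instance both give 141 at t = 2, and `gf_direct(1.0)` gives the point count directly. The series expansion that follows extracts that same value from the rational form, which cannot be evaluated at t = 1 term by term.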
If we substitute t = s + 1 into (B.4), we get (B.5).
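\[
\begin{aligned}
\#(P; t) ={}& \frac{1}{(1 - (s + 1)^{2})(-s)} - \frac{1}{(1 - (s + 1)^{3})(-s)} + \frac{1}{(1 - (s + 1)^{3})(1 - (s + 1)^{4})} \\
&- \frac{(s + 1)^{6}}{(1 - (s + 1)^{2})(-s)} - \frac{(s + 1)^{7}}{(1 - (s + 1)^{2})(-s)} \\
&+ \frac{(s + 1)^{9}}{(-s)(1 - (s + 1)^{4})} + \frac{(s + 1)^{9}}{(1 - (s + 1)^{2})(-s)}
\end{aligned}
\tag{B.5}
\]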
If we then perform a series expansion of each term in (B.5) around s = 0 and gather together the
constant terms for each of the cones, we arrive at (B.6), which correctly indicates that there are 7
integer points within the polytope.
\[
\#(P; t) = \frac{1}{8} - \frac{2}{9} + \frac{41}{144} - \frac{49}{8} - \frac{71}{8} + \frac{95}{16} + \frac{127}{8} = 7
\tag{B.6}
\]
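The result can be cross-checked in two independent ways: by brute-force enumeration of the integer points of the triangle, and by summing the seven constant terms exactly over their common denominator of 144. The sketch below assumes the triangle is described by the inequalities $x \le 3$, $y \le x$ and $3y \ge x$, which follow from the three vertices given above; the function names are hypothetical.

```c
/* Count integer points (x, y) in the triangle with vertices (0,0),
 * (3,3) and (3,1), i.e. those with x <= 3, y <= x and 3*y >= x. */
int count_points(void)
{
    int n = 0;
    for (int x = 0; x <= 3; x++)
        for (int y = 0; y <= 3; y++)
            if (y <= x && 3 * y >= x)
                n++;
    return n;
}

/* Sum the seven constant terms of (B.6), scaled by 144 so that the
 * arithmetic stays exact in integers. */
int constant_terms_times_144(void)
{
    static const int num[7] = {1, -2, 41, -49, -71, 95, 127};
    static const int den[7] = {8, 9, 144, 8, 8, 16, 8};
    int total = 0;
    for (int i = 0; i < 7; i++)
        total += num[i] * (144 / den[i]);
    return total;
}
```

Both checks agree with (B.6): the enumeration finds 7 points, and the scaled sum is 1008, which is 7 × 144.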
