We argue that FPGAs, more than two decades after they began to be used for computational purposes, have become one of the key hopes for extending the performance of computational systems in the era characterised by the end of Dennard scaling. We believe that programmability of future heterogeneous computing platforms has brought a new urgency to bear on several old problems in high-level synthesis for FPGAs. Our focus is on the two areas we believe are most underdeveloped in today's high-level synthesis software: effective utilisation of the numerical flexibility afforded by high-level correctness specifications, and application-specific memory subsystem synthesis. We conclude with our perspective on the likely future evolution of the field.
A Selective Context
We present below a necessarily rather narrow view of the evolution of the FPGA and the microprocessor, highlighting the interaction between the two and the major external drivers.
The Field-Programmable Gate Array (FPGA) was invented in the 1980s, but developed and matured in the 1990s. Already in the early 1990s, academic conferences started to appear that were largely dedicated to the potential these devices had to implement computation, such as the first FPL, held in 1991 in Oxford. However, the nature of such devices has transformed over recent years. Initially FPGAs were largely homogeneous architectures, consisting of a large number of very fine-grain logic cells. Responding to the nature of the new application areas, manufacturers evolved the FPGA architecture, first incorporating larger RAM blocks 1 and then dedicated multiplication logic. 2 Today, modern FPGA architectures are highly heterogeneous devices, containing logic cells, embedded RAM and DSP functionality, high speed transceiver circuitry and microprocessors. It is worth noting that the majority of these components, present as hard IP within an FPGA, could be implemented using lookup table functionality. However, the result would be too large, too slow, or too power-hungry to be worthwhile. 3 Thus through one lens we can see the evolution of the FPGA in recent years as a conscious decision to move away from having all area devoted to simple fine-grain units and their fine-grain interconnect, towards the specialisation of circuitry to perform certain common tasks or classes of tasks.
In a sense, the evolution of the general purpose processor has mirrored the evolution of the FPGA over the same timeframe. Traditional, latency-driven computer architecture largely consisted of utilising all available silicon in order to keep a single computational unit (or a small number of them) as busy as possible. This resulted in a very large amount of silicon and power consumption devoted to caching, in particular, as well as various micro-architectural innovations to avoid latency-consuming pipeline stalls. 4 These processors have formed the core of general purpose computer design for several decades. The most significant innovation to arise as a result has been the GPGPU, which has delivered major performance improvements in certain domains by explicitly abandoning some of the received wisdom of computer architecture, a process referred to by Bill Dally as 'the end of denial architecture'. 5 GPGPU computing achieves its performance by using an explicitly software-managed memory hierarchy, returning hardware to computation, and using an abundance of threads to hide pipeline stalls. We may therefore view the evolution of the microprocessor as a conscious decision to move away from one complex unit towards dedicating areas to a large number of much simpler units and their interconnect; in a certain sense this is a mirror of the evolution of the FPGA.
It is no accident that the co-evolution of FPGAs and microprocessors is now reaching the point of blurred boundaries. On the microprocessor side of the picture, this has largely been driven by the recent failure of Dennard scaling. Dennard scaling 6 provided a road map of how to scale various parameters under manufacturer control, such as supply voltage, in response to the geometric scaling of VLSI given by each processor generation. The recent deviation from Dennard scaling, largely driven by power consumption concerns, 7 has forced the general purpose processor industry to look beyond clock frequency as the driver for performance. While high performance for throughput-dominated applications with embarrassing levels of parallelism can be achieved in a direct way using the GPGPU approach, latency constraints and algorithm bottlenecks mandate a more heterogeneous approach. 8 On the FPGA side of the picture, silicon and power inefficiencies combined with new market opportunities have driven the evolution of FPGA architecture to its present heterogeneous state.
The future of manycore computing using traditional, but simple, microprocessor cores is not rosy. In a landmark paper, Berger 9 et al. make a detailed study of the power/area and power/performance tradeoffs available across a spectrum of processor designs. Using predictions of future technologies, even with extremely parallel workloads, the performance to be gained by using more, simpler, traditional processors is bounded from above by a factor of only about 4-8x over the next decade, far slower than the historical trends. This is largely due to power consumption limitations, resulting in the spectre of dark silicon: transistors that may be present on a device but cannot be powered on simultaneously without exceeding the power budget. The conclusion is clear: to move beyond such limits, it becomes necessary to improve the Pareto tradeoff itself, rather than simply move towards more, simpler, processors on the Pareto front. The only clear way to do this is through circuit specialisation: creating parts of a processor that are specialised to particular commonly occurring tasks, avoiding the energy inefficiency in using general purpose architectures for these tasks. This is exactly the area where the reconfigurable computing community has a head start, and can provide direction to the general purpose architecture community.
It seems inevitable that future computer architecture will therefore be programmable, contain elements of application-specific or domain-specific architecture, and be highly heterogeneous in nature. The major challenge is how to efficiently and effectively compile applications onto such platforms. This is a challenge that must be overcome, but one that is no longer faced by the FPGA community alone, as was often the case in the early efforts of high-level synthesis for reconfigurable computing. The coming industrial turn towards heterogeneous parallel computing opens many doors.
The Promise and the Challenges
High level synthesis for reconfigurable computing has made great strides recently. The Autopilot tool 10 now included within the Vivado design suite is a high quality high-level synthesis environment, using C as the input language. Academic efforts such as LegUp 11 also point to a promising future for FPGA-based high level design. However, existing solutions for high-level synthesis do not, in our opinion, adequately address memory systems. It should not be up to the programmer to explicitly manage the transfer of data between external memory of various types (SDRAM, SRAM, etc.) and on-chip memory. Equally, we should not squander the potential of FPGA architectures by aping general purpose microprocessor cache schemes within an FPGA. In our view, it is high time that tools for customisation of computational circuitry were matched by aggressive tools for customisation of memory subsystem design. By pushing the complexity into the synthesis tool, we believe that significant performance advantages can be obtained without the area or energy overhead of caching schemes. This is the topic we consider in Section 1.4. We note that the degree of predictability of memory accesses largely defines the potential improvement possible by a customised memory system and that often memory accesses are very predictable in nature, especially for embedded applications, which we believe form the key driver for next-generation computer architecture.
The other area that is poorly covered by existing high-level design flows is the automation of the selection of numerical representation and precision. The designer of any hardware accelerator for a numerically-intensive algorithm knows that this is one of the areas where customised logic can result in huge performance gains, and will naturally ask 'should I use floating point, fixed point, or some more esoteric number system to perform this task', 'how precise do my internal results really need to be', etc.
These questions remain largely unautomated. As a result, designers will again often ape the systems used in general purpose processor designs, such as IEEE standard floating-point arithmetic as the 'gold standard' of real number representation. There are two problems with this approach. Firstly, it does not work: typically a designer will want to perform operations in a different order to that expressed in the original code, in order to improve hardware efficiency, for example by applying the associative law to regroup addition into a tree structure:
$((a + b) + c) + d = (a + b) + (c + d)$
, a law that holds for real numbers but does not hold for floating-point arithmetic, thus raising questions of correctness. Usually, whether formalised or not, there will be some notion of an acceptable numerical result, which can be used to drive such decisions; indeed, without such a notion, it becomes impossible to demonstrate that the behaviour of even the original source code is acceptable. We strongly advocate the formalisation of such specifications. This leads us onto the second problem: the designer operates with 'one hand tied behind her back' by being forced to replicate the hardware structures present in general purpose processors, which may be grossly inefficient for the problem at hand. Once a formal specification of numerical correctness is available, the designer, and the synthesis tool, should be free to produce any hardware structure meeting that specification, playing to the advantages of the underlying architecture. Thus the same algorithmic specification may map automatically to a mixed-precision implementation in a GPU, a double precision implementation in a CPU, and a fixed-point implementation in an FPGA. No two of these implementations may produce the same bit pattern at their outputs, but all should be verifiable with respect to the formal correctness criteria. This is the topic we consider in Section 1.3. We note that, while such freedom can be exploited in all numerical applications, the degree of freedom is particularly great in embedded applications, where specifications of correctness tend to be expressible at very high levels of abstraction, leaving lots of freedom for an advanced design tool to explore, e.g. a controller for an aircraft might mandate stability and minimisation of fuel consumption of the aircraft; 12 a much higher level of abstraction than bit-level equivalence to a golden C model!
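The failure of associativity mentioned above is easy to observe in IEEE double precision. The following is a minimal sketch; the three operand values are our own choice for illustration and do not come from any benchmark.

```python
# Floating-point addition is not associative: regrouping can change the result.
# The operands are illustrative values chosen so that 1.0 is smaller than the
# spacing between adjacent doubles near 1e16.
a, b, c = 1e16, -1e16, 1.0

left_to_right = (a + b) + c   # the large terms cancel first, then 1.0 is added
regrouped = a + (b + c)       # 1.0 is rounded away when added to -1e16

print(left_to_right, regrouped)
```

Both groupings are equal over the reals, so a tool that regroups additions needs a correctness criterion other than bit-level equality with the original ordering.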
Numerical Behaviour
When creating digital hardware architectures, one must first select a finite precision number system to represent numerical data. Since this number system can only represent a subset of real numbers, rounding will often occur after an arithmetic operation so as to represent values using the chosen number system. Whilst the error introduced by the rounding of any single value may be small, over the course of an algorithm the accumulation of these errors can cause a significant deviation from the desired result.
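As a small illustration of how individually tiny rounding errors accumulate, the sketch below compares a running double precision sum against an exact rational sum of the same values. The function name `accumulated_error` and the choice of 0.1 as the summand are ours, purely for illustration.

```python
from fractions import Fraction

def accumulated_error(value: float, count: int) -> Fraction:
    """Exact gap between a running float sum and the exact sum of the same float."""
    running = 0.0
    for _ in range(count):
        running += value                  # rounded after every addition
    exact = Fraction(value) * count       # rational arithmetic, no rounding
    return abs(Fraction(running) - exact)

single_step = accumulated_error(0.1, 1)          # one addition to zero is exact
many_steps = accumulated_error(0.1, 1_000_000)   # a million roundings add up

print(float(single_step), float(many_steps))
```

Note that the comparison is against the exact sum of the stored double nearest to 0.1, so the gap reported is due to rounding during accumulation only, not to the initial conversion of 0.1.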
A simple tactic to minimise this error would be to err on the side of safety and select a number system that has much greater precision than necessary to obtain the desired quality of output, if such a precision can be determined. However, this will come at a substantial cost in terms of performance. As an example, recent figures for the difference in performance, in terms of peak theoretical FLOPs, between single and double precision is approximately a factor of 2 to 3 for a CPU 13 or 24 for a GPU. 14 Since arithmetic computation forms the heart of many high-performance digital systems, if we are to create efficient hardware accelerators, we first need to select number systems with the minimum precision necessary to guarantee that our design criteria are met. Unlike CPUs and GPUs, FPGAs offer the freedom to fully customise the precision used throughout an accelerator. As a result, development of techniques to select an optimised number system has been an extensive research topic for the FPGA community over the past decade. 15, 16 In this section, we first describe the state-of-the-art techniques that help us guarantee that a given number system for a hardware accelerator satisfies a numerical correctness criterion. We further discuss how these techniques can be enhanced so that they are applicable to a wide range of algorithms. Finally, we outline some of the future challenges for research in this field.
Bounding Numerical Errors
The most straightforward way to estimate the error of any hardware accelerator is through simulation; indeed, this is the main technique used by industry. Unfortunately, the size of the search space for the inputs will generally be too large to explore exhaustively; this means simulation may miss corner cases and under-allocate the number of bits for an accelerator. This is unacceptable in any safety-critical system, and in any case only works when there is a trusted, 'golden' reference model or method of certification available.
In contrast, analytical approaches provide guarantees that a design criterion will not be violated. Early analytical approaches were based on Interval Arithmetic (IA), 17 Affine Arithmetic (AA) 18 and LTI Theory. 15
Unfortunately, because IA and AA cannot find tight bounds on the worst case error, they will typically over-allocate bits for any nontrivial example. LTI theory is powerful enough to compute tight bounds, but it is restricted to the LTI domain and this does not include general multiplication, for example.
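The looseness of interval bounds can be seen on a one-variable example. The sketch below implements just enough interval arithmetic (subtraction and multiplication) to bound x - x*x for x in [0, 1]; the `Interval` class and the example expression are our own illustrative choices, not taken from the cited tools.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __sub__(self, other: "Interval") -> "Interval":
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

x = Interval(0.0, 1.0)

# IA treats every occurrence of x as an independent quantity in [0, 1].
ia_bound = x - x * x

# Sampling the real function x - x^2 on a grid shows its actual maximum.
sampled_max = max(t / 1000 - (t / 1000) ** 2 for t in range(1001))

print(ia_bound, sampled_max)
```

Interval arithmetic reports [-1, 1], whereas the function never leaves [0, 1/4] on this domain. Every extra bit allocated to cover the phantom range is wasted hardware.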
More recently, new approaches have been created which involve constructing polynomials to represent the worst-case range of intermediate variables throughout an algorithm. Through computing the lower ($\gamma_{lower}$) and upper ($\gamma_{upper}$) bounds of these polynomials, we can select a number system which prevents overflow. Furthermore, if we first construct a polynomial $\hat{p}$ representing the range of every intermediate variable in the presence of finite precision errors and a second polynomial $p$ representing the range in infinite precision, then the extrema of the function $|p - \hat{p}| / |p|$ represent the worst-case relative error introduced by the use of finite precision arithmetic.
To create these polynomials, we use standard models to represent finite precision errors. When using fixed point, provided there is no overflow, numerical errors are limited to one unit in the last place. If we choose an $\eta$-bit number system where the maximum value is $2^X$, the worst-case rounding error for any fixed point number $x$ is given by (1.1). It follows that the result of any scalar operation $\odot \in \{+, -, *, /\}$ is bounded as in (1.2). Similarly, for floating point, provided there is no overflow or underflow, for any real value $x$, the closest floating-point approximation $\hat{x}$ of $x$ can be expressed as in (1.3), where $\eta$ is the number of mantissa bits used. Once again, it follows that the floating-point result of any scalar operation $\odot \in \{+, -, *, /\}$ is bounded as in (1.4).
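The floating-point model can be checked empirically. The sketch below assumes the widely used textbook form of the model, in which the computed result equals the exact result multiplied by (1 + δ) with |δ| no larger than the unit roundoff; for IEEE double precision with round-to-nearest we take that to be 2^-53. This form, the constant and the operand range are our assumptions for illustration and may differ in detail from the numbered equations referred to above.

```python
import operator
import random
from fractions import Fraction

# Assumption: IEEE double precision, round-to-nearest, no overflow or underflow.
UNIT_ROUNDOFF = Fraction(1, 2 ** 53)

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def relative_error(op: str, a: float, b: float) -> Fraction:
    """Exact relative error of one rounded operation, using rational arithmetic."""
    exact = OPS[op](Fraction(a), Fraction(b))
    rounded = Fraction(OPS[op](a, b))
    if exact == 0:
        return Fraction(0)   # an exactly-zero result is computed without error
    return abs(rounded - exact) / abs(exact)

rng = random.Random(0)       # fixed seed so the experiment is repeatable
worst = max(relative_error(op, rng.uniform(1, 1000), rng.uniform(1, 1000))
            for op in OPS for _ in range(1000))

print(float(worst), float(UNIT_ROUNDOFF))
```

Sampling like this can only fail to find a counterexample; it is the analytical model, not the experiment, that provides the guarantee.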
Through applying these models of error to every computation in an algorithm, we can construct polynomials that represent the potential range of every intermediate variable. This is shown for a simple example in Table 1.1. 
While constructing these polynomials is straightforward, finding their extrema is computationally intractable. Instead, algorithms focus on finding a computationally tractable lower bound $\hat{\gamma}_{lower} \le \gamma_{lower}$ and upper bound $\hat{\gamma}_{upper} \ge \gamma_{upper}$. Ideally we wish to find bounds such that $\gamma_{lower} - \hat{\gamma}_{lower}$ and $\hat{\gamma}_{upper} - \gamma_{upper}$ are as small as possible.
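One of the latest and most powerful techniques to achieve this is based upon a result from real algebra discovered by Handelman. 19 This states that a polynomial $p$ that is strictly positive on a compact region defined by linear inequalities $g_i \ge 0$ has a Handelman representation of the form (1.5); conversely, any polynomial with such a representation is non-negative on that region.
$$p = \sum_{\alpha \in \mathbb{N}^n} c_\alpha \, g_1^{\alpha_1} g_2^{\alpha_2} \cdots g_n^{\alpha_n} \qquad (1.5)$$
where each $c_\alpha$ is a non-negative constant (only finitely many of which are non-zero), each $g_i \ge 0$ is one of the inequalities defining the region and $\mathbb{N}$ is the set of natural numbers.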
Using this result, we first re-write the bounds of a polynomial $\hat{\gamma}_{lower} \le p \le \hat{\gamma}_{upper}$ as two separate inequalities $p - \hat{\gamma}_{lower} \ge 0$ and $\hat{\gamma}_{upper} - p \ge 0$. If we can find a Handelman representation proving that each left-hand side is non-negative, then we have found lower and upper bounds of the polynomial. Heuristics which search for these representations have been shown to be able to compute much tighter bounds than IA or AA and enable us to create substantially smaller hardware.
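To make the idea of a Handelman representation concrete, the sketch below checks a hand-written certificate for a toy problem: bounding p(x) = x(1 - x) on [0, 1], where the region is described by g1 = x ≥ 0 and g2 = 1 - x ≥ 0. The polynomial, the candidate bound of 1/2 and the helper functions are our own illustrative choices; real tools search for such certificates automatically.

```python
from fractions import Fraction

def poly(*coeffs):
    """Univariate polynomial as a coefficient list, lowest degree first."""
    return [Fraction(c) for c in coeffs]

def mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def add(p, q):
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a + b for a, b in zip(p, q)]

def scale(c, p):
    return [Fraction(c) * a for a in p]

G1 = poly(0, 1)     # g1 = x      (non-negative on [0, 1])
G2 = poly(1, -1)    # g2 = 1 - x  (non-negative on [0, 1])
P = mul(G1, G2)     # p = x(1 - x); already a product of constraints, so p >= 0

half = Fraction(1, 2)
upper_gap = add(poly(half), scale(-1, P))        # 1/2 - p

# Candidate certificate: 1/2 * g1^2 + 1/2 * g2^2, all coefficients non-negative.
certificate = add(scale(half, mul(G1, G1)), scale(half, mul(G2, G2)))

print(upper_gap == certificate)
```

Because the two coefficient lists are identical, 1/2 - p is a non-negative combination of products of the constraints, which proves p ≤ 1/2 on [0, 1]. The true maximum of p on this interval is 1/4, so this particular certificate gives a valid but not tight bound.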
Can we apply these techniques to general code?
The techniques described in the previous section are powerful and have been shown to result in substantial performance improvements for some simple benchmarks. Unfortunately, the size of the polynomials can grow exponentially in the number of operations, meaning it would become too time consuming to be applicable to real benchmarks.
However, we can simplify large polynomials by replacing all terms that contribute little to the final result with a single term. Table 1.2 analyses a polynomial representing the range of a floating point addition of two variables. It calculates the worst-case range of every individual term in this polynomial. Clearly, several terms such as $100 y_1 \delta_1$, $100 x_1 \delta_1$ and $100 x_1 y_1 \delta_1$ will have little impact on the final bounds. As such, if we replace them with a single new bounded variable, we shrink our polynomial with little impact on the final bounds. This simplification technique enables the earlier bounding procedure to be applied to much larger algorithms consisting of straight-line code. However, many algorithms cannot be converted into straight-line code; algorithms often contain complex control structures such as 'while' loops. The challenge with these structures is that finite precision errors may cause a 'while' loop to fail to terminate. 16 Interestingly, these polynomial bounding procedures can also be useful in choosing sufficient precision to ensure that 'while' loops terminate.
One technique to prove program termination is based on the following steps:
(1) Construct a ranking function, 22 $f(x_1, \ldots, x_n)$, that maps every potential state within the loop to a positive real number. (2) Prove that for all potential values of the variables $x_1, \ldots, x_n$ within the loop body, when the ranking function is applied to the loop variables before and after the loop transition statements, it always decreases by at least some fixed amount $\epsilon > 0$, i.e. $f(x'_1, \ldots, x'_n) \le f(x_1, \ldots, x_n) - \epsilon$, where $x'_1, \ldots, x'_n$ denote the values of the variables after the transition.
If we note that proving a ranking function decreases ($f(x'_1, \ldots, x'_n) \le f(x_1, \ldots, x_n) - \epsilon$) can be re-written as a question of non-negativity ($0 \le f(x_1, \ldots, x_n) - f(x'_1, \ldots, x'_n) - \epsilon$), then we can apply the same techniques that prove non-negativity to prove termination in finite precision arithmetic. 16
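The effect of precision on termination can be seen with a simple counting loop of the form `while x < limit: x = x + step`. Over the reals, f(x) = limit - x is a valid ranking function for this loop. The sketch below checks the decrease condition with the state update performed in double precision; the loop, the function name `ranking_decreases` and the constants are our own illustrative choices.

```python
from fractions import Fraction

def ranking_decreases(x: float, step: float, limit: float, epsilon: float) -> bool:
    """Check f(x') <= f(x) - epsilon for f(x) = limit - x and x' = x + step.

    The transition x' = x + step is evaluated in double precision; the
    ranking function itself is evaluated exactly with rational arithmetic.
    """
    x_next = x + step                                  # rounded loop transition
    f_before = Fraction(limit) - Fraction(x)
    f_after = Fraction(limit) - Fraction(x_next)
    return f_after <= f_before - Fraction(epsilon)

LIMIT = 2.0 ** 54

small_state = ranking_decreases(1.0, 1.0, LIMIT, 0.5)        # update is exact
large_state = ranking_decreases(2.0 ** 53, 1.0, LIMIT, 0.5)  # update is rounded away

print(small_state, large_state)
```

At x = 2^53 the addition of 1.0 is lost to rounding, the state does not change, and the ranking function no longer decreases: the loop that terminates over the reals would spin forever in double precision. A wider mantissa, or a larger step, restores the decrease condition.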
Next Steps
The techniques described in this section only touch the surface of research into automatically selecting the minimum precision necessary to meet design criteria. However, crucially they offer substantial progress in answering the following question: given a hardware architecture and word-length specification, will my design satisfy the specification? This will enable further research in this field; this includes delving deeper into techniques to assign the word-length for each individual operator in a large datapath and minimise the total area consumption, 23, 24 studying the links between how the order of operations in a hardware datapath can affect the error seen at the output and exploring the relationship between numerical precision and termination of iterative algorithms. Research into numerical behaviour has entered exciting times.
Memory Systems
In the preceding sections, we described how high-level specification of numerical accuracy can enable us to make more efficient use of silicon area. This in turn enables better performance where the number of parallel processing units can be increased, provided those units can be efficiently fed with data.
The most area-efficient technologies in common use for implementing commodity memory today (DRAM and Flash) have optimal process parameters that conflict with those needed to build fast logic. Wherever applications require large amounts of memory, that memory is implemented using a separate memory die. However, because device pin-density and off-chip switching frequencies have not scaled as rapidly as the exponential growth of transistors dedicated to logic datapath implementation, external memory bandwidth has increasingly become a performance bottleneck.
So it is critical to ensure that limited off-chip memory bandwidth is used efficiently. Herein lies a second challenge. DRAM memory structure is arranged in banks, rows and columns. Each row must be 'activated' before data held in columns within that row can be read or written. The row must then be 'precharged' before data in another row can be accessed. Timing parameters determined by physical DRAM memory array architecture constrain the minimum time between successive row activations. Over time, increasing memory clock frequencies mean that there is an ever larger penalty paid for random access to DRAM memory. In practical terms, this means there is a greater than 10× performance difference between the worst case and best case memory bandwidth obtained through different memory address sequences.
This has made it essential to develop memory subsystems which exploit the locality of memory accesses to provide the illusion of fast access to large amounts of memory. In a CPU, caches and dynamic memory controllers buffer and reorder memory requests to help ensure this happens. They typically must assume no prior knowledge of the sequence of memory requests from the datapath. Furthermore, CPUs implement non-deterministic bus interfaces which make memory performance difficult to analyse.
Where a memory system is implemented in reconfigurable hardware, it can be customised for a specific application. Three key benefits can then be realised:
(1) fine-grained on-chip memories provide a very large on-chip memory bandwidth to a customised datapath, (2) data buffered in those memories can be reused, reducing off-chip memory bandwidth requirements, and (3) off-chip memory requests can be reordered to make the most efficient use of limited bandwidth.
The most memory intensive parts of a program tend to be in loops, so we target nested loops in our work. 25 Static analysis to model the sequence of memory accesses which occur in nested loops can be done using the Polyhedral Model. 26, 27 From this analysis, automated tools allow us to synthesise a high performance application-specific memory system. In Section 1.4.1, we provide a brief overview of the Polyhedral Model. Section 1.4.2 describes a way in which this model can be used to build high performance application-specific memory systems.
What is the Polyhedral Model?
The Polyhedral Model represents a set of loop iterations as those integer vectors which satisfy a finite set of affine inequalities. The code in Figure 1.1 shows a two level nested loop. The set of loop iterations is described by upper and lower bounds which are affine expressions of the surrounding loop variables ($x_1$ and $x_2$). The iterations of an $n$-level loop nest can be described implicitly as an integer set $\{x \in \mathbb{Z}^n \mid Ax \le b\}$ where $A$ is a $2n \times n$ integer matrix, $b$ is a $2n$ integer column vector and the vector inequality is interpreted as $x \le y$ iff $x_i \le y_i$ for all $i$. These iterations can be scheduled according to a linear mapping function which determines a partial ordering of those iterations. For the example given in Figure 1.1, a mapping
Code that fits into this form is common in video processing and dense linear-algebra applications. Exact dependence analysis for code which can be described in this way is often tractable using integer linear programming techniques. 28 The Polyhedral Model gives us a formal mathematical representation of the sequence of memory addresses accessed in the program. In Section 1.4.2, we show how transformations applied to that formal representation can help build a high performance memory system.
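The integer-set description of a loop nest can be illustrated on a small triangular nest. The sketch below uses a hypothetical two-level loop (not the one in Figure 1.1) in which the inner bound depends on the outer variable, writes its bounds as a 2n × n matrix A and vector b, and checks that the integer points satisfying Ax ≤ b are exactly the iterations the loops execute.

```python
from itertools import product

N = 4  # hypothetical loop bound, chosen for illustration

# Loop nest:  for x1 in 0..N-1:  for x2 in 0..x1:  body(x1, x2)
# Each bound becomes one row of A x <= b (two rows per loop level).
A = [[-1, 0],    # -x1      <= 0      (x1 >= 0)
     [1, 0],     #  x1      <= N - 1
     [0, -1],    #      -x2 <= 0      (x2 >= 0)
     [-1, 1]]    # -x1 + x2 <= 0      (x2 <= x1)
b = [0, N - 1, 0, 0]

def in_polyhedron(x):
    """True when the integer vector x satisfies every row of A x <= b."""
    return all(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
               for row, b_i in zip(A, b))

# Scan a box slightly larger than the iteration space and keep feasible points.
from_inequalities = [x for x in product(range(-1, N + 1), repeat=2)
                     if in_polyhedron(x)]

# The same iterations, produced by running the loops directly.
from_loops = [(x1, x2) for x1 in range(N) for x2 in range(x1 + 1)]

print(from_inequalities == from_loops, len(from_loops))
```

Working with A and b rather than with the loop text is what lets a tool transform and analyse the iteration space algebraically.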
Building high performance memory systems
In the preceding section, we showed how we could formally characterise the memory access requirements within a nested-loop structure. We can use this information to decouple the off-chip memory accesses from datapath logic using on-chip memory buffers. If we can transform code so that data is reused from the on-chip memory buffer, we can reduce the number of accesses to off-chip memory.
We can represent the specific 'row' and 'burst' accessed in each memory request by adding new dimensions to the loop-nest representation. If the size of each DRAM row is $R$ words, the row accessed by memory address $fx + h$ is given by $r = (fx + h) \,\mathrm{div}\, R = \lfloor (fx + h)/R \rfloor$, where $\lfloor \cdot \rfloor$ represents the floor function. The columns within each row can be represented as non-overlapping bursts to take advantage of the multi-word burst accesses supported by modern memory devices. These can be represented by $u = \lfloor (fx + h - rR)/B \rfloor$, where a burst is $B$ words long.
While neither of these is directly amenable to linear algebraic representation, we may note that from the properties of the floor function:
$$r \le \frac{fx + h}{R} < r + 1 \qquad (1.6)$$
and
$$u \le \frac{fx + h - rR}{B} < u + 1 \qquad (1.7)$$
We can rewrite (1.6) and (1.7) as linear inequalities as shown below in (1.8) and (1.9), without loss of information.
$$rR \le fx + h \le rR + R - 1 \qquad (1.8)$$
$$uB \le fx + h - rR \le uB + B - 1 \qquad (1.9)$$
We can then add these four extra inequalities to those already present defining the loop bounds. This forms, for each memory reference, an augmented system of linear inequalities that completely capture not only the iteration space but also the specific SDRAM rows and bursts accessed within the innermost loop.
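The claim that the floor definitions of the row and burst can be replaced by linear inequalities without loss of information can be checked by brute force. The sketch below assumes one natural integer form of those inequalities, namely rR ≤ a ≤ rR + R - 1 and uB ≤ a - rR ≤ uB + B - 1 for an integer address a; the sizes R and B and all function names are hypothetical.

```python
R, B = 16, 4        # hypothetical row and burst sizes in words (B divides R)
NUM_ROWS = 8        # hypothetical number of DRAM rows covered by the test

def row_and_burst(address: int):
    """Row r and burst u of an address, using the floor-based definitions."""
    r = address // R
    u = (address - r * R) // B
    return r, u

def satisfies_linear_form(address: int, r: int, u: int) -> bool:
    """The four linear inequalities assumed to be equivalent to the floors."""
    return (r * R <= address <= r * R + R - 1 and
            u * B <= address - r * R <= u * B + B - 1)

def solutions(address: int):
    """All (r, u) pairs in range that satisfy the linear inequalities."""
    return [(r, u) for r in range(NUM_ROWS) for u in range(R // B)
            if satisfies_linear_form(address, r, u)]

# For every address, the inequalities admit exactly one (r, u): the floor values.
all_match = all(solutions(a) == [row_and_burst(a)] for a in range(NUM_ROWS * R))

print(all_match)
```

Since each address admits exactly one solution, adding r and u as extra dimensions does not enlarge the set of iterations; it only annotates each iteration with the row and burst it touches.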
Using standard unimodular loop transformations, we can transform this augmented polyhedral representation to expose those occasions where data items are reused by multiple loop iterations. After transformation, those redundant dimensions which only represent data reuse can be projected out of the resulting polyhedral representation to produce code which fetches each memory item only once from off-chip memory. From this representation, we can use standard loop reordering transformations to move the 'row' dimension to the outermost level of the loop nest, improving data locality. When this technique is applied to code, it can significantly improve interface bandwidth efficiency. We show this in Figure 1.3 for three benchmarks (Matrix-Matrix-Multiply, Sobel Filter and Gaussian Backsubstitution) parameterised with reuse buffers inserted at different levels of the loop nest. The insertion of the buffer at the outermost level of the loop nest (t=1) allows reordering of all memory accesses and means less than 10% of memory access cycles are spent idle whilst DRAM rows are swapped, compared with >75% in the original code. The different levels of parameterisation allow a trade-off between performance and the amount of on-chip memory dedicated to data buffering.
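The benefit of moving the 'row' dimension outermost can be illustrated with a toy model that simply counts how often consecutive accesses fall in different DRAM rows. The sketch below is a plain loop interchange on a hypothetical row-major array, not the full transformation flow described above; the array shape, the row size and the function names are our own assumptions, and a change of row stands in for a precharge/activate penalty.

```python
R = 16                    # hypothetical DRAM row size in words
HEIGHT, WIDTH = 8, 16     # hypothetical array shape, stored row-major

def dram_row(i: int, j: int) -> int:
    """DRAM row holding array element (i, j)."""
    return (i * WIDTH + j) // R

def row_switches(sequence) -> int:
    """Number of consecutive access pairs that touch different DRAM rows."""
    rows = [dram_row(i, j) for i, j in sequence]
    return sum(1 for prev, cur in zip(rows, rows[1:]) if prev != cur)

# Original order: walk down each column, so every access changes DRAM row.
column_order = [(i, j) for j in range(WIDTH) for i in range(HEIGHT)]

# Interchanged order: the dimension that selects the DRAM row is outermost.
row_order = [(i, j) for i in range(HEIGHT) for j in range(WIDTH)]

print(row_switches(column_order), row_switches(row_order))
```

Both orders touch the same 128 words, but the interchanged order changes row only when it has finished with the current one. The interchange is legal here only because the toy loop body has no dependences; in general the exact dependence analysis provided by the Polyhedral Model decides which reorderings are safe.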
What might this enable us to do in the future?
Looking beyond our existing work, the formal model of memory access provided by the Polyhedral Model is a promising representation for enabling other application-specific memory transformations. One emerging area of research is the exploration of how the Polyhedral Model enables the overlapping of off-chip memory operations with on-chip computation. This work makes use of mathematical advances 29 which allow us to count the exact number of integer points contained within a polyhedron without enumerating them.
Knowledge of the exact lifetime of variables fetched into on-chip memory can enable more compact mapping of those variables into limited on-chip memory. Exploratory work on how to better utilise the multiple independent banks within a DRAM also seems like a promising direction, allowing us to further improve the efficiency of off-chip memory accesses.
Exact dependency analysis allows auto-parallelisation, but to support this, we need to ensure that enough on-chip memory ports are available to avoid contention. Emerging automatic array partitioning techniques 30 ensure that contention for on-chip memory ports is minimised. This allows efficient use of the large on-chip bandwidth provided by block RAM resources which are ubiquitous in modern heterogeneous FPGAs.
The key theme is that there are significant opportunities opened up by expanding our synthesis tools to target complete reconfigurable systems including off-chip memory. The formal representation of memory access sequences provided by the Polyhedral Model allows tools to automatically produce efficient application-specific hardware with tailor-made memory systems.
Conclusion
Our view is that many problems in high level design automation, once of concern to the small group of pioneers of reconfigurable computing, now arise in various guises in the much broader setting of computing generally, and embedded computing in particular. While high level synthesis and compilation tools have progressed significantly over the past decade, we believe that there are two very significant gaps in existing tool flows: customisation of memory systems and auto-generation of finite precision arithmetic implementations. We have described our own approaches to these central problems. Our view is that the FPGA computing community is poised to play a central role in the evolution of computer architecture and compilers over the next decade. We must take up the baton.
