This methodology paper addresses high-performance highproductivity programming on spatial architectures. Spatial architectures are efficient for executing dataflow algorithms, yet for high-performance programming, the productivity is low and verification is painful.
Introduction
In this paper, we address high-performance highproductivity programming on spatial architectures. A spatial architecture is composed of (many) distributed hardware resources, including memory blocks, arithmetic/logical elements and their interconnections. The arithmetic/logical elements execute whenever their input data are available. This assumption covers a wide spectrum of spatial architectures, from a fine-grain Field-Programmable Gate Array (FPGA) [2] to a Coarse-Grain Reconfigurable Architecture (CGRA) [3~7, 25] .
In contrast with a temporal architecture (i.e. Von-Neumann architecture like CPUs or GPUs) that uses a global instruction pointer to fetch instructions from memory and then executes the instructions through a fixed pipeline, a spatial architecture has no instruction pointer or instruction fetch. Instead, instructions are directly synthesized into pipelines. This specializes the spatial architecture to match a specific computation, and thus presents better power-efficiency advantages over general purpose CPUs or GPUs.
Spatial architectures are usually used as special-purpose accelerator devices for dataflow algorithms. Dataflow algorithms are driven by data availability, which enables massive parallelism for high performance.
Performance and productivity, however, are conflicting goals. Table 1 describes several high-profile workloads that are representative of various domains and compute patterns. They include SGEMM (single-precision matrix multiply), PairHMM (Pair Hidden Markov Model), a neural network (VGG-16, mainly the convolution and ReLU layer), SpMV (Sparse-matrix dense-vector multiply) and merge sort. Figure  1 shows the engineering time spent on high-performance programming of these workloads on an FPGA or CGRA, written in several languages. The time is collected from 2 companies and 1 school based on their real-world products and research. As we can see, the productivity to achieve high performance is low. It often takes several to tens of months 1 .
On temporal architectures, there is such a conflict between performance and productivity as well. But significant progress has been made to address the conflict from all perspectives of languages, compilers, libraries, runtime, auto-tuning tools and hardware [10, 11, 26, 31, 32, 51] . Particularly, the Halide language [11] well addresses the conflict in the domain of image processing, and in general, dense matrix computation.
A Halide program is a specification, including an algorithm and a schedule. The algorithm expresses a problem in a dataflow function. The schedule specifies how to optimize the function to run on hardware. Figure 2 (a) illustrates the concept with a simple example, where B is a 1-dimensional (1-D) floating-point matrix, f is an arbitrary function, and i is a loop index variable iterating with unit step from 0 to some upper bound (not included). B(i) refers to the i'th element of matrix B. And f(B(i)) returns the value of function f for input B(i). The code is to compute A(i) as f(B(i)), as specified by the algorithm in Line 1~4. Line 5 is the schedule that says loop i should be fully unrolled. A compiler accepts the specification, constructs the initial loop shown in Figure 2 (b) according to the algorithm, and performs unrolling according to the schedule. In general, Halide enables an expert programmer to separate the concerns of functionality and optimizations, and succinctly control a compiler to perform loop-nest optimizations. It has gained remarkable success in image processing on temporal architectures, mainly CPUs and GPUs.
Naturally, we wonder if we could adapt the philosophy of Halide, but move from temporal programming to spatial programming. A latest attempt in this direction, Halide-HLS [12] , extends Halide to target FPGAs. In a graph of (different) functions, a programmer specifies a sub-graph of functions to offload to an FPGA. A programmer can optimize the communication between the functions by specifying the usage of line buffer. Each function is optimized in standard loop nest transformations and its mapping to the spatial architecture is decided by an HLS compiler automatically.
Our focus is different. We would like to exactly control the spatial mapping of a single function. This is a common practice in HPC programming, and is challenging, as one can see from the engineering efforts shown in Figure 1 .
Usually, the workloads to accelerate on spatial architectures are simple. For example, all the workloads in Figure 1 can be simply expressed in 1 or a few mathematical equations, originally. However, a high-performance implementation can be incredibly complicated by optimizations, and have many spatial "pieces". One may get an intuition of the complexity by glancing at Figure 3 (b) without understanding the details, where the workload is mathematically defined by one simple matrix expression, C = A * B, while the high-performance implementation adds 9 helper functions, and constructed 4 systolic arrays and 2 usermanaged data caches. They are used to realize many loop and data transformations (Section 6.2).
Our observation is, no matter how complicated an implementation is, every spatial piece of it must be realizing a part of the functionality of the original workload, and they are communicating based on production-consumption 2 (1) The CGRA is similar to that in [7] . The FPGAs used are Stratix V for SpMV, and Arria 10 for the others.
(2) Since coding and verification phase are often intertwined, when estimation is difficult as in the case of PairHMM and VGG-16, their time is equally divided between the two. (3) For SpMV, performance tuning did not take any significant time, as the design was carefully crafted for high-performance for the first place. The implementation was during an internship and on prototypes of the first generation of Catapult card. A lot of verification time was due to bugs or immaturity of the underlying stack, e.g. DRAM controller. An implementation on today's mature Catapult stack would be much easier. (4) For SGEMM on FPGA, a break-up of the engineering time is not available, but most time was spent on verification. This commercial implementation took a lot of learning cost back then.
Today, if without the learning cost, the total engineering time is estimated to be about 5 months. In T2S, a programmer specifies a temporal definition and a spatial mapping. The temporal definition defines the functionality of the original workload, while the spatial mapping defines how to decompose the functionality and map the decomposed pieces onto a spatial architecture. The specification precisely controls a compiler to actually implement the loop and data transformations specified in the mapping. This approach has the following advantages:
Highly productive coding.
A specification is high-level and succinct. Many strategic loop and data optimizations can be specified easily. For a complex design with many loop and data optimizations, as is the common case in high-performance programming, writing a specification is far more productive than direct coding, which can be seen from the success of Halide [11] in HPC programming of CPUs and GPUs for image processing. Inspired by Halide, T2S introduces the same high productivity to spatial programming. A T2S specification, even for a complicated design, usually takes around 10s of lines (Section 6), which we expect an expert programmer to be able to write in hours instead of months.
Enabling systematic expression of high-performance designs via separation of concerns.
A high-performance spatial design can be specified in well-defined loop and data transformations. In every specification, we separate the concerns of temporal definition and spatial mapping. As one can see from Figure 3 (a) without understanding the details, a temporal definition is a dataflow function, equivalent to an unoptimized, sequential loop nest. A spatial mapping is composed of 3 parts: (1) loop transformations, which expose data locality and parallelism, (2) building a basic spatial layout, where the functionality of the transformed loop nest is split into pieces, i.e. spatial computations, connected based on their producer-consumer relationship, and (3) specializing every spatial computation, where more data and loop transformations can be applied independently.
Correctness guarantee (against the temporal definition).
A specification focuses on transforming loop nests and matrices. At such a high level, the specification lends itself to the compiler for automatic, static verification [24] . The verification ensures that the spatial mapping is semantically equivalent to the temporal definition 3 . This gets rid of the huge burden of verification from the programmer.
To understand the significance of the above advantages, let us look at Figure 1 again. As we can see, coding and verification together consume most of programmers' time, regardless of workloads, languages and platforms. Although our data set is small, this point likely holds in general, and is in accordance with our experiences. Verification is especially painful for software programmers: the debugging support of spatial architectures tend to be primitive, and the code is complicated by optimizations, and further complicated by timing, communication, and non-deterministic behavior. Even though performance tuning appears to be the most time-consuming phase in several case studies, it searches for the best configurations of the designs, and thus is mainly consuming machine time. It can be reduced by parallel search with more hardware resources or by auto-tuning [51] . So performancetuning is not really a critical issue. Therefore, by dramatically reducing the cost in coding and verification, T2S can fundamentally improve the productivity of high-performance programming on spatial architectures. We expect the coding and verification phase to be reduced from months down to hours.
T2S is generally applicable to dataflow algorithms. As long as a workload can be structured into a dataflow representation, T2S can be applied to express it (Section 4).
The paper proposes a new programming methodology. Implementing and evaluating it is the next step. We expect this paper will inspire a series of innovations in the fields of spatial programming languages and compilers.
In this paper, we assume a high-performance spatial design is provided, and our problem is how to express it. How to come up with such a design is beyond the scope of the paper, but a general principle is to fully exploit memory bandwidth and maximize computation to communication ratio [17~20] .
We remark that faster evolution of designs may be enabled: with the dramatic reduction of coding and verification cost, one does not have to come up with a best design upfront, but can start from a simple design, and evolve into a highperformance one. We leave this as a future research.
Below we may refer to a "spatial architecture" or "(accelerator) device" interchangeably.
The remaining of the paper is organized as follows: Section 2 motivates the T2S approach with a simple example. Section 3 introduces the core language features and some useful extensions. Section 4 discusses the applicability of the T2S approach in terms of so-called 13 "dwarfs" (common HPC compute patterns) [22, 23] . Section 5 describes the required compiler technology. Then in Section 6, we illustrate T2S with the several workloads shown in Table 1 . Section 7 compares T2S with related wok, and we reach a conclusion in Section 8.
Motivation
Let us look at Figure 2 In Figure 4 (a), we decompose the original loop quivalently into 3 pieces: They are all copies of the original loop, but specialized to implement only part of the original loop's functionality. The first piece loads B values from memory and feeds them into a 1 st channel --a first-in-first-out (FIFO) hardware structure on the device for fast communication; the channel is read by the middle piece for computation, which writes the resulting A values into a 2 nd channel; then the 2 nd channel is read out by the third piece, and the A values are stored to memory. The three pieces are running in parallel, each making forward progress whenever there are input data available.
If we name the three pieces as B_provider, A, and A_consumer, the basic spatial layout between them is shown in Figure 4 (b), where every arc represents a channel.
Note that the above process preserves semantics, and is transforming a temporal computation to spatial computations.
The corresponding T2S code can be shown in Figure 4 (c). Line 1~5 specify a dataflow function to compute A. It is similar to Line 1~4 of the Halide code in Figure 2 (a), except we declare and set the loop bound I explicitly.
Line 6~7 form a spatial mapping by separating out the producer of B and consumer of A via two new directives "isolate_producer_chain" and "isolate_consumer_chain". The new language features will be clear in the next section.
At this moment, one can intuitively understand the code by comparing it with Figure 4 Although not shown, the individual code pieces could be further transformed independently, as long as the semantics are still maintained. We will illustrate this point with realworld high-performance designs (Section 6).
Language features and usages
A specification is composed of directives that tell a compiler to represent and transform a computation. A small set of core directives can express many workloads, as listed in Table 2 . Some mini-examples are embedded inside to help understanding. We inherit several directives from Halide, including Parameter, Var, Func, tile, reorder, unroll and vectorize. The other new directives we introduce are highlighted in bold fonts.
Below we introduce several important usages of the new directives. Section 6 will use real-world designs to illustrate all the usages.
Static optimizations and dynamic checking.
Assumptions serve two purposes: static optimizations for performance and dynamic checking for correctness. The compiler may generate better-optimized code statically using the assumptions. For example, the compiler can generate cleaner tiled loops if the loop bounds are multiples of the tile sizes, as can be indicated by using the "divisible" directive. Using the assumptions, for correctness, the compiler may also generate host code that will check if the assumptions hold before offloading a workload to a device, to ensure meaningful results will be returned.
Enabling auto-tuning.
A symbolic constant, as declared in an assumption, is used for performance tuning, whose value can be passed to the compiler by, for example, an auto-tuner.
Constructing user-managed data cache hierarchy.
Creating a user-managed data cache is important for performance, especially for spatial architectures that have no hardware caches, like FPGAs. However, it is not easy to construct a cache manually. We make it easy by specifying a buffer at a given loop level. The compiler can automatically determine the buffer size and the read and write address functions (Section 5). For example, A.buffer(B, i) tells the compiler to build and initialize a buffer at loop level i for any B values used inside loop i. All the reads/writes of B inside loop i are redirected to reads/writes to this buffer.
Creating buffers at more than one loop level will let the compiler to build a multi-level data cache hierarchy. 4. Creating systolic arrays 4 .
Unrolling, followed by data forwarding, tells the compiler to construct a systolic array. Unrolling creates a set of hardware processing elements (PEs): For a loop nest under consideration, unrolling n (n ≥ 1) loops of it, whose number of iterations are N0, N1, …, Nn-1, respectively, results in N0 * N1 * … * Nn-1 number of PEs, each with a unique identifier (vector). Data forwarding connects the PEs by letting a PE sending data to another. With these explicit directives, the compiler implementation becomes easier: unlike a traditional compiler, a T2S compiler does not have to figure out an optimal partition [21] , or how to localize dependences [29] , as they have been replaced here with explicit unrolling and data forwarding, respectively.
Avoiding redundant memory accesses.
It is important to minimize memory accesses, and thus increase the computation to communication ratio, for high performance. Loop removal removes a loop from a loop nest that contains memory references. The removed loop's index variable must not be used in the memory references, and thus the removal of the loop avoids redundant memory accesses. However, it breaks the semantics of the original loop nest. Programmers need minor changes in the consumer side of the accessed values so that the semantics keeps unchanged. 6. Building a basic spatial layout.
Isolating a producer (or consumer) chain is to build a basic spatial layout. Usually, only a single producer (or consumer) is needed. However, if the produced (or consumed) data need to be preprocessed (or post- For high performance, randomly accessing the external memory of a device for input or output data is not desirable.
For input data, a general strategy is to serialize the data on a host CPU, transfer the data to the device through a memory channel, and let the device access the data sequentially. Further, a user-managed cache can be created to store the data in the internal memory of the device, and from there, the data can be randomly accessed. For output data, the strategy is similar, except that the data flow from the device to the host sequentially, and it is the host to de-serialize the data into the host memory. Beside the core directives, it can be beneficial to add some extensions for efficiency. For example, Table 3 lists some useful extension so that one may control what resulting data to store to memory, how deep a register channel is, and the usage of a line buffer.
Applicability
To understand the applicability of T2S, we looked at common compute patterns in HPC, the so-called "13 dwarfs" [22, 23] . We can classify them into 3 classes: 1. Compute patterns that are naturally amenable to dataflow representations. These patterns include dense linear algebra, dynamic programming, structured mesh, N-body, Monte-Carlo, spectral methods, and circuits. These patterns contain loops that can be easily tiled and unrolled into hardware PEs [1, 13, 14, 40~42] . These PEs are either independent or connected locally. We will exemplify the application of T2S to dynamic programming and structured mesh with Smith-Waterman (Section 6.1), and the application to dense linear algebra with SGEMM (Section 6.2) and convolution and ReLU (Section 6.3.1).
Compute patterns that may be made as dataflow.
These patterns include sparse linear algebra, graph algorithms, graph models, unstructured mesh, and backtrack-branch-bound. These patterns are not natural match of dataflow representations, as they contain indirect memory accesses, data-dependent parallelism, and behavior that can only be described imperatively. However, for each of these patterns, we can find a highperformance case study using an overall dataflow structure on FPGAs (We use FPGAs as a proxy of spatial architectures) [15, 39, 43, 44] . For example, for SpMV, a high-performance design [15] preprocesses a sparse matrix on the host; with the preprocessed matrix as input, the workload on the device side becomes much like a regular dense matrix computation. So with careful designing, these compute patterns may be mapped to dataflow structures. Often, there are PEs in a structure that have to be programmed imperatively, for example, PEs that have internal states. As long as such PEs can be encapsulated in a dataflow interface, and their internal states are not exposed outside, the PEs can be safely used in the dataflow structure 6 . We will exemplify the application of T2S to sparse linear algebra with SpMV (Section 6.3.2), and the application to graph algorithms with merge sort (Section 6.3.3).
Finite state machine.
This is an imperative pattern, and not suitable for dataflow, unless it is encapsulated entirely as a PE. Therefore, for any compute pattern to run on a spatial architecture, the real question one should ask is how to map it to a dataflow structure. As long as a dataflow structure is defined, T2S is generally applicable.
Compiler
As we said, a T2S program has a temporal definition and a spatial mapping. The compiler scans the T2S program sequentially, and thus reads the directives and implements them one by one. The temporal definition appears first. Accordingly, the compiler builds up an initial IR, i.e. an abstract syntax tree (AST) representing an un-optimized loop nest.
Then the spatial mapping appears, and is processed by the compiler as follows: 1. Transforming the loops in the IR.
According to the loop transformation directives, the compiler transforms the IR. This results in an optimized loop nest that will be used as the basis for every spatial computation to create next.
Building a basic spatial layout.
With the directives for isolating producers or consumers, the compiler splits the IR into multiple connected computations based on the producer-consumer relationship.
As we see from Row 11~12 of Table 2 , a connection is always a FIFO, i.e. a register or memory channel. A register channel is on-chip and is a widely used mechanism in HLS [46] . The compiler can automatically determine its depth. A memory channel can be constructed by sequentially writing to and reading from a virtually shared memory between the host and device. This virtually shared memory is also constructed by the compiler. 3. Specializing the individual computations. 7 The IR shown here is slightly different from Halide: We assume row-major instead of column-major storage format for functions and matrices for easier understanding. Using which order is an implementation choice, and does not constrain the methodology.
Row number & Notation Semantics
Temporal definition Create an on-chip buffer for B, at loop level i if specified and otherwise before all the loops of function A. The buffer can be double-buffer and/or implemented in registers, or by default a single buffer implemented with on-chip memory blocks. Data forwarding 14 A.relay(B, <d1, d2, …>)
In the systolic array of A, among the incoming B values, every PE keeps the values that belong to itself, and forwards the values that belong to other PEs to the neighbor PE whose identifier vector equals this PE's identifier vector + <d1, d2, …>.
At the boundary of the systolic array, the compiler may automatically wrap around a relay, or drop an unnecessary relay as an optimization [21] . For each spatial computation, there may be more directives given, including data caching, data forwarding, and further loop transformation, etc. The compiler correspondingly specializes an individual computation's IR. When unrolling is specified, the compiler creates multiple copies of the IR, assigns each a unique identifier and specializes the unrolled loops' index variables with the identifier, as illustrated in row 8 of Table 2 . If there was an input (or output) channel to the IR before unrolling, the channel is split into multiple channels as well, all connecting to the same source (or destination) as before unrolling. If subsequently, the source (or destination) is also unrolled, these channels will be further split. So with unrolling, one may create very sophisticated spatial hardware, like that shown in Figure 3 Besides the high-level optimizations directed by the T2S code, the compiler can transparently performs traditional low-level and target-specific optimizations.
For example, the compiler may software pipeline a loop, minimize the depth of a channel, remove redundant or dead code, etc. The compiler may "infinitize" a loop: e.g. in Figure 4 (a), since the middle piece no long uses the loop variable i and the piece works only when its input channel has values available, the compiler may transparently replace the loop "for i = 0 .. I" in the middle piece as an infinite loop, which uses less spatial resources. For another example, the compiler may generate efficient code for reduction, which is a very common pattern: For a reduction, the compiler can generate code to use a register to accumulate the result. Only when the reduction is completely done, the compiler lets the result in the register to be sent to another place, like a channel. If the reduction is inside a loop, multiple registers may be needed and for efficiency, the compiler can organize these registers as rotating registers (i.e. registers in a cyclic structure) [36] . This will be illustrated with SGEMM in Section 6.2. 6. Code generation.
The compiler generates a spatial hardware (image), and the code for host-side PEs and host-device communication. Overall, the compiler techniques for the above tasks exist today. A possible flow for implementing a compiler is shown in Figure 5 , leveraging existing techniques like Halide, LLVM and HLS or HDL compiler.
Case studies
In this section, we study in depth several representative, high-profile workloads as described in Table 1 (except we use Smith-Waterman instead of PairHMM for simplicity without losing any key points). Given a high-performance design for each of them, we show how to express the design in T2S on spatial architectures, all in ~20 lines of code or less. However, we should point out that there can be many designs for each workload, and our methodology is generally applicable, not limited to these specific designs.
These designs are for high-performance, and thus in general, have many optimizations. That is the nature of HPC programming, though. Fortunately, the optimizations they use are typical optimizations taught in college, and once understood, can be useful for many other designs. We will start simple, and guide the reader through the optimizations as gently as possibly.
Smith-Waterman
Smith-Waterman is a stencil computation, where to compute a data point, 3 neighbor data points are needed. The algorithm has important usage in bio-sequence alignment. It 
HLS compiler HDL compiler
Spatial hardware (image) Figure 5 . A possible compiler implementation flow has a dependence structure similar to that of PairHMM [13] , but its computation is simpler. So we use it to illustrate T2S for easier understanding. A high-performance design for Smith-Waterman, adapted from a commercial implementation of PairHMM [13] , is expressed in T2S in Figure 6(a) .
First, we specify a temporal definition (Line 1~8), which tells the compiler to build an IR for a loop nest as shown in Figure 6(b) , where f is a function (not shown), and S and T are bio-sequences to match. The code is self-explanatory. Note that only the last value of A is useful, indicated by "store(I -1, J -1)" in Line 8. So the compiler may avoid saving other A values.
Then we specify a spatial mapping as follows: 1. Transform the loops.
Line 11 tells the compiler to tile the loop nest as illustrated in Figure 6 (c). The inner two loops compare a pair of fixed-sized sub-sequences, and the outer two loops enumerate all such pairs. The inner two loops will be unrolled next. Such tiling followed by unrolling is a general strategy to decompose a big problem into subproblems such that each sub-problem can be solved with the available hardware resources on a device. 2. Build a basic spatial layout.
Line 13 tells the compiler to isolate a driver function as the producer of S and T, and as both the producer and the consumer of A. This results in a basic spatial layout as shown in Figure 6 (d). At this moment, no function has ever been unrolled yet.
Specialize individual computations.
Line 14 specializes function A. First, the innermost two loops are fully unrolled. This creates the orange-colored PEs shown in Figure 6 (e), where each PE has a unique identifier annotated on top of it, corresponding to the unrolled ii and jj loop's values. The two channels originally connected with function A are automatically split by the compiler so as to connect with these PEs of function A, instead. Second, the A values are forwarded in 3 directions: vertical, horizontal, and diagonal. The S and T values are forwarded horizontally and vertically, respectively. In effect, the unrolled PEs with the data forwarding between them consist of a systolic array. What about the data forwarding at the boundary of the systolic array? Along the 3 directions, if a value is coming from outside of the systolic array, it is from the producer of it, viz. the driver, via the bold black-colored channel in Figure 6 (e). Similarly, if a value is sent outside of the systolic array, it is to the consumer of it, viz. the driver as well, via a blue-colored channel in Figure 6 (e). With existing techniques [21] , the compiler can automatically connect the PEs to the right channels. Based on a producer PE's loop nest, the compiler knows the data to be sent over a channel, and thus determines the depth of the channel. For example, the compiler can figure out that at most MAX_J number of values will be accumulated in each of the blue-colored channels, and sets up their depths accordingly. Here MAX_J is a symbolic constant, an upper bound of J (Line 9).
Finally, Line 14 also tells the compiler to create a register buffer at loop level j for A values, which encloses the two innermost loops that have become the systolic array. That is, the compiler will create a buffer big enough to provide all the input A values for one execution of the systolic array. All the input A values are redirected to this buffer, and from there to the PEs appropriately.
SGEMM
The problem is to compute a matrix C given matrix A and B: C= A * B. The T2S code for a commercial design [1] is shown in Figure 3(a) . First, we specify a temporal definition (Line 1~7). The corresponding IR built by the compiler is shown in Figure 7 (a).
Line 10 of Figure 3 (a) tiles each loop two times. The IR is manipulated by the compiler correspondingly, and the result is shown in Figure 7 (b).
To understand the purpose of the tiling, look at Figure  7 (c). The basic idea is, through tiling of the loops, to block the matrices; then multiply a row of A blocks with a column of B blocks to compute one C block, and repeat the same process to compute all C blocks. In more detail, the matrices are blocked at two levels. At the first level, the matrices are blocked in rows and columns, leading to "blocks". At the second level, each block is further blocked in rows and columns, leading to "sub-blocks". That is why the loops are tiled twice. The multiplication of a row of A blocks with a column of B blocks is done via rank-1 updates. That is, take the 1 st column of A sub-blocks and the 1 st row of B sub-blocks, which are highlighted by the green and blue backgrounds in Figure 7 (c), respectively, compute their outer product, and accumulate the result into the current C block. Then take the 2 nd column of A sub-blocks and the 2 nd row of B sub-blocks, etc. 2. Build a basic spatial layout.
Line 12 in Figure 3 (a) separate out the functionality of the loop nest in Figure 7 (b) into multiple functions, according to their producer-consumer relationship. This results in a basic spatial layout. There are 3 communication paths between the host and the device, composed of memory and register channels. We have illustrated two of the paths in Row 11~12 of So far, every function is a single PE itself. We can specialize the functions individually. Some of them may be specialized into systolic arrays, and some of them may get their loops transformed further. a) Specialize function C.
First, unroll the loop nest of function C along two middle loops ii and jj (Line 13 of Figure 3(a) ). This creates a 2-dimensional systolic array, the orange PEs in Figure 3(b) . All the PEs, however, are accessing memory, which is not efficient. Instead, a PE at the left boundary of the systolic array can load A values from memory, and forward them to its right neighbor, i.e. along the direction vector of <0, 1>. Similarly, a PE at the top boundary of the systolic array can load B values from memory, and forward them to its neighbor below, i.e. along the direction vector of <1, 0>. Also similarly, a PE at the bottom boundary of the array can forward a resulting C data to its neighbor above, i.e. along the direction vector of <-1, 0>. Such data forwarding between PEs are dictated by the "relay" directives (Line 13 of Figure 3(a) ). Consequently, all the PEs in the 2-D systolic array get connected via channels. b) Specialize helper functions for A.
These helpers consist of A_serializer, A_loader and A_feeder. They are responsible to smoothly feed matrix A's data into the 2-D systolic array. Since A_serializer is on the host side, while A_loader is on the device side, the compiler automatically creates a memory channel between them, as highlighted in a green-colored dotted line in Figure 3(b) . The memory channel is implemented by using the host and device memory: the compiler instruments A_serializer so that the stream of A values loaded by A_serializer are written into the host memory sequentially, then copied to the device memory (by the host-device communication code also generated by the compiler), which is sequentially read by A_loader. This in effect "serializes" matrix A.
There is a subtle point here: A_serializer should avoid writing the same data twice for performance. That is what Line 14 of Figure 3 (a) does: all the loops not related with matrix A are removed. Similarly, A_loader removed all such loops except loop j (Line 15 of Figure 3(a) ). Loop j is kept because the device has limited on-chip memory to hold the data. So the same set of data have to be reloaded for each different loop j iteration. Now it is clear that although a memory channel is conceptually a FIFO, the data stream from the producer may be "reloaded" in the consumer side, because it is implemented in memory. Further, the reading of the memory channel by A_loader is vectorized at the innermost loop level (Line 15 of Figure  3 (a)): because matrix A is serialized, one load can load multiple data from the contiguous memory locations. Finally, in Line 16 of Figure 3 (a), loop ii of A_feeder is unrolled into another set of PEs. Incoming data are forwarded from a PE to the next if the data do not belong to the current PE, and otherwise are kept in a double buffer in the current PE. The compiler automatically figures out to which location in the double buffer a data item should be written. This in effect "de-serializes" matrix A. c) Specialize helper functions for B.
These functions are specialized in the same way, as specified in Line 17~19 of Figure 3 (a). d) Specialize helper functions for C.
These helpers consist of C_collector, C_unloader and C_deserializer (Line 20~22 of Figure 3(a) ). They are responsible for smoothly draining matrix C's data into the host memory. They are also similar to A's helpers, except the data flow from the device to the host. A subtle point here is the "remove(k, kk, kkk).reorder(jj, jjj, iii)" directives in them. These directives remove the loops unrelated with C and move loop jj into the innermost level. Remember that jj is the horizontal dimension of the orange-colored 2-D systolic array (See Figure 3(b) ). As such, one data item is read from one column of the 2-D systolic array, and then another data item from the next column, etc. In this way, no column of the 2-D systolic array will be stalled for too long a time in draining its data.
Convolution and ReLU, SpMV and merge sort
We briefly introduce how T2S is applied to these workloads, highlighting only the key points. We leave a detailed description to the Appendix.
6.3.1
Convolution and ReLU Convolution layer is the most compute-intensive part in a convolutional neural network, often followed by a ReLU layer. Here we express a high-performance design in the literature [14] . Following the T2S coding template, we first specify a ReLU function, as well as a convolution function, whose output is used by the ReLU function. That is, the convolution and ReLU function have producer-consumer relationship. We have two loop nests: one loop nest for each of the two functions. First, we tile some loops of the two loop nests. Then the two loop nests (functions) are specialized individually.
For the ReLU function, two loops in it are unrolled to form a 2-D systolic array, and a buffer is added at certain loop level for caching its output.
For the convolution function, three loops in it are unrolled to form a 3-D systolic array. Two buffers, one a regular buffer and the other a line buffer, are added at an outer and an inner loop, respectively, to build a two-level data cache hierarchy, from which the PEs of the 3-D systolic array read their input.
Before unrolling, the two functions are connected via a channel. After unrolling, the channel is naturally split into many channels, connecting the 2-D and 3-D systolic array's PEs as producers and consumers. It can be complicated for human, but is mechanical for the compiler (Section 5).
6.3.2
SpMV SpMV is to compute y = A * x, where x and y are dense column vectors and A is a sparse matrix. Computing patterns using sparse matrices have irregular memory access, which implies low utilization of the bandwidth of the external memory of the device. Also, the patterns have data-dependent parallelism, which means that some loops of the patterns might have dynamic trip counts. For example, in SpMV, when a row of the matrix A is multiplied with the vector x, the length of the row (i.e. the number of non-zeros in the row) is data dependent, which is not statically known.
To address these issues, a high-performance design of SpMV [15] preprocesses a sparse matrix on the host. Conceptually, after preprocessing, SpMV becomes much like a dense matrix computation, as shown below:
11 for c = 0 .. C for r = 0 .. NUM_SLOTS y(decoder(row_lengths)) += A'(c, r) * x(column_ids(c, r)); Here NUM_SLOTS is a compile-time constant, representing the number of physical memory slots that can be accessed in parallel from the device, C is a data-dependent runtime constant, and A' is the preprocessed matrix: multiple rows of the original matrix A have been statically scheduled into a single row of A'; there are totally NUM_SLOTS of rows in the new matrix A' -one row for a memory slot, and all the rows of A' have been padded so that they all have C number of columns. The lengths of the rows of the original matrix A have been recorded into another matrix named row_lengths. There is a new function (decoder) and another matrix (column_ids) tell from which row and column of the original matrix A, respectively, the current A' element comes from. For that purpose, the decoder function reads matrix row_lengths and deduces the current row's index.
After the preprocessing, the memory access patterns of A', column_ids, and row_lengths are regular and contiguous.
As we can see, the new loop nest has an overall dataflow structure. However, the "decoder" is not purely functional: it contains internal states. Besides, we will isolate the loading of all the matrices (A', column_ids, and row_lenghts) into a spatial piece called "matrix_fetcher", which is not exactly a dataflow actor: the matrix_fetcher fetches a data item from each of the matrices every time, and thus may finish fetching matrix row_lengths earlier as it is shorter than the other two matrices. In other words, one of the 3 inputs may no longer contain tokens at some time, but the matrix_fetcher still needs to fire (for the other two inputs). We can write the decoder and matrix_fetcher imperatively and encapsulate them as if they were dataflow functions.
We leave the details of SpMV to the Appendix.
6.3.3
Merge sort Merge sort is a reduction in tree style. Here we express the merge sorter tree in FPGASort [16] . A number of data streams are read simultaneously, and through a binary tree, merged into a single stream. At each level of the tree, the input streams stay in FIFOs (i.e. channels). The closer the level is to the root of the tree, the deeper the channels. Each tree node merges two incoming streams. Such a node has internal states and must be written in an imperative language, but it can be properly encapsulated into a dataflow function.
In T2S, all the tree nodes can be expressed as the PEs after unrolling some loops. The tree shape can be constructed by forwarding data from a node at some position m of a tree level to a node at position m/2 of the next higher tree level. This in effect constructs a tree-like systolic array. The depths of the channels at each tree level can be specified using the "set_min_depth" directive in Table 3 .
Related work
Related work comes from languages and their compilers, and programming paradigms. The relevant languages include HDLs, HLS languages, and domain-specific languages (DSLs).
Dataflow programming and communicating sequential processes (CSP) are relevant programming paradigms.
HDLs mainly include Verilog and VHDL. They describe a circuit at the register-transfer level (RTL) with explicit timing. They can be compared to "assembly languages".
HLS languages have a higher abstraction. An HLS program gives an algorithmic description of a desired behavior. The behavior is usually decoupled from clock-level timing. There are many HLS languages [8, 45~50] . The most common ones are based on standard languages such as C/C++/SystemC/Matlab. Source code is analyzed, architecture-constrained, and lowered down to an HDL.
Compared with a HLS program, T2S code is more succinct and at a higher abstraction level. T2S constructively specifies the behavior of a spatial hardware, and lets the compiler to generate details according to the specification, instead of letting the programmer directly write the details.
Another major advantage of T2S over HLS is that a T2S specification lends itself to static, automatic verification.
DSLs also have a higher abstraction level than HLS languages. Such languages express and optimize an algorithm in predefined domain-specific patterns, and lower the patterns into an HDL [11, 12, 30 , 33~35]. The Halide [11] and Halide-HLS [12] DSLs have been introduced in Section 1. T2S may be implemented based on them.
Related compiler techniques have been discussed in Section 5.
Dataflow programming [38] models a computation as a direct graph, where nodes are operations, edges are channels, and data flow from nodes to nodes along the edges.
CSP applies dataflow to describe a concurrent system with component processes working independently and communicating with message passing via channels [37] .
T2S can be viewed as a succinct way to construct a special CSP system, where every process (i.e. PE) is the result of isolation and specialization from a single loop nest.
Conclusion and future work
"It is remarkable how complex a simple computation can be when performance is at stake" [31] . High-performance programming is complicated by optimizations. We have shown a programming methodology, namely T2S, which substantially reduce the complexity of high-performance spatial programming by separating the concerns of temporal definition and spatial mapping. T2S enables programmers to succinctly specify strategic loop and data optimizations, and leaves to a compiler the implementation of these optimizations, the verification of them, and all the other lowlevel and target-specific optimizations. In this way, we show the promise of fundamentally improving the productivity of high-performance spatial programming.
We are planning for implementing a language and compiler for T2S based on Halide/Halide-HLS.
A future research is to specify a workload in a simple or familiar language (e.g. in un-optimized C/Python or Alpha [29] code), and have a tool (based on e.g. roofline or polyhedral model like PolyMage [32] ) to make the decisions on loop and data transformation, and thus automatically generate T2S code -for even higher productivity. .
isolate_producer_chain(T, driver) .isolate_producer_chain(A, driver)
.isolate_consumer_chain(A, driver); F2.define_extern( F2_impl , void, {int i, channel int channel1, channel float channel2}); The first parameter (int i) is explicit, as the compiler sees from "F2(i, A)". The two channels are implicit parameters: the first channel (channel1) is created when the compiler sees "F2.isolate_producer_chain(A, F3)"; the second channel (channel2) is created as the output channel of F2 at the end (For convenience, we assume the output channel of a function is always created the last).
An implementation of F2 should follow the above functional interface, and be realized in an imperative language. The compiler will automatically replace any invocation of function F2 with that implementation. In the above example, F3 is the result of isolation. Usually, if a function F is isolated from another function that has a loop nest specified, the compiler can automatically determine the loop nest for F. But in the above example, since the original function from which F3 is isolated, i.e. function F2, is defined imperatively and has no loop nest specified, the compiler cannot determine F3's loop nest. Therefore, F3 needs to be defined imperatively as well. It can be specified as follows:
channel int out}); Here the first parameter is the integer matrix A, which appears explicitly in "F2.isolate_producer_chain(A, F3)". The compiler will automatically associate matrix A with the first parameter for the implementation to be invoked. The other parameter is implicit. It is the output channel the compiler created for F3.
Below we look at a few workloads in detail. The two loop nests are further isolated into spatial pieces (Line 14~16 of Figure 8(a) ).
A.1. Convolution and ReLU

Specialize individual computations.
ReLU is unrolled into a 2-D systolic array, and the results of ReLU are buffered at an outer loop level y (Line 17 of Figure 8 (a)). Convolution is unrolled into a 3-D systolic array (Line 18). As the convolution and ReLU function have producer-consumer relationship, there is implicitly a channel between them according to our convention. As unrolling goes, the channel is split into many channels, and the PEs in the two systolic arrays are naturally connected with the channels, according to their producerconsumer relationship. For the input loader, some unrelated loops are removed to avoid redundant loads (Line 19). For compensation, a buffer is built at loop level y on the side of the consumer (i.e. the input feeder) (Line 20 Figure 8(a) ). The buffer and the line buffer are located at an outer and an inner loop, respectively, and thus the compiler knows to fill the line buffer from the buffer. This in effect builds a two-level user-managed data cache hierarchy. This is a general way to construct a multlevel user-managed data cache hierarchy. Of course, in general, each level can be either a buffer or line buffer. The loader and feeder for the weight are similar (Line 21~22 of Figure 8(a) ). 
A.2. SpMV
This SpMV design is based on literature [15] . The temporal definition of SpMV is shown in Line 1~7 of Figure 9(a) . In this definition, we pre-split the multiplication operation out as a "product" function, because so far, we isolate functionality based on data, not on operation. So there are two functions, product and y, defined. Their corresponding loop nests are shown in Figure 9 (b).
The spatial mapping is as follows: 1. Build a basic spatial layout.
A "matrix_fetcher" is isolated out to fetch all the matrices, including A', column_ids, and row_lengths (Line 8~11 of of Figure 9 (a)). 2. Specialize individual computations.
In Line 14 of of Figure 9 (a), first, the inner loop of function y is tiled with a factor NUM_ACCS, which is the number of accumulators in the design. Then the tiled inner loop is unrolled to create NUM_ACCS number of accumulators (a 1-D systolic array), each will be responsible for accumulating results for NUM_SLOTS/NUM_ACCS number of rows. Such an accumulator was named as a "fused accumulator" in the literature [15] . The accumulated results of y are buffered outside the loop nest of function y. In Line 15 of of Figure 9 (a), the product function's inner loop is fully unrolled into another 1-D systolic array, and the input vector x are buffered outside of the loop nest of the product function. The spatial hardware corresponding to the T2S code is shown in Figure 9 (c).
As we said in Section 6.3.2, decoder is not purely functional and matrix_fetcher is not strictly a dataflow actor. So we write both of them imperatively, and encapsulate them as dataflow functions. The correspondence of the two functions to their imperative implementations is shown in Line 16~17 of Figure  9 (a) using the "define_extern" directive. Figure 10 conceptually shows the imperative implementations of the decoder and matrix_fetcher, where NUM_ROWS is the number of rows in the original matrix A.
A.3. Merge sort
The temporal definition is shown in Line 1~7 of Figure  11 (a).
In this temporal definition, the original input is a 1-D array of integer arrays (Line 1). One can vision each of the input arrays as a stream of data.
There are two symbolic constants, L and S (Line 2). L is the total number of levels in the merge sort tree. S is the size of one input array to a node at the first level. For the next level, the size of an input doubles.
Each node in the merge sort tree is located by two parameters, one for the level in the tree, the other for its position at the level. The node's level is 1 less than the level of its parent node, if any. The node's position at its level divided by 2 is its parent node's position in the next level.
Every node merges two streams of data. The merge function used in Line 5~6 has 3 parameters: the first parameter is the size of one input to the node, and the other two parameters are the input channels. This merge function will be implemented imperatively.
The original loop nest corresponding to the temporal definition is shown in Figure 11 (b). Now we define a spatial mapping: 1. Build a basic spatial layout.
In Line 8~9 of Figure 11 (a), a loader for the input is isolated out. 2. Specialize individual computations.
In Line 10 of Figure 11 (a), the output function's loop nest is fully unrolled at both loop levels. Then a PE (i.e. a tree node) forwards data toward its parent PE (i.e. the parent node of the tree node, if any). The PE is indexed by the two unrolled loop variables l and m. Its parent PE, if any, is then indexed by l+1 and floor(m/2). Therefore, the direction vector for forwarding the data is <1, floor(m/2) -m>. As we said, each PE merges two data streams. This merging is not functional: it compares two data items from the two streams, writes the smaller one to the output, and keeps the bigger one for the next comparison. That is, the merging has an internal state. So in Line 11, the merge function is specified via a define_extern directive. The spatial hardware corresponding to the T2S code is shown in Figure 12 (a). A conceptual implementation of the merge function is illustrated in Figure 12 (b). 
