Abstract-The problem of synthesising gate-level descriptions of digital circuits from behavioural specifications written in higher-level programming languages (hardware compilation) has been studied for a long time, yet a definitive solution has not been forthcoming. The emphasis of the first part of this tutorial is methodological, arguing that one of the major obstacles in the way of hardware compilation becoming a useful and mature technology is the lack of a well-defined function interface model, i.e. a canonical way in which functions communicate with arguments. In the second part we present a solution based on the Geometry of Synthesis, a semantics-directed approach to hardware compilation.
I. INTRODUCTION
The problem of hardware compilation turned out to be surprisingly difficult. Although the pioneering work of van Berkel [1], Page, Luk and the Oxford hardware compilation group [2], [3], [4] yielded promising initial results, more than a decade later this technology has yet to enter the mainstream of digital design. Several C-to-gates hardware compilers, such as ROCCC [5], CASH [6] and HANDELC [7], have been developed, but their take-up was limited.
In this paper, when we refer to existing hardware compilers we mean compilers such as those above. We will not refer to design flows based on SYSTEMC 1 or COWARE 2, hardware compilers based on process calculi (e.g. [8], [9]), or higher-order structural languages such as LAVA [10]; these are interesting and useful, but conceptually different ways of approaching VLSI design.
Why did hardware compilers not meet expectations? Part of the answer has to do with performance, as higher-level behavioural design is unlikely to be as efficient as handcrafted design. However, the economics of reconfigurable architectures such as field-programmable gate arrays (FPGAs) or other complex programmable logic devices (CPLDs) are such that the design costs can often become the overriding concern, allowing for weaker performance as a trade-off, especially since their competition is often not custom-designed application-specific integrated circuits (ASICs) but (embedded) software. This is a well-known consideration, so hardware compilers often target reconfigurable architectures.

1 www.systemc.org
2 www.coware.com

Another performance-related issue is that of concurrency; the performance advantage of custom hardware comes from its potential for massive parallelism, rather than high clock rates. Conventional programming languages that can serve as candidates for compilation into hardware (C, Java, etc.) either have no built-in support for concurrency or offer unsuitable concurrency models such as threads or processes. This is a serious dilemma, because a concurrency model needs to be both low-level enough to reflect the underlying capabilities of the platform ("close-to-metal"), and thus give predictable performance, and high-level enough to unburden the programmer from detailed management and book-keeping of resources. An example of a successful trade-off of this kind is Nvidia's CUDA platform 3.

While the arguments above carry some strength and reflect the conventional wisdom on the topic, we will argue that the key problems to be resolved are of a qualitative rather than quantitative nature. Namely, one such key problem is the lack of canonical function interface models (FIM) in hardware. We will briefly examine their traditional role in software compilation and operating systems, and explain the possible reasons for their absence in hardware compilers.
We also propose a simple but non-trivial FIM, illustrate it with several typical examples and explain how such FIMs can be canonically designed.

In modern programming-language theory the notion of type is paramount. Beyond just classifying data in various categories such as integers, floating-point numbers, strings, etc., types fill a much more fundamental role. It is difficult to give a better summary that does justice to the importance of types than Robert Harper's 4:
Over the last two decades type theory has emerged as the central organising framework for the design and implementation of programming languages. Type theory (the study of type systems) provides the theoretical foundation for safe component integration. In the words of John Reynolds, "a type system is a syntactic discipline for enforcing levels of abstraction". By syntactic we mean that the type system is part of the program, rather than purely in the mind of the implementer. By discipline we mean that type restrictions are enforced; ill-typed combinations are ruled out by the type system. By levels of abstraction we mean the clean separation between conceptually distinct data objects that may, in a particular program or compiler, have the same or similar representations.
Implicit in this definition is an important principle: the power of a type system lies as much in what it precludes as what it allows. The most powerful type system of all is the one that rules out all programs. However, such a type system, while surely preventing safety violations, is hardly very useful. The goal of type system design is to increase expressiveness by admitting useful programs, but without compromising safety.

We will see in the following how type systems are indeed essential in hardware compilation of programming languages because of the special challenge of "safe component integration" in hardware, as opposed to software.
II. FUNCTION INTERFACE MODELS
Functions and related concepts (procedures, methods, subroutines, etc.) are the main mechanism for implementing abstraction in a programming language. The importance of using functional abstractions lies at the core of any good programming methodology and its benefits have long been established. Functions play a fundamental role in the operation of a conventional stored-program computer and they were in fact a feature of the first such computer, the EDSAC [11]. Except for very early models such as the HP 2100, microprocessors have always supported function calls directly in their instruction set (e.g. Intel's x86) or at least provided instructions for stack management, meant mainly to implement function calls (RISC architectures).
Two of the early mainstream programming languages (FORTRAN and LISP) provided support for functions. The most advanced support for functions was to be found in LISP. Not only does it support higher-order functions (functions that take functions as arguments) but it also introduced the new concept of a Foreign Function Interface (FFI): a way for a program written in LISP to call functions written in another programming language. This idea was adopted by all subsequent programming languages that had mature compilers, under one name or another. A special and privileged position is that of the C programming language; because of the close relationship between C and the operating system its own calling convention is usually implemented directly by the OS, under the name application binary interface (ABI).
One of the decisions of the C and Unix designers with the farthest-reaching consequences was to make the details of the calling convention standard and public [12] . The positive effects of this decision cannot be overstated, as it massively improved the interoperability of applications by opening the OS functionality to other programming languages and by facilitating support for stand-alone language-independent libraries.
In this paper, to avoid ambiguity, we introduce the terminology of function interface model (FIM) to encompass the closely-related notions of the FFI and ABI.
A. FIMs and hardware compilation
It is taken as a given that stored-program computers must offer well-defined FIMs. It is inconceivable to design a modern operating system or compiler if this fundamental requirement is not met. On the other hand, in the world of hardware the concept of a FIM simply did not arise. The net-lists (boxes and wires) that are the underlying computational models of hardware languages are not structured in a way that suggests any obvious native FIM.
The abstraction mechanism common in hardware languages, the module, is a form of placing a design inside a conceptual box with a specified interface. This does serve the purpose of hiding the implementation details of the module from its user, which is one of the main advantages of abstraction. However, a module is a form of abstraction weaker than functions:
1) Modules are always first order. A module can only take data as input and not other modules; in contrast, functions can take other functions as arguments.
2) Modules must be explicitly instantiated. The hardware designer must manually keep track of the names of each module occurrence and its ports, which must be connected explicitly by wires to other elements of the design. In contrast, the run-time management of various instances of functions is handled by the OS (using activation records) in a way that is transparent to the programmer.
3) Sharing a module from various points in a design is not supported by hardware design languages. Ad hoc circuitry must be designed in order to achieve this. The lack of language support for sharing makes this a laborious and error-prone task.
These limitations may seem inconsequential, and in a certain sense they are so. They are not limitations on the expressiveness of hardware design languages or the performance of the circuits synthesised. However, the impact of these limitations becomes more serious as designs become more complex. For the design of, say, an adder the limitations above are irrelevant. But if the design is an implementation of a complex algorithm that uses many instances of the same kind of modules, interacting in non-trivial ways, the burden of micro-managing modules and ports and the inability to use generic algorithms can become onerous. The burden of managing sharing is also substantial, and the conventional solutions to this problem, such as bus or network-on-chip (NOC) architectures, are complex, heavyweight and not provided with language-level support.

Generally, existing hardware compilers are not built around well-defined FIMs. What would be unthinkable in the world of software compilation is the norm in the world of hardware compilation. But this is not entirely surprising, considering the fact that hardware does not have a "native" FIM to offer. Its absence damages the usability and the performance of the compilers, imposes restrictions on interoperability and prevents library support.
Let us briefly consider how functions are managed in the hardware compilers mentioned earlier. Most compilers (ROCCC is an example) handle functions on the cheap, via inlining, i.e. the textual substitution of the body of the function at the point of call. This is not only inefficient, requiring the function body to be replicated at each call, but also incompatible with separate compilation, pre-compiled libraries and interfacing via function calls with native HDL-based IP cores. At least one compiler (CASH) uses a token-based mechanism which represents a genuine attempt at a FIM, but all function calls must be indirected via a global memory, which is too heavyweight and inefficient. A particularly ambitious attempt is HANDELC, which uses an efficient, but incorrect, way of sharing circuitry statically via functions. It is incorrect because HANDELC's sharing of function calls in the presence of concurrency leads to race conditions on the input ports of the synthesised circuits, which in turn lead to erratic behaviour. For example, the code x=f(1)||y=f(0) stores the same value in registers x,y because only one of the two arguments (1 or 0) is passed to function f(), which is shared in the concurrent context.
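The race on a shared input port can be illustrated with a small cycle-level sketch in Python. This is not HANDELC code and does not model the output of the HANDELC compiler; the names `SharedCircuit`, `drive` and `run_parallel_calls`, and the "last driver wins" rule, are assumptions made purely for illustration.

```python
class SharedCircuit:
    """A function body synthesised once, with a single shared input port."""

    def __init__(self, func):
        self.func = func
        self.input_port = None  # one physical port shared by every caller

    def drive(self, value):
        # Assumption: if several callers drive the port in the same cycle,
        # the last driver wins. Real hardware may resolve this differently.
        self.input_port = value

    def output(self):
        return self.func(self.input_port)


def run_parallel_calls(circuit, arguments):
    """Model `x = f(a) || y = f(b)`: all callers drive the port in one cycle."""
    for arg in arguments:  # same clock cycle, no arbitration
        circuit.drive(arg)
    # Every caller then samples the one shared output.
    return [circuit.output() for _ in arguments]


f = SharedCircuit(lambda v: v + 10)
x, y = run_parallel_calls(f, [1, 0])
print(x, y)  # both registers receive the same value
```

Whichever argument survives on the port, both callers observe the same result, which is the erratic behaviour described above.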
III. STATIC INTERFACES FOR HARDWARE FIMS
The language of hardware design is essentially a language of topological graph-based diagrams. By graph-based diagrams we mean just the usual concept of two-dimensional drawings of boxes connected by wires. By topological we mean that at the preferred level of abstraction we do not want to consider the geometrical properties of the diagram (placement, size, orientation) but only what box is connected to what box. In a technical sense this information is topological because it is preserved by all smooth deformations of the diagram.
The so-called structural fragment of conventional hardware description languages (HDL) such as VHDL or VERILOG is a diagram description language. It works by naming explicitly all types of boxes used in the circuit, including their ports, and specifying connections either by giving ports the same name or by introducing additional entities called wires which realise connections also through name matching. For example, the circuit below
is described in typical HDL syntax as:

    module dw(a1, a2, f, z1, z2);
      input a1, a2, f;
      output z1, z2;
      wire w1, w2;
      xor x1(f, z2, w1);
      xor x2(f, z1, w2);
      c c1(a1, w1, z1);
      c c2(a2, w2, z2);
    endmodule

What is typical of this syntax is the use of a large number of names, which obscures the topological structure of the circuit. The first step towards discovering the FIM of hardware is a better structured diagrammatic notation.
Diagrams can be specified using algebraic rather than names-based representations. In the area of circuit design, the pioneering work of Sheeran and her collaborators is remarkable, and it culminated in the Haskell-based system LAVA [10]. The combinators of LAVA correspond to an algebraic system called traced monoidal categories, which was later formalised independently by Joyal, Street and Verity [13]. These combinators formalise in an elegant way the intuitive ideas of composition and feedback in circuit diagrams. However, if the functional structure is to be made perspicuous an even more appropriate set of combinators exists, based on the compact closed categories studied by Kelly and Laplaza [14]. Other categories of diagrammatic combinators exist, and some could be more suitable for different purposes, as surveyed by Selinger [15].
Without delving into the precise technical definitions we will express the ideas of compact closed categorical combinators directly in the language of digital circuit diagrams.
The first basic notion is that of interface, taken as a list of ports with assigned input or output polarities, which we shall write as A, B, C, . . .. We write I for the trivial interface with no ports.
The basic operation on interfaces is input-output reversal, written $A^*$, which is applied to its ports element-wise. Note that this is obviously an involution, i.e. $(A^*)^* = A$. Another operation on interfaces is concatenation, which we shall call tensor and write as $A \otimes B$.
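The two operations on interfaces can be sketched in a few lines of Python. The representation of a port as a (name, polarity) pair and the function names `port`, `dual` and `tensor` are assumptions chosen for this illustration only; the port names are those of a component video connector.

```python
# A port is a (name, polarity) pair; an interface is a list of ports.
I = []  # the trivial interface with no ports


def port(name, polarity):
    """A one-port interface; polarity is 'in' or 'out'."""
    return [(name, polarity)]


def dual(a):
    """Input-output reversal A*, applied to every port."""
    flip = {'in': 'out', 'out': 'in'}
    return [(name, flip[pol]) for (name, pol) in a]


def tensor(a, b):
    """Concatenation of two interfaces, written A (x) B."""
    return a + b


video_out = tensor(port('Y', 'out'), tensor(port('Pb', 'out'), port('Pr', 'out')))

print(dual(video_out))                     # the matching input socket
print(dual(dual(video_out)) == video_out)  # reversal is an involution
```

In this representation the trivial interface `I` is a unit for `tensor`, and `dual` distributes over `tensor` port by port.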
For a very concrete example, a component video out interface would be written in our notation as $Y \otimes P_b \otimes P_r$:
A useful auxiliary operation on interfaces is defined in terms of the previous two: $A \Rightarrow B = A^* \otimes B$. Circuits are identified by their name and their interface and are written as $f, g, h, \ldots$. If we write $f : A \to B$ we mean that $f$ is a circuit with interface $A \Rightarrow B$ and we draw it as the following diagram:
Note that, by convention, the interface drawn on the "left" side of the circuit has reversed polarities. This is a generalisation of the HDL convention where inputs are usually drawn on the left and output on the right. Circuits can be composed serially or in parallel. If f : A → B and g : B → C then f ; g : A → C, as in this diagram:
A special circuit $id_A : A \to A$ consists only of wires connecting the ports on the left to the corresponding ports on the right. Note that even in this simple setting many algebraic properties hold, such as associativity of composition $(f; g); h = f; (g; h)$ or the functorial interplay between parallel and sequential composition $(f \otimes g); (f' \otimes g') = (f; f') \otimes (g; g')$. The latter represents two alternative but equal ways of describing the diagram
An important class of circuits are those called isomorphisms. These are circuits consisting of wires only, such as the identity. However, whereas an identity only operates between two copies of the same interface, an isomorphism exists between interfaces of similar shape, more generally construed. If an isomorphism between two interfaces exists we write $\alpha : A \cong B$. Isomorphisms are written using Greek lower case letters.
A good intuition for isomorphisms is that of an adaptor. The fundamental property of an isomorphism is that it is invertible, i.e. if there exists $\alpha : A \cong B$ there exists $\alpha^{-1} : B \cong A$ such that $\alpha; \alpha^{-1} = id_A$ and $\alpha^{-1}; \alpha = id_B$. This intuition is neatly illustrated by reverse socket adaptors:
In diagrams as used for circuits certain isomorphisms always exist, in particular $\gamma_{A,B} : A \otimes B \cong B \otimes A$, which means that such diagrams enjoy a property called braiding. Moreover, we have the property that $\gamma_{A,B} = \gamma_{B,A}^{-1}$, which is a stronger property called symmetry and which establishes that the order in which we list the ports in an interface is essentially immaterial, as the right adaptors always exist. This axiom is illustrated graphically by this diagram:
As an aside, diagrams where braiding exists but not symmetry are known as knot diagrams. If we want to reason about knots then obviously a braiding as above should be distinguished from the identity.
Note that braiding and composition also interplay nicely, in a way which is said to be natural and exemplified, for $f : A \to A'$ and $g : B \to B'$, by the equality $(f \otimes g); \gamma_{A',B'} = \gamma_{A,B}; (g \otimes f)$. This equality establishes that these two diagrams below are equal:
The last component in our diagram construction kit describes a notion of feedback and it says that for any interface $A^* \otimes A$ there exist feedback loops to the left, written as $\eta : I \to A \otimes A^*$, and to the right, written as $\varepsilon : A^* \otimes A \to I$.

Kelly and Laplaza [14] have proved two essential results about compact closed combinators. First, any graph-like diagram of boxes and wires can be expressed in the language of compact closed categories and second, two diagrams are equal up to graph isomorphisms if and only if their categorical description can be proved equal in the equational theory of compact closed categories. In other words, the language of compact closed combinators is ideally suited for expressing box-and-wire diagrams.
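The equations governing serial composition, parallel composition and braiding can be checked on small examples with a Python sketch. As a simplifying assumption, a circuit is modelled here as a function from input bits to output bits, which covers only feed-forward circuits; the feedback loops of the compact closed setting are not modelled. All names (`Circ`, `seq`, `par`, `ident`, `swap`, `same`) are hypothetical.

```python
from itertools import product


class Circ:
    """A feed-forward circuit with n_in input wires and n_out output wires."""

    def __init__(self, n_in, n_out, fn):
        self.n_in, self.n_out, self.fn = n_in, n_out, fn

    def __call__(self, *xs):
        assert len(xs) == self.n_in
        return tuple(self.fn(*xs))


def seq(f, g):
    """Serial composition f ; g."""
    assert f.n_out == g.n_in
    return Circ(f.n_in, g.n_out, lambda *xs: g(*f(*xs)))


def par(f, g):
    """Parallel composition f (x) g."""
    return Circ(f.n_in + g.n_in, f.n_out + g.n_out,
                lambda *xs: f(*xs[:f.n_in]) + g(*xs[f.n_in:]))


def ident(n):
    """Identity on n wires."""
    return Circ(n, n, lambda *xs: xs)


def swap(m, n):
    """Braiding: moves the first m wires past the following n wires."""
    return Circ(m + n, m + n, lambda *xs: xs[m:] + xs[:m])


def same(f, g):
    """Equality of circuits, checked by exhaustive simulation on bits."""
    return (f.n_in, f.n_out) == (g.n_in, g.n_out) and all(
        f(*bits) == g(*bits) for bits in product((0, 1), repeat=f.n_in))


xor = Circ(2, 1, lambda a, b: (a ^ b,))
fork = Circ(1, 2, lambda a: (a, a))
inv = Circ(1, 1, lambda a: (1 - a,))

# Associativity of serial composition.
print(same(seq(seq(fork, xor), inv), seq(fork, seq(xor, inv))))
# Functoriality: (f (x) g); (f' (x) g') = (f; f') (x) (g; g').
print(same(seq(par(fork, inv), par(xor, inv)),
           par(seq(fork, xor), seq(inv, inv))))
# Symmetry: swapping twice is the identity.
print(same(seq(swap(1, 2), swap(2, 1)), ident(3)))
# Naturality of the braiding with respect to fork and xor.
print(same(seq(par(fork, xor), swap(2, 1)),
           seq(swap(1, 2), par(xor, fork))))
```

Exhaustive simulation is used only because the examples are tiny; the point of the categorical presentation is that such equalities follow from the equational theory rather than from testing.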
Let us rework our initial example in this language. The only basic interface is 1, the one-bit port. To avoid special symbols let us write "." for $\otimes$, 1 for $id_1$, gam for $\gamma_{1,1}$, e for $\varepsilon_1$ and n for $\eta_1$. In addition to the gates used explicitly in the example (xor, c) let us also use fork, a "gate" with one input and two outputs consisting of a forking wire. The module dw can be written as:

    1.fork.1; 2.e.e.2; 1.xor.gam.xor.1; c.1.1.c; fork.1.1.fork; 1.n.n.1

Note that no instance names for modules, ports or wires need to be introduced, a hallmark of higher-level languages.
But the reason we introduced compact closed combinators is that they make functional structure apparent! In the abstract setting of category theory, functions are said to exist if and only if for any interface $A \Rightarrow B$ there exists a special circuit called $eval_{A,B}$:
It is a property of compact closed categories that such a circuit always exists and is defined as $eval_{A,B} = \varepsilon_A \otimes id_B$ and $h$ is unique, defined by
Diagrammatically this property is expressed as:
In terms of functional programming concepts, if f is a function then h is its curried version and eval is function application. Categories of diagrams which have a notion of function are called closed monoidal and we just showed how they are constructed in a canonical way out of compact closed combinators. Because a closed monoidal category can be constructed out of a compact closed one it is said to have "less structure". From now on we will work directly in the closed monoidal setting, as not all the rich structure of the compact closed category can be used in the context of higher-level synthesis.
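The functional-programming reading of this property can be sketched in Python. The argument order of `f` and of `eval_` is an assumption made for the illustration, and the names `curry` and `eval_` are hypothetical; the sketch only shows the defining relationship between a function, its curried version and application.

```python
def curry(f):
    """Turn f, taking a pair (a, c), into h, taking c and returning a function of a."""
    return lambda c: (lambda a: f(a, c))


def eval_(a, g):
    """Function application: feed argument a to the function g."""
    return g(a)


def f(a, c):
    return 10 * a + c


h = curry(f)

# Currying followed by application recovers the original function.
print(all(eval_(a, h(c)) == f(a, c) for a in range(4) for c in range(4)))
```

The uniqueness of h corresponds to the fact that any other function satisfying the same equation for all a and c must agree with `curry(f)` pointwise.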
Which brings us to the conclusion of this section, which may seem somewhat surprising:
In the language of diagrams there exists a canonical notion of functional interface, which arises from a particular discipline of the interconnections.
IV. PROGRAM SYNTHESIS FROM GAME SEMANTICS
In the previous section we saw how we can write diagrams compositionally using the syntax of currying and application.
If we have a programming language where programs are constructed using these basic operations then obviously we can apply the method to synthesising complex circuits out of basic ones. It is not surprising that this way of constructing programs is suitable for functional languages, but in fact imperative programs can also be disguised into a useful functional syntax.
Consider for example a typical imperative statement such as x = y + 1. It is constructed out of variables, assignment, dereferencing, addition and constants, with the assignment and the addition denoted by special lexical tokens and used in infix form. But we can just as well write it as assg (x, add(deref y, 1)), using identifiers in prefix notation, a functionalized syntax. Even binders such as local variables can be expressed functionally. The block integer x in C can be functionalized as integer(λx.C) where x is bound by λ and the local variable constructor is a higher-order function. These transformations are standard.
To give focus to our presentation we will use a particular language as a case study. It combines a functional fragment based on the lambda calculus with the simple imperative language of locally-scoped block variables, iteration and branching. Its data types are finite (fixed-size) integers. This language is a simplified version of Reynolds's Syntactic Control of Interference (SCI) [16].
The primitive types of the language are commands, memory cells and (boolean) expressions: $\sigma ::= com \mid var \mid exp$. Additionally, the language contains function types and products: $\theta ::= \theta \times \theta \mid \theta \to \theta \mid \sigma$.
What is peculiar about the types above is that pairs of terms may share identifiers but functions may not share identifiers with their arguments. Terms have types, described by typing judgements of the form $\Gamma \vdash M : \theta$, where $\Gamma = x_1 : \theta_1, \ldots, x_n : \theta_n$ is a variable type assignment, $M$ is a term and $\theta$ the type of the term. The syntax of the language is formally specified by the following rules:
The language contains the standard constructs for structured state manipulation and control. It is convenient to present these constructs in a functional form:
Product has syntactic precedence over arrow, which associates to the right. For example, a term with conventional syntax, such as integer x; x = 0; if (x <> y) x = x + 1; would be written as newvar(λx. seq(asg(x, 0), if(neq(der x, der y), asg(x, add(der x, 1)), skip))).
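To make the functional presentation concrete, the constants used in the example can be given a simple software reading in Python. This is an ordinary interpreter written for illustration, not the hardware semantics of the paper: commands and expressions are modelled as zero-argument callables, variables as mutable cells, new variables are assumed to start at 0, and a non-zero guard is assumed to select the first branch. The helper `const` and the trailing `der(x)` that returns the final value of x are additions made only so that the result can be observed.

```python
class Var:
    """A memory cell; new cells are assumed to start at 0."""

    def __init__(self, value=0):
        self.value = value


def skip():
    """The command that does nothing."""
    return None


def const(k):
    return lambda: k


def der(v):
    return lambda: v.value


def add(e1, e2):
    return lambda: e1() + e2()


def neq(e1, e2):
    return lambda: int(e1() != e2())


def asg(v, e):
    def run():
        v.value = e()
    return run


def seq(c, d):
    def run():
        c()
        return d()
    return run


def if_(guard, c1, c2):
    def run():
        return c1() if guard() != 0 else c2()
    return run


def newvar(body):
    """Higher-order binder: body is a function from a fresh Var to a term."""
    def run():
        return body(Var())()
    return run


def program(y):
    # newvar(\x. seq(asg(x, 0), if(neq(der x, der y), asg(x, add(der x, 1)), skip)))
    # followed by der(x) so that the final value of x is returned.
    return newvar(lambda x:
                  seq(asg(x, const(0)),
                      seq(if_(neq(der(x), der(y)),
                              asg(x, add(der(x), const(1))),
                              skip),
                          der(x))))


print(program(Var(5))())  # x ends at 1 because 0 differs from 5
print(program(Var(0))())  # x stays 0
```

Every construct is an identifier applied to arguments, and the local variable is introduced by a higher-order function taking a lambda, exactly as in the functionalised syntax above.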
The interesting structural feature of the type system SCI is allowing sharing of identifiers (contraction) in product formation but disallowing it in function application. Reynolds was interested in this rule to eliminate covert interference between terms that ostensibly do not share identifiers, hence the name. This type system facilitates reasoning about program correctness and it eventually led to the development of bunched logic [17] and separation logic [18].
This restriction can be exploited in several ways, as seen in the different type signatures of sequential (uncurried) versus parallel (curried) composition. This makes terms such as λx.x; x legal, while λx.x || x is illegal. Another consequence of this restriction is that nested function application, as in λf.λx.f(f(x)), is also banned.
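The sharing restriction can be approximated by a purely syntactic check, sketched below in Python. This is not the SCI type system, only a simplified test of its key side condition: in an application the function and the argument must have disjoint free identifiers, while the two components of a pair may share. The tuple encoding of terms and the names `free_ids` and `respects_sci` are assumptions of the sketch.

```python
# Terms are nested tuples:
#   ('var', x)  ('const', name)  ('lam', x, body)  ('app', fun, arg)  ('pair', m, n)

def free_ids(t):
    """The set of free identifiers of a term."""
    tag = t[0]
    if tag == 'var':
        return {t[1]}
    if tag == 'const':
        return set()
    if tag == 'lam':
        return free_ids(t[2]) - {t[1]}
    if tag in ('app', 'pair'):
        return free_ids(t[1]) | free_ids(t[2])
    raise ValueError(tag)


def respects_sci(t):
    """True if no application shares identifiers between function and argument."""
    tag = t[0]
    if tag in ('var', 'const'):
        return True
    if tag == 'lam':
        return respects_sci(t[2])
    ok = respects_sci(t[1]) and respects_sci(t[2])
    if tag == 'app':
        ok = ok and not (free_ids(t[1]) & free_ids(t[2]))
    return ok


x, f = ('var', 'x'), ('var', 'f')

# \x. x; x   with uncurried seq applied to the pair (x, x)
seq_xx = ('lam', 'x', ('app', ('const', 'seq'), ('pair', x, x)))
# \x. x || x   with curried par applied to x and then to x
par_xx = ('lam', 'x', ('app', ('app', ('const', 'par'), x), x))
# \f. \x. f (f x)
twice = ('lam', 'f', ('lam', 'x', ('app', f, ('app', f, x))))

print(respects_sci(seq_xx), respects_sci(par_xx), respects_sci(twice))
```

The uncurried sequential composition passes because sharing happens inside a pair, whereas the curried parallel composition and the nested application both share an identifier across an application.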
A programming language such as SCI is such that every program can be written out of a set of functional constants glued together using function application. Above we have seen how currying and application are defined in a monoidal (or compact) closed setting. The next question we must address is how to concretely interpret programming language types and constants.
It is a well-known fact that relying on the primitive concepts of channel, event and communication makes process calculi good intermediate abstractions for hardware, and refining processes into hardware-level representation has been extensively studied [19] . On the other hand, there has been significant research work regarding the encoding of (prototypical) functional programming languages into process calculi [20] .
Game Semantics is a process-calculus-like model for programming languages introduced in the mid-90s, which proved to be extremely successful at giving precise interpretations for a variety of languages, thus solving long-standing open problems in the theory of programming languages [21] . Process calculi are versatile but have little structure, whereas game semantics encapsulates the right mathematical structure needed to interpret programming languages. Like process calculi, it is event-oriented and can be refined into hardware. Hardware compilation of programming languages via game semantics is a natural approach based on solid foundational results.
In its most concrete formulation, game semantics can give trace-like interpretations for many programming language constructs. This is the style in which we will use it below. Moreover, the mathematical structure of game semantics is closed monoidal, therefore compatible with the functional interface model we are using here.
A. Synthesis of SCI
Each type corresponds to a circuit interface, defined as a list of ports, each defined by data bit-width and a polarity. Every port has a default one-bit control component. For example we write an interface with n-bit input and n-bit output as $I = (+n, -n)$. If a port has only control and no data we write it as $+0$ or $-0$, depending on polarity.
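An interface for type $\theta$ is written as $[\![\theta]\!]$, defined as follows:

$[\![com]\!] = (+0, -0)$, $[\![exp]\!] = (+0, -n)$, $[\![var]\!] = (+n, -0, +0, -n)$, $[\![\theta \times \theta']\!] = [\![\theta]\!] \otimes [\![\theta']\!]$, $[\![\theta \to \theta']\!] = [\![\theta]\!]^* \otimes [\![\theta']\!]$.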
The interface for com has two control ports, an input for starting execution and an output for reporting termination. The interface for exp has an input control for starting evaluation and data output for reporting the value. Variables var have data input for a write request and control output for acknowledgment, and control input for a read request along with data output for the value. Diagrammatically, a list of ports will correspond to ports read from left to right and from top to bottom. We indicate ports of zero width (only the control bit) by a thin line and ports of width n by an additional thicker line (the data part). For example a circuit of interface $[\![com \to com]\!] = (-0, +0, +0, -0)$ can be drawn in either of these two ways:
The unit-width ports are used to transmit events, represented as the value of the port being held high for one clock cycle. The n-width ports correspond to data lines. We will work under the assumption that the event on the unit port is a control signal indicating the data on the data line is valid.
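The translation from types to interfaces can be sketched as a short recursive function in Python. The ground interfaces follow the description in the text; the tuple encoding of types, the default width of 8 bits and the names `dual`, `interface` and `show` are assumptions of this sketch.

```python
def dual(iface):
    """Reverse the polarity of every port."""
    return [('-' if pol == '+' else '+', width) for (pol, width) in iface]


def interface(ty, n=8):
    """Interface of a type: a list of (polarity, data width) ports.

    Ground types are the strings 'com', 'exp', 'var'; compound types are
    tuples ('x', a, b) for products and ('->', a, b) for functions.
    """
    if ty == 'com':
        return [('+', 0), ('-', 0)]
    if ty == 'exp':
        return [('+', 0), ('-', n)]
    if ty == 'var':
        return [('+', n), ('-', 0), ('+', 0), ('-', n)]
    op, a, b = ty
    if op == 'x':
        return interface(a, n) + interface(b, n)
    if op == '->':
        return dual(interface(a, n)) + interface(b, n)
    raise ValueError(op)


def show(iface):
    """Print widths symbolically: 0 for control-only ports, n otherwise."""
    return '(' + ', '.join(pol + ('n' if w else '0') for (pol, w) in iface) + ')'


print(show(interface(('->', 'com', 'com'))))
print(show(interface(('->', 'var', 'exp'))))
print(show(interface(('->', ('x', 'com', 'exp'), 'exp'))))
```

The three printed interfaces agree with the ones quoted in the text for com → com, var → exp and com × exp → exp.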
Below, for each language constant we will give the game-semantic interpretation expressed in a synchronous trace-like representation [22] along with the (obvious) circuit implementation. We only explain the semantics intuitively. For the technical details the interested reader is referred to the literature, as discussed in the conclusion.
For the purpose of giving the semantics and its representation the circuit interface will correspond to the type; the semantics is a set of traces over the ports of the interface, i.e. a set of waveforms where each event is a port being set to logical value high. We denote an event on port $k$ of interface $I = (p_1, \ldots, p_m)$ by $n_k$, where $n$ is the data value; if the data-width is 0 we write $*_k$. We use $\langle m, m' \rangle$ to indicate the simultaneity of events $m$ and $m'$. We use $m \cdot m'$ to indicate concatenation, i.e. event $m'$ happens after $m$, but not necessarily in the very next clock cycle. We define $m \bullet m' = \{\langle m, m' \rangle, m \cdot m'\}$. The semantic interpretation is given by a valuation $[\![-]\!]_s$.
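The trace notation can be mirrored by a small Python encoding, given here only to fix intuitions. It assumes that a trace is a tuple of steps, each step being a set of simultaneous events, and that the number of idle cycles between consecutive steps is abstracted away; the names `ev`, `simultaneous`, `then` and `bullet` are hypothetical.

```python
def ev(port, value='*'):
    """An event on a port; '*' stands for a control-only event."""
    return (port, value)


def simultaneous(m1, m2):
    """Both events occur in the same clock cycle."""
    return (frozenset({m1, m2}),)


def then(m1, m2):
    """Concatenation: m2 occurs in some later cycle than m1."""
    return (frozenset({m1}), frozenset({m2}))


def bullet(m1, m2):
    """The set containing both the simultaneous and the sequential trace."""
    return {simultaneous(m1, m2), then(m1, m2)}


# A request on port 1 acknowledged on port 2 in the same cycle.
immediate_ack = {simultaneous(ev(1), ev(2))}

print(len(bullet(ev(1), ev(2))))
print(immediate_ack <= bullet(ev(1), ev(2)))
```

A semantic interpretation is then simply a Python set of such traces over the ports of an interface.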
The simplest ground-type constant is $skip : com$, with interface $[\![com]\!] = (+0, -0)$. Its interpretation is $[\![skip]\!]_s = \{\langle *_1, *_2 \rangle\}$ and the circuit representation is:
[Figure: circuit SKIP]
Intuitively the input port of a command is a request to run the command and the output port is the acknowledgment of successful completion. In the case of skip the acknowledgment is immediate.
The other ground-type constant is integer $k : exp$, with $[\![exp]\!] = (+0, -n)$. The semantic interpretation is $[\![k]\!]_s = \{\langle *_1, k_2 \rangle\}$, with the obvious circuit implementation:
[Figure: circuit N]
Intuitively the input port of an expression is a request to evaluate the expression and the output port is the data result and a control signal indicating successful evaluation. In the case of a constant k the acknowledgment is immediate and the data is connected to a fixed bit pattern.
Sequential composition $seq : com \times exp \to exp$ has interface $[\![com \times exp \to exp]\!] = (-0, +0, -0, +n, +0, -n)$ and interpretation $[\![seq]\!]_s = \{ *_5, \ldots$
The circuit representation is:
Above, D denotes a one-clock delay (D flip-flop). A sequencer SEQ propagates the request to evaluate a command in sequence with an expression by first sending an execute request to the command, then to the expression upon receiving the acknowledgment from the command. The result of the expression is propagated to the calling context. Note the unit delay placed between the command acknowledgment and the expression request. Its presence is a necessary artefact of correctly representing asynchronous processes synchronously and it cannot be optimised away, even though removing it would result in a circuit with lower latency.

Assignment and dereferencing are

$asg : var \times exp \to com$, $[\![var \times exp \to com]\!] = (-n, +0, -0, +n, -0, +n, +0, -0)$
$der : var \to exp$, $[\![var \to exp]\!] = (-n, +0, -0, +n, +0, -n)$
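The behaviour of the sequencer described above, and in particular the effect of the unit delay, can be sketched as a cycle-level simulation in Python. To keep the sketch self-contained the command is assumed to be skip (acknowledging in the same cycle) and the expression an integer constant k (answering in the same cycle); the function name `simulate_seq` and the signal names are hypothetical.

```python
def simulate_seq(request_cycles, k, n_cycles=4):
    """Simulate SEQ wired to a skip command and a constant expression k.

    Returns one tuple per clock cycle:
    (cycle, request, command ack, expression request, result).
    """
    d = 0  # state of the D flip-flop between command ack and expression request
    log = []
    for t in range(n_cycles):
        req = 1 if t in request_cycles else 0
        com_req = req                     # request forwarded to the command
        com_ack = com_req                 # skip acknowledges immediately
        exp_req = d                       # delayed acknowledgment starts the expression
        exp_val = k if exp_req else None  # the constant answers immediately
        result = exp_val                  # value propagated to the calling context
        log.append((t, req, com_ack, exp_req, result))
        d = com_ack                       # clock edge: the flip-flop samples the ack
    return log


for row in simulate_seq({0}, k=7):
    print(row)
```

With a request in cycle 0 the command is acknowledged in cycle 0, the expression is requested in cycle 1 and the result 7 appears in cycle 1, so the request and the result are never simultaneous.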
The circuit representations are, respectively:
The variable type has four ports: writing data (n bits), acknowledging a write (0 bits), requesting a read (0 bits) and providing data (n bits). Assignment is a sequencing of an evaluation of the integer argument with a write request to the variable; the unused variable ports are grounded. Dereferencing is simply a projection of a variable interface onto an expression interface by propagating the read-part of the interface and blocking the write part. Arithmetic and logic operators have the generic definition
R above is a register, with "load" (middle) and "reset" (left) input ports. The (combinational) circuit OP implements the operation. Note that the value of the first operand is saved in the register because expressions can change their value in time due to side-effects. Branching is defined and interpreted as
The corresponding circuit is:
Above, Mux is a (combinational) n-bit multiplexer which selects one data path or the other depending on the control signal. X is a merge of two control signals (OR or exclusive-OR) and T is a de-multiplexer which propagates the input control signal to the first or second output, depending on whether the data value is zero or nonzero. As before, the delay D is necessary for correctness considerations.
The final typical imperative construct, iteration, is
The circuit is:
The iterator issues an evaluation request to the first argument, which is an expression and, if the result is zero, it will evaluate the second argument then repeat the cycle until the first argument returns a non-zero value.
The local variable binder is a higher-order constant.
The circuit with this behaviour is basically just a register:
The middle port on the register is the "load" port and the rightmost one is the "reset".

The "glue" that puts terms together into larger terms is application, which is interpreted in the standard way in a closed monoidal category:
Diagrammatically this is:
Finally, we need a rule for identifiers, $[\![x : A \vdash x : A]\!] = id_A$.

Now we can compile higher-order functional and imperative programs into hardware. Moreover, these programs can be open, i.e. with missing components. These components must only be provided in the second phase of synthesis, which corresponds to linking in a conventional compiler. Take, for example, a program for doing in-place map on a data structure equipped with an iterator provided in a separate library. The interface of the iterator is init : com to initialise the iterator, curv : var to write to the current data location, curr : exp to provide the data stored in the current location, next : com to advance to the next location and more : exp which returns 0 only if the end of the data structure has not been reached. In-place map is:
V. ACCESS PROTOCOLS FOR HARDWARE FIMS
In this section we will discuss the key issue of sharing, which in the programming language manifests itself via reuse of identifiers. Notice that in the previous example each identifier was used precisely once. Such programs are said to be linear, and their expressiveness is below that of conventional programs. For example, programs such as x := x + 1 are not linear because x is used twice.
Unlike functional interfaces, which arise canonically and statically out of the diagrammatic structure of hardware, sharing must be added dynamically. The abstract categorical framework is still helpful, because it gives us canonical criteria for evaluating whether sharing is correctly implemented. In other words, it gives an abstract specification.
Sharing corresponds to the categorical notion of Cartesian product. This can be defined in several ways, but for us the starting point is the monoidal tensor ⊗. When does it behave like a product? The answer to this question was elaborated in the original Geometry of Synthesis paper [23] and we shall quickly revisit it here.
The first requirement is a global one. For any interface A, all circuits of shape f : A → I must behave the same. Recall that I is the trivial, empty interface. Note that this is immediately incompatible with compact closed structure, where the notions of "left" and "right" of the diagram are only conventional. However, this is compatible with closed monoidal structure. We enforce this property via a protocol-like convention, requiring that every interface have a special port, called initial, which is always situated on the right of the arrow. Indeed, if we examine the behaviour of all constants in the previous section the first signal is always an input and always on the right side of the arrow. Thus, a circuit of shape A → I can never be activated because the initial input must come from the right side, which has no ports! We write this circuit as ! : A → I and we draw it as
[Diagram: the circuit ! : A → I]
The second requirement is the existence of the so-called diagonal circuit: for any interface A, a circuit δ_A : A → A ⊗ A having the projection property δ_A ; (! ⊗ id_A) ; α = id_A, where α : I ⊗ A ≅ A. Diagrammatically, this equation means that the circuit below should always behave just like the identity:
Intuitively, a diagonal is meant to share one circuit of interface A from two places. However, if one of these places is not actually used and all its ports are grounded then the resulting circuit should behave just like an identity on A. This suggests the following implementation for δ_A. We know that the first event will be an input on one of the initial ports in the A's on the right. The diagonal circuit should remember this information in internal state, then relay the inputs and outputs between that A and the A on the left. Whenever a new initial event on the right is recorded, that information is reset and the relay resumes between the shared occurrence on the left and the A which last issued an initial event. For the simplest interface, com, the diagonal is:
SR is a set-reset flip-flop which can work in pass-through mode if low latency is required.
Note that the requirement that the first move is initial is essential in proving that the projection requirement is actually met by every diagonal at every type.
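The relaying behaviour described for the diagonal can be illustrated with a small behavioural model in software. This is only a sketch under stated assumptions, not the circuit itself: the class name `ComDiagonal`, the event labels and the numbering of the two right-hand copies as 0 and 1 are all hypothetical, and com is assumed to have just one initial input ("run") and one final output ("done").

```python
class ComDiagonal:
    """Behavioural model of a diagonal for an interface with run/done ports.

    The two occurrences of the interface on the right are numbered 0 and 1;
    the single shared occurrence is on the left. (Illustrative names only.)
    """

    def __init__(self):
        # Internal state: which right-hand copy issued the last initial event.
        self.active = None

    def run(self, copy):
        """Initial input on right-hand copy 0 or 1, relayed to the left."""
        if copy not in (0, 1):
            raise ValueError("copy must be 0 or 1")
        self.active = copy  # a new initial event overwrites the old record
        return ("left", "run")

    def done(self):
        """Final output from the shared circuit, relayed to the recorded copy."""
        if self.active is None:
            raise RuntimeError("'done' before any 'run' violates the protocol")
        return ("right%d" % self.active, "done")


# Projection: if copy 1 is never used, the model is a plain relay for copy 0.
d = ComDiagonal()
trace = [d.run(0), d.done(), d.run(0), d.done()]
```

With copy 1 left unused, `trace` only ever mentions the left interface and copy 0, which is the software analogue of the diagonal collapsing to the identity.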
The third requirement is about the interaction between the diagonal and every other circuit, and it is an inlining property. Formally, for any circuit f : A → B, this property is expressed by the equation f ; δ_B = δ_A ; (f ⊗ f), which is represented diagrammatically by the equality of these two diagrams:
Intuitively, this expresses the fact that one shared instance of f should behave exactly like two "inlined" copies of f. Note that in the case of the inlined copies, their free identifiers, which are represented by the ports on the left, now have to be shared.
In order to validate this requirement a further constraint must be added to the protocol representing the behaviour of a valid f . To each initial input event must correspond a final output event and, moreover, after the issuing of the final output event the circuit must return to its initial state. In other words, all circuits must be self-resetting. Indeed, by inspection we can see that all circuits we use are self-resetting.
The reason why non-self-resetting circuits cannot be shared is quite clear from the diagram. Suppose f counts its inputs (received on the right) and, after each input, outputs its state (on the left). The shared f will receive all the inputs, whereas each of the two "inlined" copies will receive only those issued at its own site, so the behaviour cannot be the same.
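The counting argument can be checked with a short simulation. The sketch below is illustrative only: the classes `Counter` and `SelfResetting` and the helpers `shared` and `inlined` are hypothetical names, and a "site" (0 or 1) stands for one of the two places from which f is used.

```python
class Counter:
    """Not self-resetting: the count persists across activations."""

    def __init__(self):
        self.count = 0

    def request(self):
        self.count += 1
        return self.count


class SelfResetting:
    """Returns to its initial state after producing its output."""

    def __init__(self):
        self.count = 0

    def request(self):
        self.count += 1
        out = self.count
        self.count = 0  # back to the initial state after the final output
        return out


def shared(make, sites):
    """One instance serves every site."""
    f = make()
    return [f.request() for _ in sites]


def inlined(make, sites):
    """Each site has its own private copy."""
    copies = {0: make(), 1: make()}
    return [copies[site].request() for site in sites]


sites = [0, 1, 0, 1]
counter_shared = shared(Counter, sites)
counter_inlined = inlined(Counter, sites)
reset_shared = shared(SelfResetting, sites)
reset_inlined = inlined(SelfResetting, sites)
```

For the counter, the shared and inlined output sequences differ, whereas for the self-resetting variant they coincide, which is what the inlining equation asks for.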
In order for the compiler to correctly implement sharing, a diagonal circuit with the projection property must exist and each circuit must have an input-output behaviour consistent with a particular protocol, which can be described informally as having two properties:
Init. Each circuit must have designated initial inputs which must be activated first.
Reset. Each circuit must have designated final outputs which must return the circuit to the initial state.
It is easy to see that all basic circuits have this property. Moreover, a very important property of this protocol is that it is compositional, i.e. legality is preserved by serial and parallel composition. If f : A → B and g : B → C are legal circuits then f ; g is a legal circuit, and if f : A → B and f′ : A′ → B′ are legal circuits then f ⊗ f′ is a legal circuit. Compositionality ensures that we can arbitrarily compose circuits, starting from the basic ones, without ever violating the protocol.
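The Init and Reset rules can be read as a simple check on event traces. The following is a minimal sketch under assumptions that are not part of the original text: events are plain strings, "init" stands for an initial input, "final" for a final output, any other string is an intermediate event, and a second "init" before the matching "final" is treated as illegal.

```python
def is_legal(trace):
    """Return True if the trace obeys the Init and Reset rules.

    A trace is a list of event labels. Incomplete traces (an activation
    that has not yet produced its final output) are accepted.
    """
    active = False
    for event in trace:
        if event == "init":
            if active:
                return False  # assumption: no re-entrant activation
            active = True
        elif not active:
            return False      # Init: nothing may happen before an initial input
        elif event == "final":
            active = False    # Reset: the circuit is back in its initial state
    return True


ok = is_legal(["init", "q", "a", "final", "init", "final"])
bad = is_legal(["q", "init", "final"])
```

Here `ok` is true because every activation starts with an initial input and ends with a final output, while `bad` is false because an event occurs before any initial input.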
The concept of input-output behaviour governed by a compositional protocol is the basic idea of game semantics. In the parlance of game semantics an event is a move, with inputs being moves of the opponent and outputs of the proponent. A trace of events is a play and the protocol governing the legal behaviour is a general notion of legal play, i.e. the rules of the game.
We can now add an interpretation of sharing, which in SCI is via product, to our compiler:
Diagrammatically this is:
In the compiler we can now compile programs with sharing such as λf λx.f x; f x : com, which yields the sub-circuit framed in the diagram below. F and X are the bodies of f and x which must be supplied at link-time to yield an executable circuit.
VI. ADVANCED ISSUES
SCI is a remarkably expressive imperative and functional programming language which is nevertheless compilable into static (RAM-free) circuits. One of the main restrictions of SCI is that programs such as f (f (x)) cannot be typed, because identifiers (f in this case) can only be shared in product formation and not in function application.
Assuming for simplicity that f : com → com, to see why the (illegal) circuit for f (f (x)) malfunctions consider what happens when we apply (λf λx.f (f (x)))(λc.c; c)skip, as synthesised in the circuit below. The flow of information through the diagonals is indicated with dashed lines to simplify interpretation.
The flow of information will reach the port X of δ com→com three times. The first time will take path (1), second time (2) and third time (2) again, closing an infinite causal loop through the circuit and causing divergence. The diagonals are not designed to handle nested interactions and their behaviour if used inappropriately is undefined, arising out of contingencies of their implementation. The type system also restricts sharing of identifiers in parallel contexts, as in λc.c||c, for similar reasons.
A. Serialization
In larger programs any restriction on sharing is awkward. The only way to overcome this restriction is by manually replicating identifiers.
Ideally, we want to allow the programmer to write programs such as λf λx.f (x); f (x) or λf λx.f (x)||f (x). However, the type system which forms the basis of the programming language allows sharing of identifiers only in sequential contexts, hence the first term is typeable but the second is not. The programmer needs to serialize the term by hand, writing something like λf₁f₂λx₁x₂.f₁(x₁)||f₂(x₂); in complicated terms this quickly becomes very difficult to handle. This becomes even more difficult if we want to apply this function to terms which themselves require transformation. For example, if applied to λc.c||c, which needs itself to change to λc₁c₂.c₁||c₂, the original function should actually
Serialization [24] is a technique for the automatic conversion of programs with unrestricted sharing into SCI via systematic replication. The first step is trying to automatically type-check the program using a resource-oriented type system called Syntactic Control of Concurrency (SCC), which is a generalisation of SCI [25]. In SCC function arguments are decorated with integers indicating how many times the function uses them in a concurrent or nested way. By solving a constraint system these bounds can be automatically calculated.
Consider for example the program λf λx.f (f (x)). Suppose we assume the type of f to be com² → com, i.e. function f can use its argument in two concurrent contexts. Solving the system of constraints gives the following typing:
This means that the term will use f in at most 3 non-sequential contexts and x in 4. The next step is the systematic replication of each identifier according to the number of times it is used concurrently. Informally, each argument of f must be duplicated, so each f x must become f x₁ x₂. But f x is itself an argument of f, so it must itself be duplicated, resulting in the term λf₁f₂f₃x₁x₂x₃x₄.f₁(f₂ x₁ x₂)(f₃ x₃ x₄), which is SCI-legal and has the same behaviour as the original term. The formal algorithm is more complicated and is described in loc. cit.
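The informal replication step for this example can be sketched in a few lines. This is not the formal serialisation algorithm of [24], only an illustration of the counting for nested applications of a single function. The term encoding (a string for an identifier, a pair for an application), the helper names `fresh` and `serialise`, and the use of one uniform bound `n` for every argument are all assumptions made for the sketch.

```python
def fresh(name, counters):
    """Return the next numbered copy of an identifier, e.g. f1, f2, ..."""
    counters[name] = counters.get(name, 0) + 1
    return "%s%d" % (name, counters[name])


def serialise(term, n, counters):
    """Replicate identifiers so that each copy is used exactly once.

    A term is either a string (an identifier) or a pair (f, arg) standing
    for the application f(arg). Every argument is replicated n times.
    """
    if isinstance(term, str):
        return fresh(term, counters)
    f, arg = term
    head = fresh(f, counters)
    copies = [serialise(arg, n, counters) for _ in range(n)]
    return (head, *copies)


counters = {}
# f (f x), with f assumed to use its argument in two concurrent contexts
result = serialise(("f", ("f", "x")), 2, counters)
```

For this input `result` is the nested tuple `("f1", ("f2", "x1", "x2"), ("f3", "x3", "x4"))` and `counters` records three copies of f and four of x, matching the bounds and the replicated term given in the text.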
Note that serialisation always favours sharing over replication. However, the inlining property of sharing says that replication is always an option. Indeed, the sharing circuitry (the diagonal) can be fairly expensive and for small functions it might be preferable to replicate rather than share. Almost all performance parameters can be calculated at compile-time, e.g. footprint (number of gates) or latency (longest delay) and complex optimisation strategies can be implemented to meet set design targets.
B. Recursion
Since recursion is operationally unfolded into nested application, SCI does not support unrestricted recursion. However, a fix point operator can be applied to a closed term, i.e. a function without free identifiers:
Recursion can be supported in a very restricted way by explicitly unfolding a computation in space. This approach can only be taken if the recursion has a fixed, known, and relatively small depth. This style of recursion is supported by Lava.
A more general type of recursion can be supported by unfolding the computation in time. As a first step towards implementing recursion we replace all occurrences of registers R and SR with indexed versions R_i and SR_i, with i being the address for both the register load and the read:
Such state can be automatically synthesised into block-RAM by standard FPGA tools, therefore it is highly distributed and low-latency and it presents no computational bottleneck. An unfolding of the fix-point can be conceptualised like this:
Every instance of F uses a different index, but is otherwise identical. This means that we can replace the fixed indices with a counter and fold all the instances into one single instance, indexed by the counter. The value of the counter will indicate what "virtual" instance of F is active and will be used as an index into the registers. The fix-point circuit will increase this global counter whenever a recursive call is made and decrease it when a recursive return is made. When the counter is 0 the recursive return will be to the environment.
More details of the implementation of recursion, including testing a recursive implementation of a Fibonacci sequence generator, are discussed in [26].
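The idea of folding all instances of F into one, with a counter selecting the active "virtual" instance, can be mimicked in software. The sketch below is a loose analogy rather than a description of the synthesised circuit: the names `IndexedRegister`, `FixPoint` and `sum_to` are hypothetical, the register file has an assumed fixed depth, and the host language's own call stack is used to sequence the calls.

```python
class IndexedRegister:
    """Stand-in for an indexed register: one cell per virtual instance."""

    def __init__(self, depth):
        self.cells = [0] * depth  # assumption: fixed number of cells

    def load(self, index, value):
        self.cells[index] = value

    def read(self, index):
        return self.cells[index]


class FixPoint:
    """Folds all unfolded instances of `body` into one, selected by a counter."""

    def __init__(self, body, depth):
        self.body = body
        self.counter = 0  # index of the active virtual instance
        self.local = IndexedRegister(depth)

    def run(self, arg):
        """Activation from the environment: runs at counter 0."""
        return self.body(self, arg)

    def recurse(self, arg):
        """Recursive call: increase the counter, run the body, decrease on return."""
        self.counter += 1
        result = self.body(self, arg)
        self.counter -= 1
        return result


def sum_to(fix, n):
    """Hypothetical recursive body computing 0 + 1 + ... + n."""
    fix.local.load(fix.counter, n)   # the local variable lives at the current index
    if n == 0:
        return 0
    rest = fix.recurse(n - 1)        # nested instances use other indices
    return fix.local.read(fix.counter) + rest


fix = FixPoint(sum_to, depth=8)
total = fix.run(5)
```

After the run, `total` is 15 and the counter is back at 0, so the last return is to the environment. Each nested activation stored its local value in a different cell, so the value read after the recursive call is the one written before it.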
VII. CONCLUSION
In this tutorial we have reviewed the Geometry of Synthesis technique for compositional higher-level synthesis. The main idea was introduced in [23] and further refined in [27] . Both these papers discussed the synthesis of asynchronous circuits. The technique of serialisation discussed earlier was also presented in the context of the synthesis of asynchronous circuits [24] .
The reason for this early focus on the synthesis of asynchronous circuits was that the underlying game semantic models used in GOS [25] , [28] are asynchronous and therefore map naturally into asynchronous circuitry. A naive mapping of asynchronous into synchronous circuits is possible, but is expensive and leads to high-latency circuits. The issue of producing correct low-latency circuits was settled in [29] , [22] . The most recent development is the handling of recursion [26] , which also introduces the sequential representation discussed here.
At this stage we believe that all the theoretical requirements for a modern higher-level synthesis compiler are in place and a prototype compiler is being developed. The next step will be a focus on compiler optimisations, including automated parallelisation, but especially language support for pipelining via types.
