Static multi-issue machines, such as traditional Very Long Instruction Word (VLIW) architectures, move complexity from the hardware to the compiler. This is motivated by the ability to support high degrees of instruction-level parallelism without requiring complicated scheduling logic in the processor hardware. The simpler control hardware results in reduced area and power consumption, but makes it challenging to engineer a compiler with good code-generation quality.
INTRODUCTION

Static multi-issue machines such as traditional Very Long Instruction Word (VLIW) architectures move complexity from the hardware control logic to the compiler [Fisher 1983]. These compiler-oriented architectures are motivated by their ability to support high degrees of instruction-level parallelism without dedicating a major part of the chip area to complicated scheduling logic in the processor hardware. The simpler control hardware results in reduced area and power consumption, but leads to the challenge of engineering a compiler that can utilize the processor efficiently.
Transport Triggered Architectures (TTAs) and other so-called exposed datapath architectures [Jääskeläinen et al. 2015] place even more details of the datapath under the control of the software. For TTA, the main benefit is reduced register file pressure, which allows more energy-efficient and scalable designs because the general-purpose register file is less of a bottleneck. Clearly, the exposed datapath adds even more complexity to the compiler side due to the additional scheduling freedoms. The challenge of exploiting the datapath automatically from higher-level software descriptions has in part hindered the appeal of these machines, despite their excellent potential for low-power, high-performance designs.
The code-generation phase in compilers for exposed datapath cores is commonly implemented using heuristic methods as they produce results in a reasonable time. The quality of the produced code, however, is usually suboptimal. Code generators based on mathematical models can be used to produce optimal results, but they are notorious for their superpolynomial complexity, leading to compile time explosion with larger programs. Therefore, although the computational power that can be utilized for solving the models has increased and keeps increasing dramatically, it is still the case that mathematical models are used only for scheduling hot spots of the program, such as heavily executed inner loops. In order to make the compilation time feasible, the less frequently executed parts can be processed with a heuristic code generator, as their code quality is not as significant to the total execution time.
In this article, we propose an Integer Linear Programming (ILP)-based mathematical model that supports the unique scheduling freedoms presented by the TTA programming model. To the best of our knowledge, this is the first publication of such a working model. We show that the proposed model can provide significant improvements compared to a scheduler based on heuristics, with the compilation time still being reasonable for production-code generation. The contribution has major benefits due to the challenges of efficiently compiling high-level programs for irregular processors with a high degree of programmer control, such as reduced connectivity network TTAs.
The rest of this article is organized as follows. Section 2 introduces the TTA and its programming model in order to explain the additional challenges it poses for the scheduling problem. Section 3 describes the proposed ILP model. The model is evaluated in Section 4 using two distinctly different TTA machines and a set of benchmark programs. Previous similar code-generation work is reviewed in Section 5. Some of the planned future work is discussed in Section 6 before concluding the article in Section 7.
TRANSPORT TRIGGERED PROGRAMMING MODEL
The classical VLIW is a processor architecture style that is designed to take advantage of instruction-level parallelism in programs without requiring complex control unit hardware by making the parallelized operations explicit to the programmer [Fisher 1983]. However, there are scalability issues with the common VLIW style: when the number of function units is increased to support more instruction-level parallelism, the complexity of the interconnection network and the register file of a VLIW processor grows dramatically, because worst-case register file accesses must be assumed. A more complex bypass network and the large number of register file ports lead to increased power consumption, increased chip area, and reduced clock frequency [Hoogerbrugge and Corporaal 1994].
TTA is a processor design style that has been proposed to alleviate some of the bottlenecks of VLIW-style processor designs [Corporaal 1995]. TTA generalizes VLIW in the sense that the interconnection network is visible to the programmer, which moves even more complexity from the hardware to the compiler. As a side effect, it alleviates the interconnection network and register file bottleneck by means of a more feasible customization of the interconnection (or bypass) network and a reduced number of simultaneous general-purpose register file accesses. This translates to energy savings in comparison to traditional VLIW designs, because general-purpose register accesses often constitute a significant part of the total power consumption.
In the TTA processor template targeted by the model in this article [Cilio et al. 2006], the TTA interconnection network is composed of sockets that are attached to buses. Sockets can be either unidirectional or bidirectional depending on the ports attached to them. Each unit port has a socket attached to it; these sockets are further attached to one or more buses. In Figure 1, the socket-bus connections are illustrated with circles over the intersection of a bus and a socket. The arrows at the sockets indicate the direction of data transfers. Although the example TTA in the figure has only one ALU, a single LSU, and a very simple RF with only two ports, the TTA processors supported by the referred template can have an arbitrary number of computational resources to support different degrees of instruction-level parallelism available in applications of interest. Furthermore, the connectivity and the number of buses are also completely customizable in the processor template, imposing an additional challenge to the proposed model.
As TTAs are in-order processors without instruction-scheduling hardware, the control unit only fetches instructions from memory, decodes them into control signals, and implements operations that affect control flow, such as call and jump.
Although the TTA approach clearly has its benefits, TTAs are not considered the best option for general-purpose operating systems or for reactive applications, because exceptions and fast interrupts are expensive to implement in TTAs due to the abundance of programmer-visible architecture context data. The authors have found the best uses for TTA-style processors in customized programmable designs that execute data-oriented kernels and might have high performance requirements combined with a strict power budget or, at the other end of the spectrum, in ultra-low-power IoT scenarios that need some computational performance. For handling exceptions and fast interrupts, cores of other styles can be used either in an assisting role to a TTA master, or in a more traditional host-slave offloading setup with a general-purpose core as the master.
For these reasons, there are no fabricated TTA general-purpose processors available in bulk, but TTAs are used as accelerators in customized SoCs, whose internals are often company secrets. In addition to various academic TTA designs, the authors know of two that are advertised in commercial products [KDPOF 2014; Maxim Integrated Products, Inc. 2004], and of multiple designs fabricated as test chips but not yet sold or published. Furthermore, several commercial designs contain other variations of the "exposed datapath" concept that can benefit from the model described in this article [Jääskeläinen et al. 2015].
Programming Model
The VLIW processor instruction stream is explicitly parallel, consisting of sets of operations that are executed at each cycle. The operations describe the operation codes to execute, the input operands, and the result register as general-purpose register indexes. The hardware then decodes these instructions to control signals that perform movements of data in the interconnection network of the datapath at correct time slots.
The programming model of TTA increases the visibility of the datapath to the programmer by making the interconnection data transports explicit. TTA programs consist of instructions that describe parallel data transports, later referred to as moves, between the units on the interconnection network. In this programming style, the concept of a bus is tightly bound to the concept of a move slot in the instruction word. A bus can transfer a single value between two connected sockets at each instruction cycle. The source and destination of the transfer are encoded in the instruction word in a move slot that controls a single bus.
A program operation is divided into a set of operand and result moves. The actual operations are launched as a side effect of these data transports into the units, hence the name transport triggered. In the used TTA template, exactly one of the operand moves is a trigger move that starts the execution of an operation in a function unit.
The template supports constant values, or immediates, in two ways: encoded into the move operation source field (short immediates) or encoded to occupy one or more move slots in the instruction word (long immediates). The latter is controlled using separate template bits in the instruction word, which also describe the destination Immediate Unit (IU) to which the decoder should transfer the constant when encountering the instruction in the stream.
As a concrete example of the programming model, let us consider a program operation of two inputs ADD x,y → z. It is described as a TTA program by means of two operand moves: x → ALU.in and y → ALU.t.ADD, and a single-result write move ALU.out → z. This naming convention is typical when working with the referred TTA template: in (1 . . . n) are input ports, t is the trigger port, and out (1 . . . n) are output ports for reading the results.
Input operands of the addition operation are transported through the connections provided by the interconnection network to a compatible unit. When compiling from a high-level language, the selection of which function unit to use for the operation is the responsibility of the instruction selector or, as in our case, the instruction scheduler.
After the operation latency, the result is read from the ALU unit and moved to the register z. The latency depends on the implementation of the operation in the chosen FU. For example, it can be a multicycle implementation, a pipelined multicycle implementation, or a single-cycle implementation. In addition, the operation can share resources with other operations in the same FU, implying additional scheduling constraints.
An additional challenge placed on the proposed instruction-scheduler model is that the timing of the result reads is explicit in the used template: reading an FU port before the result is ready returns the previous value, and reading it too late returns a more recently computed value.
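The decomposition of a program operation into moves can be sketched in a few lines of Python. This is a minimal illustration only, not the TCE implementation: the `Move` type and the `decompose` helper are hypothetical names, and routing the last operand to the trigger port is an assumption that simply mirrors the ADD example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """A single data transport from a source to a destination (hypothetical type)."""
    src: str
    dst: str

    def __str__(self) -> str:
        return f"{self.src} -> {self.dst}"


def decompose(opcode, operands, result, fu="ALU"):
    """Split a two-input operation into operand moves and one result move.

    Assumption: the last operand is written to the trigger port, which also
    carries the opcode, as in the ADD example of the text.
    """
    *plain, trigger = operands
    moves = [Move(src, f"{fu}.in") for src in plain]   # ordinary operand move
    moves.append(Move(trigger, f"{fu}.t.{opcode}"))     # trigger move
    moves.append(Move(f"{fu}.out", result))             # result move
    return moves


add_moves = decompose("ADD", ["x", "y"], "z")
for move in add_moves:
    print(move)
```

The result move is listed last here only because the input is sequential; choosing its actual cycle relative to the trigger move is left to the scheduler.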
TTA-Specific Optimizations
The TTA programming model allows unique software optimizations. Perhaps the most well-known is software bypassing, which is an optimization that is traditionally performed at runtime by the processor hardware's register file bypassing or "forwarding" logic in the case of dynamic processors [Corporaal and Hoogerbrugge 1995] . In software bypassing, the program transfers a result of an operation directly from an FU producing it to the inputs of FUs that need the value. This alleviates the pressure placed on the general-purpose registers, especially the ports of the register files. Furthermore, bypassing may eliminate some false dependencies between operations that otherwise would need to share a general-purpose register, increasing the instruction-level parallelism.
If all the reads of the result can be bypassed to their destinations by the program, software bypassing may leave behind an unnecessary register file write move. Removing such result moves to general-purpose registers that are never read is called dead-result (move) elimination.
For example, consider a TTA program containing two dependent operations ADD x,y → z and SUB z,a → b. The latter operation uses the result of the former operation, thus the latter depends on the former. Together, these operations produce the following TTA moves:
If the machine has a connection between the ALU.out and ALU.in, the result of the addition operation can be bypassed directly to ALU.in. After the bypass, move ALU.out → z can be eliminated, as there are no other uses for the result:
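The move sequences before and after the optimization can be reproduced with a short Python sketch. It rests on stated assumptions: the SUB operand ports (z to ALU.in, a to ALU.t.SUB) are chosen to agree with the statement that the result is bypassed to ALU.in, the `bypass` helper is a hypothetical name, and connectivity and timing checks are deliberately left out.

```python
def bypass(moves, reg, live_out=()):
    """Software-bypass register `reg` and drop its write if it becomes dead.

    `moves` is an ordered list of (source, destination) pairs.
    Assumption: `reg` is written exactly once in `moves`, before its reads.
    """
    # Find the source port of the result move that writes the register.
    producer = next(src for src, dst in moves if dst == reg)
    # Redirect every read of the register to the producing FU port.
    rewritten = [(producer, dst) if src == reg else (src, dst)
                 for src, dst in moves]
    # Dead-result elimination: keep the write only if the value is live-out.
    if reg not in live_out:
        rewritten = [(src, dst) for src, dst in rewritten if dst != reg]
    return rewritten


# ADD x,y -> z followed by SUB z,a -> b (port choice for SUB is assumed).
sequential = [
    ("x", "ALU.in"), ("y", "ALU.t.ADD"), ("ALU.out", "z"),
    ("z", "ALU.in"), ("a", "ALU.t.SUB"), ("ALU.out", "b"),
]
bypassed = bypass(sequential, "z")
for src, dst in bypassed:
    print(f"{src} -> {dst}")
```

Passing `live_out=("z",)` keeps the register write, which corresponds to bypassing without dead-result elimination.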
Apart from handling software bypassing and dead result elimination, the TTA programming model presents the timing of the operand and result transfers as a new scheduling freedom. While the timing of data transfers is fixed in traditional VLIWs, the TTA model allows choosing the timing for the transfers freely, which further reduces the RF port bottleneck, but presents yet another parameter that must be controlled efficiently using the scheduling model.
ILP FORMULATION FOR TTA INSTRUCTION SCHEDULING
In this section, we introduce the basics of the involved ILP concepts; a more detailed discussion can be found in Rossi et al. [2006]. Then, we describe the proposed ILP model, focusing on the challenges presented by the unique degrees of freedom in programming TTAs.
Integer Linear Programming Fundamentals
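A Constraint Satisfaction Problem (CSP) is defined by a set of variables X, their respective domains D, and a set of constraints C, where each constraint $S_j$ in C is a relation on the variables.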
A feasible solution to a CSP is an assignment x of values to the variables in X, where the assigned values appear in their respective domains D, that satisfies the constraints in C. There might be no solution, a unique solution, or multiple solutions. Typically, the applications are underconstrained, that is, they have multiple solutions that can be ranked according to some property of the solution. These kinds of problems are called Constraint Optimization Problems (COP). In addition to the constraints, they have an objective function f that is to be either minimized or maximized. An optimal solution $x^*$ is a feasible solution such that, for any other feasible solution x,

$$f(x^*) \le f(x),$$

given that the objective function is to be minimized.
ILP problems are a subset of constraint satisfaction problems in which the variables X are restricted to be integers, and the objective function f as well as the constraints C are linear. Moreover, 0-1 integer programming is a special case of ILP problems in which the variables are required to be binary. ILP has been applied widely to real-world problems such as operations scheduling, artificial intelligence, and resource allocation [Jünger et al. 2010].
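For reference, a 0-1 ILP with n variables can be written in the following generic form. The symbols c, A, and b are standard textbook notation for the cost vector, the constraint matrix, and the right-hand side; they are introduced here for illustration and are not used elsewhere in this article.

```latex
\begin{align*}
\min_{x} \quad & c^{\mathsf{T}} x \\
\text{subject to} \quad & A x \le b, \\
& x \in \{0, 1\}^{n}
\end{align*}
```

Equality constraints fit the same form, since each equality can be expressed as a pair of opposite inequalities.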
Constraint satisfaction problems are typically NP-hard [Mackworth 1977], and many important constraint satisfaction problems have been proven to be NP-complete [Rossi et al. 2006]. In particular, ILP is NP-hard [Jongen et al. 2004] and 0-1 integer programming is NP-complete [Karp 1972]. Therefore, compiler algorithms utilizing ILP are not applicable to arbitrary programs in general; instead, they are usually applied only to the most important parts of the program that are of feasible size.
Constraint satisfaction problems are typically solved using some sort of search that explores the possible assignments of the variables. Typical techniques are backtracking and constraint propagation. There are numerous algorithms to solve integer linear problems exactly. A popular solver algorithm is the Branch and Bound (BB) algorithm [Land and Doig 1960], which is also used in our work. Today, there are several commercial ILP solvers available (e.g., CPLEX, Gurobi, and the Xpress optimization suite), which promotes the use of ILP in code optimization.
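The idea behind branch and bound can be illustrated with a small, self-contained Python sketch for 0-1 problems of the form "minimize c·x subject to A·x ≤ b". It is a generic textbook-style illustration with a deliberately simple bound, not the solver used in this work; production solvers typically bound with the LP relaxation instead. The function name and the example data are hypothetical.

```python
def branch_and_bound(c, A, b):
    """Minimize c.x subject to A.x <= b with binary x, by depth-first search."""
    n = len(c)
    best_value, best_x = float("inf"), None

    def lower_bound(x):
        # Cost of the fixed prefix plus the most optimistic cost of the rest.
        k = len(x)
        return sum(ci * xi for ci, xi in zip(c, x)) + sum(min(0, ci) for ci in c[k:])

    def may_be_feasible(x):
        # A row can still be satisfied if its smallest reachable value fits.
        k = len(x)
        for row, rhs in zip(A, b):
            lowest = (sum(a * xi for a, xi in zip(row, x))
                      + sum(min(0, a) for a in row[k:]))
            if lowest > rhs:
                return False
        return True

    def search(x):
        nonlocal best_value, best_x
        # Bound: prune branches that are infeasible or cannot beat the incumbent.
        if not may_be_feasible(x) or lower_bound(x) >= best_value:
            return
        if len(x) == n:                      # complete assignment: new incumbent
            best_value, best_x = lower_bound(x), list(x)
            return
        for value in (0, 1):                 # branch on the next variable
            search(x + [value])

    search([])
    return best_value, best_x


# Hypothetical example with three binary variables and three constraints.
value, solution = branch_and_bound(
    c=[-5, -4, -3],
    A=[[2, 3, 1], [4, 1, 2], [3, 4, 2]],
    b=[5, 11, 8],
)
print(value, solution)
```

The search is exponential in the worst case, which is consistent with the complexity results cited above; the pruning only helps when the bound is tight enough to cut large subtrees.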
Model Formulation
The local instruction scheduler is modeled as a 0 -1 ILP problem. The input to the scheduler consists of sequential operation moves generated from the program operations in the compiled high-level language software, and a Data Dependency Graph (DDG) built from the moves belonging to a basic block. The sequential intermediate representation has registers preallocated to physical machine registers, and it uses only operations that are found in the target machine, but it does not yet have the operations assigned to function units, nor does it use software bypassing or exploit instruction-level parallelism. The input format can be thought of as targeting a simple virtual scalar TTA machine that can trigger one instruction at a time. The DDG includes true and false dependencies due to register usage and memory accesses as edges connecting nodes that bundle together moves belonging to program operations such as ADD and MUL.
Initially, each move may have multiple candidate destination and source ports. Similarly, there might be more than one connection between a given pair of destination and source ports on the interconnection network. Immediate values have only a destination port and an assigned bus. After scheduling, every move of this sequential intermediate format must be assigned to a single cycle (instruction) and to a connection on the interconnection network, which consequently maps the operations to function units.
In order to integrate the software bypassing decision into the model, in addition to the moves generated from the compiled program, a bypass move is created for each result move that is a candidate for bypass optimization given the interconnection network. The bypass move is generated by taking the source from the result move and the destination from the bypass candidate move that reads the result.
The decision variables of the model specify the connection to be used and the cycle in which the move is to be executed. With each possible move assignment we associate a binary indicator variable $M_{i,t,c}$ that equals 1 if move i is assigned to connection c at cycle t, and 0 otherwise. The interconnection network confines the set of possible connections $C_i$ for each move i. For each move, there exist variables for all possible connections $C_i$ and cycles in the range $[t_{min}, t_{max}]$. This range depends on the input program and the TTA machine in question. The length of the range greatly increases the model size and, in turn, the required solving time. A good trade-off can be achieved by setting the range from the DDG critical path length to the heuristically scheduled program cycle count. Let P be the set of all moves in the considered TTA program.
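How the number of decision variables depends on the connection sets and the cycle range can be seen from a short Python sketch. The helper name, the move labels, and the bus names B0 to B2 are hypothetical and serve only as an illustration.

```python
from itertools import product


def decision_variables(connections, t_min, t_max):
    """Enumerate the index triples (i, t, c) of the binary variables M[i, t, c].

    `connections` maps each move i to its candidate connection set C_i.
    """
    cycles = range(t_min, t_max + 1)          # inclusive range [t_min, t_max]
    return [(i, t, c)
            for i, C_i in connections.items()
            for t, c in product(cycles, C_i)]


# Hypothetical moves of one ADD operation on a machine with buses B0..B2.
C = {
    "x->ALU.in": ["B0", "B1"],
    "y->ALU.t.ADD": ["B1"],
    "ALU.out->z": ["B0", "B1", "B2"],
}
variables = decision_variables(C, t_min=3, t_max=6)
print(len(variables))   # (2 + 1 + 3) connections times 4 cycles
```

Each additional cycle in the range adds one variable per candidate connection of every move, which is why tightening the range matters for solving time.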
Constraints
The following constraints are used in the proposed model to produce valid TTA schedules.
All Non-Bypass Related Moves Must be Assigned. All the moves that are not bypass moves, move candidates for bypassing, or bypass result moves must be assigned exactly once to a connection and a cycle:
$$\forall i : \quad \sum_{t = t_{min}}^{t_{max}} \sum_{c \in C_i} M_{i,t,c} = 1$$
Dependencies Between Moves. The DDG defines the legal orders for execution of moves in a given program. Let $M_{i,t,c}$ be an indicator associated with an arbitrary move and $M_{D_i,t',c'}$ be its source dependence. The following constraint must be satisfied
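One way to write this constraint, consistent with the description that follows, is sketched here. The summation bounds and the connection set $C_{D_i}$ of the source dependence are assumptions, as is the choice to take l as the latency of the operation that $D_i$ belongs to.

```latex
\forall t \in [t_{min}, t_{max}] : \quad
\sum_{c' \in C_{D_i}} M_{D_i,t,c'}
\; + \;
\sum_{t' = t_{min}}^{t + l} \; \sum_{c \in C_i} M_{i,t',c}
\;\le\; 1
```

If the source dependence is placed at cycle t, the first sum equals one, which forces every indicator of move i in the cycles up to t + l to zero.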
The constraint requires for each t that the dependent move must not be executed within the time interval $[t_{min}, t + l]$, where l is the latency of the executed operation.

Bypassing and Dead-Result Elimination. The temporary bypass moves generated from the bypass candidate moves are alternative to each other. The bypass move inherits the dependencies from the source move, and each move that is dependent on the bypassed move is dependent on the bypass move as well, except for the possible register antidependencies that do not exist in the bypassed case. Figure 2 presents a data dependence graph with an added bypass move, highlighted by the dashed line. The bypass source ALU.out → z and bypass candidate move z → ALU.t.SUB are emphasized with the dashed-dotted polyline. The bypass move contains the incoming edges of the bypass source ALU.out → z and, respectively, the outgoing edges of the bypass candidate z → ALU.t.SUB.
Dead-result elimination allows the removal of the result move in the case that a move is bypassed to all the consumers of the result. For example, in the situation presented in Figure 2 , the result move ALU.out → z can be eliminated, as the only outgoing edge is to the bypassed move z → ALU.t.SUB.
In summary, bypassing together with dead-result elimination results in two possible cases. (a) The result move has only a single outgoing edge. In this case, either the bypass move is assigned (bypass and eliminate), or both the bypass candidate and the result move are assigned (no bypassing or elimination). (b) The result move has multiple outgoing edges. In this case, the result move is always assigned, and the bypass candidate and the bypass move are alternatives to each other. If all the bypass moves are assigned, the result move can be eliminated postsolving. Bypassing leads to latency reduction, and eliminating the result move additionally saves energy. Deferring the elimination in case (b) slightly limits the scheduling freedom, but makes the model more understandable and less demanding in terms of solving complexity.
where $M_{candidate}$, $M_{bypass}$, and $M_{result}$ are the bypass candidate, the bypass move, and the result move, respectively. The latter condition requires two constraints.
Also, the result moves have to be assigned, yielding
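The two cases can be encoded, for instance, as follows. This is a sketch under assumptions rather than the exact formulation of the model: $\Sigma(M_x)$ is shorthand introduced here for the total assignment count of move x over all cycles and connections, and in case (b) the first constraint is understood to be repeated for every consumer of the result.

```latex
% Shorthand (assumed): \Sigma(M_x) = \sum_{t} \sum_{c \in C_x} M_{x,t,c}
\begin{align*}
\text{(a)}\quad & \Sigma(M_{bypass}) + \Sigma(M_{candidate}) = 1, &
                  \Sigma(M_{result}) &= \Sigma(M_{candidate}) \\
\text{(b)}\quad & \Sigma(M_{bypass}) + \Sigma(M_{candidate}) = 1, &
                  \Sigma(M_{result}) &= 1
\end{align*}
```

In case (a), choosing the bypass move forces both the candidate and the result move to zero, which corresponds to bypassing with dead-result elimination inside the model.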
Register File Port Constraints. Register files have a number of registers that can be read through output ports. Even if an output port was connected to multiple buses, only one register can be read through a single port at the same cycle. In other words, the number of output ports bounds the number of concurrent register reads from a register file.
Similarly, each input port can write into a single register at a time. To limit the outbound moves at each cycle to one for each port, we require that

$$\forall rf \in R, \; \forall t : \quad \sum_{i \in P} \; \sum_{c \in C_i \cap C_{rf}} M_{i,t,c} \le 1$$

where R is the set of all RF ports on the given TTA processor, and $C_{rf}$ is the set of all connections that originate from a register file port rf.
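The same port limit can be checked statically on a finished schedule. The following Python sketch counts the moves per register file port and cycle; the function name, the connection labels, and the example schedule are hypothetical and only illustrate the constraint.

```python
from collections import Counter


def rf_port_violations(schedule, port_of_connection):
    """Return the (port, cycle) pairs that carry more than one move.

    `schedule` lists the chosen (move, cycle, connection) triples, that is,
    the variables with M[i, t, c] = 1. `port_of_connection` maps a connection
    to the RF port it originates from; connections of other units are absent.
    """
    load = Counter(
        (port_of_connection[c], t)
        for _, t, c in schedule
        if c in port_of_connection
    )
    return sorted(key for key, count in load.items() if count > 1)


# Hypothetical machine: one RF read port r0 connected to two buses.
ports = {"RF.r0->B0": "RF.r0", "RF.r0->B1": "RF.r0"}
clash = [
    ("x->ALU.in", 1, "RF.r0->B0"),
    ("y->ALU.t.ADD", 1, "RF.r0->B1"),   # second read through r0 in cycle 1
    ("ALU.out->z", 3, "ALU.out->B0"),
]
legal = [
    ("x->ALU.in", 1, "RF.r0->B0"),
    ("y->ALU.t.ADD", 2, "RF.r0->B1"),
    ("ALU.out->z", 3, "ALU.out->B0"),
]
print(rf_port_violations(clash, ports))
print(rf_port_violations(legal, ports))
```

The first schedule is rejected even though the two reads use different buses, because both leave the register file through the same port.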
Function Unit Constraints. Operations executed at function units consist of multiple operands, of which one is a triggering operand that starts the operation execution. All other operands shall be written to FU input ports at any cycle before or at the same time as the triggering operand. A trigger move constrains the latest cycle on which other operands must be written. There might be multiple function units that can execute each operation, and each operand might have multiple connections to an FU. Therefore,
where FU is the set of all function units that can execute the considered operation, $C_f$ is the set of all connections that can transport a given move to an appropriate port on f, $M_{trigger}$ is the trigger move, and O is the set of operand moves that relate to the triggering move $M_{trigger}$ on the FU of the port f. The parameter e is the so-called operand slack limit, a model parameter that restricts the operand move transport to occur at most e cycles before the triggering move to limit the model size.

Fig. 3. Function unit port reservations until the triggering operand is written. The first operand, FU.in, reserves the corresponding input port until the execution is triggered to start from port FU.t (illustrated with the blue line).
No other operand moves may be written to the operand port between cycles $[t', t]$, where t and t' are the cycles in which the triggering move and the operand move are written, respectively. In other words, an operand port is reserved for the program operation until the triggered execution starts after the trigger operand is written. This is illustrated in Figure 3, in which an operation with two operands is being transferred to an FU. The port reservation status after an operand write is illustrated with a blue line, which spans until the triggering port FU.t is written into. This yields a constraint
where $M_o$ is the current operand node, and K is the set of other operand moves that might be assigned to the same port as $M_o$. By setting the left-hand side equal to two, both the triggering move and the current operand move can be equal to one, and all the other operand moves must be zero.

The model assumes that the results of a program operation are available for reading at the output ports once the operation latency has passed, counted from the time instance at which the triggering move has been scheduled, until a new triggered operation overwrites the result. This adds a constraint that all result reads of an operation must take place some time after the results are ready, and no other program operation shall overwrite the result before all the result reads have been scheduled. However, at the same time, pipelined execution of multiple program operations in the same FU must be supported for maximum operation throughput.
As an example, Figure 4 illustrates the port reservation for a legal schedule of moves from two separate program operations in an FU where FU.t is the triggering port and both operations have a latency of two cycles. The first operation result is available at t = 4, that is, after the operation latency has passed. Since the result of the first operation is read only at t = 5, the execution of operation 2 shall not start before t = 4.
In order to enforce all result moves to occur at correct intervals, where R is the set of all result reads of the program operation that $M_{trigger}$ triggers, l is the latency of the operation, and y is the result read slack, the maximum number of cycles that any result can be held in an output port. In addition, trigger moves from all other program operations must be prevented from overwriting the result too early:
where T is the set of other trigger moves to function unit f besides $M_{trigger}$, d is the latency of the operation of $M_{i,t,c}$, and R is the set of result moves of $M_{trigger}$. The reasoning behind the constraint is identical to that of Equation (8).
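The timing rules for result reads can be summarized in a small Python check. It is a sketch under assumptions: the function and parameter names are hypothetical, a result is taken to be ready at trigger cycle plus latency and readable for at most `slack` further cycles, and a later operation is taken to overwrite the port in the cycle its own result becomes ready. The numbers in the usage lines follow the two-operation example above, with the first trigger assumed to be at t = 2.

```python
def reads_are_legal(trigger, latency, reads,
                    next_trigger=None, next_latency=0, slack=5):
    """Check the result-read window of one operation in a pipelined FU.

    Assumed semantics: the result is ready at trigger + latency, may stay in
    the output port for at most `slack` cycles, and is overwritten when the
    result of the next triggered operation becomes ready.
    """
    ready = trigger + latency
    # Every read must fall inside [ready, ready + slack].
    if any(r < ready or r > ready + slack for r in reads):
        return False
    # The next result must become ready strictly after the last read.
    if next_trigger is not None and next_trigger + next_latency <= max(reads):
        return False
    return True


# First operation triggered at t = 2 with latency 2, result read at t = 5.
print(reads_are_legal(2, 2, [5], next_trigger=4, next_latency=2))
print(reads_are_legal(2, 2, [5], next_trigger=3, next_latency=2))
```

Under these assumptions the second operation may be triggered at t = 4 but not at t = 3, which agrees with the example, while both operations still overlap in the FU pipeline.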
No proof of completeness is provided at this time, but the correct results were validated using simulations and static resource constraint checking.
The Objective Function
The objective function of the ILP model can be varied depending on the requirements of the application, the processor, and the software running on it. In this article, we focus on minimizing the execution time of the compiled program; other interesting goals could be, for example, to minimize the energy consumption of the program while meeting a maximum execution time constraint.
Minimizing the execution time of the schedule can be formulated as minimizing the latest execution time of all moves of the scheduled program. This can be expressed in the model as an objective function that attempts to place the moves in the leaf nodes of the DDG as early as possible. This implies that all the other nodes of the graph are also pushed to be scheduled earlier, as they are all predecessors of the leaf nodes.
The objective function to minimize the schedule length is

$$\min : \quad \sum_{i \in L} \; \sum_{t = t_{min}}^{t_{max}} \; \sum_{c \in C_i} t \cdot M_{i,t,c}$$
where L is a set of moves in the leaf nodes of the DDG. It should be noted that the model must minimize the issue time of all the leaf nodes and not just the ones on the critical path, because otherwise there might be "outlier leaf nodes" that "stretch" the schedule and increase the cycle count of the basic block.
EVALUATION
The ILP model for the TTA scheduling problem was integrated into the compiler of a design and programming tool, the TTA-based Co-design Environment (TCE) [Esko et al. 2010]. TCE's compiler utilizes LLVM [Lattner and Adve 2004] and Clang version 3.6 for language support, intermediate representation optimizations, and parts of the code generation. Before the code is passed to the ILP-based scheduler, the standard level-three optimizations of LLVM are executed on the intermediate representation. In addition, the registers were allocated and operations were selected using default LLVM code-generation passes.
Compilation was benchmarked on a server that has a 6-core (12 hardware threads) Intel i7-3930K CPU with a clock speed of 3.2GHz, and 16GB of RAM. The models were optimized with the Solving Constraint Integer Programs (SCIP) mixed-integer programming solver, which is one of the fastest noncommercial solvers available [Achterberg et al. 2008]. All basic blocks of the test programs were scheduled using the ILP-based scheduler. The operand slack and the result read slack limits were both set to 5 for the experiments to decrease the computational complexity of the models. We also ran the most complex architecture and program combinations with higher values, but that did not result in improvements.
The performances of the produced schedules were compared to the heuristic operation-based list scheduler that tcecc uses by default. In addition to the cycle count, for each input program, the number of register accesses was inspected to evaluate the efficiency of applying the TTA-based optimizations that have the capability to reduce register file accesses.
All the execution statistics of the compiled programs were obtained using the instruction cycle-accurate architecture simulator bundled with TCE. Because ILP-based schedulers are known to be slower than heuristic ones, we also measured the CPU time and the wall clock time spent during the compilation to assess the practical compilation time. The GNU time utility (version 1.7) was used for these measurements.
We used five benchmark programs from the DSPstone suite [Zivojnovic et al. 1994]: Infinite Impulse Response (IIR) filter, Finite Impulse Response (FIR) filter, dot product, convolution, and complex number arithmetic. These workloads are typical kernels in digital signal processing applications and contain a reasonable amount of instruction-level parallelism while being small enough to be feasible for optimal ILP-based scheduling.
The following sections present the results for two architectures with a very different number of architectural resources.
Minimalist Architecture
The Minimalist architecture is a machine resembling a simple RISC core in its resources. It consists of a register file (RF), a predicate register file (BOOL), ALU, LSU, CU, and multiplication unit (MUL). There are three buses with aggressively reduced connectivity that still provide sufficient connections to execute multiple operations concurrently. The ALU can transport results back to its input ports, enabling bypassing between arithmetic operations. In addition, it is possible to bypass results from the MUL unit to the ALU, and the other way around. LSU results and inputs can be bypassed directly from both the ALU and the MUL unit. The architecture is shown in Figure 5.

The ILP scheduler reduces the cycle count to 52% of that of the heuristic scheduler in the best case, and to 97.5% in the worst case. On average, the cycle count is 70.0% of that of the heuristic scheduler. The relative cycle counts for all benchmarks are shown in Figure 6.
The register accesses of the ILP scheduler are presented in Figure 7 . The reduction of register reads is still apparent, although the objective was only to minimize the cycle count. In some cases (the IIR filter and the dot product calculation), the ILP scheduler was able to eliminate all result reads. The schedule produced for the convolution program had slightly more register reads than the heuristic scheduler.
On average, the ILP scheduler performed 64.7% fewer register file reads than the heuristic scheduler. In addition, the ILP scheduler was able to eliminate 25.1% of the register writes, on average. In the case of the convolution program, the number of register writes was equal to that of the heuristic scheduler.

Figure 8 presents the compilation times. The heuristic scheduler was able to schedule all programs in 1s or less. It is unsurprising that the ILP scheduler takes distinctly more time to schedule the programs: 4.7min on average, and 19.1min in the worst case. Table I presents the simulation results for the Minimalist architecture for the benchmark programs. Table II introduces the corresponding ILP model characteristics.
Clustered Architecture
The Clustered architecture is a clustered VLIW-like architecture divided into separate clusters with increased internal connectivity, which are interconnected with a sparse connection network. In this machine, three clusters are formed of register file and ALU pairs. In addition, the architecture provides a single MUL unit. Although its interconnection network with 17 buses is quite reduced, the architecture provides a high level of concurrency for the operations.
The architecture is illustrated in Figure 9.
Fig. 9. Clustered architecture consisting of three separate computing clusters, a multiplication unit, a CU, an LSU, and a Boolean register file.
The cycle counts with the ILP scheduler are presented in Figure 10. On average, the ILP scheduler decreased the execution time by 19.6%. The most notable reduction was in the case of the complex number arithmetic program, in which the number of cycles was 43.7% lower than that of the heuristic scheduler.
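The results above are reported in two complementary forms: as a relative cycle count (ILP cycles as a percentage of the heuristic cycles, as for the Minimalist architecture) and as a reduction (the percentage of cycles removed, as for the clustered architecture). The following minimal Python sketch shows how the two forms relate. The function names and the cycle counts used in the example are hypothetical and are not taken from the benchmarks.

```python
def relative_cycles(ilp_cycles: int, heuristic_cycles: int) -> float:
    """Return the ILP cycle count as a percentage of the heuristic cycle count."""
    if heuristic_cycles <= 0:
        raise ValueError("heuristic_cycles must be positive")
    return 100.0 * ilp_cycles / heuristic_cycles


def cycle_reduction(ilp_cycles: int, heuristic_cycles: int) -> float:
    """Return the percentage of cycles removed relative to the heuristic schedule."""
    return 100.0 - relative_cycles(ilp_cycles, heuristic_cycles)


# Hypothetical cycle counts chosen only for illustration (not measured values).
example_relative = relative_cycles(520, 1000)
example_reduction = cycle_reduction(520, 1000)
```

Under this convention, a relative cycle count of 52% corresponds to a 48% reduction, and a 19.6% reduction corresponds to a relative cycle count of 80.4%.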
Register accesses of the programs scheduled with the ILP scheduler for the clustered architecture are illustrated in Figure 11. The schedules produced with the ILP scheduler executed 45.7% fewer register reads than those of the heuristic scheduler. In the case of the IIR filter, the ILP scheduler was able to eliminate all register reads. Furthermore, there were 12.2% fewer register writes, on average, compared to the heuristic scheduler.
Figure 12 shows the compilation times for the ILP scheduler with the clustered architecture. The compilation times are even greater than those for the Minimalist architecture. This is because the clustered architecture is much larger and provides many more opportunities for data transports between the units, and thus presents more degrees of freedom and options to the scheduler. However, the growth in compilation time is much smaller than the increase in the number of resources when compared to the Minimalist architecture. The average compilation time for the clustered architecture is 7.2 min. In the worst case, the proposed scheduler uses 21 min to produce a schedule for the FIR filter program. Table III presents the simulation results for the clustered architecture, and Table IV contains the corresponding integer linear programming model characteristics.
Note: In the integer linear programming columns, the percentage in parentheses shows the difference with respect to the corresponding value of the heuristic scheduler.
RELATED WORK
Integer programming and the branch-and-bound solver algorithm have been used extensively in the past for producing optimal solutions for various optimization problems. A general solution for the scheduling problem was proposed in the context of operations research [Greenberg 1968]. That article presents a generic mixed integer formulation for scheduling n jobs to m machines with objective functions that minimize either the total time to process the jobs or the machine idle time. Integer programming was used for the first time to optimize machine code in Arya [1985], which considered a class of register-based vector processors (with Cray-1 used as an example), and provided separate formulations for both acyclic code and loops.
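To make the objective of minimizing the total processing time concrete, the following Python sketch assigns n jobs to m machines so that the finishing time of the most loaded machine is as small as possible. It is only an illustration under strong assumptions that are not taken from the cited work: the jobs are independent, the machines are identical, and the optimum is found by exhaustive enumeration rather than by a mixed integer formulation solved with branch and bound. The function name and the job durations are hypothetical.

```python
from itertools import product
from typing import List, Optional, Tuple


def min_makespan(durations: List[int], machines: int) -> Tuple[int, Tuple[int, ...]]:
    """Exhaustively search all job-to-machine assignments.

    Returns the smallest achievable makespan (finishing time of the most
    loaded machine) and one assignment that achieves it, where
    assignment[j] is the machine index chosen for job j.
    """
    best_span: Optional[int] = None
    best_assignment: Tuple[int, ...] = ()
    # Each job independently picks one of the machines: machines ** n candidates.
    for assignment in product(range(machines), repeat=len(durations)):
        loads = [0] * machines
        for job, machine in enumerate(assignment):
            loads[machine] += durations[job]
        span = max(loads)
        if best_span is None or span < best_span:
            best_span, best_assignment = span, assignment
    return (best_span if best_span is not None else 0), best_assignment


# Hypothetical job durations, used only for illustration.
span, assignment = min_makespan([3, 2, 2, 1], machines=2)
```

The enumeration grows as m to the power of n, which illustrates why exact scheduling approaches suffer from long solving times as the problem size grows.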
ILP-based simultaneous instruction scheduling and register allocation for multi-issue machines is proposed in Chang et al. [1997]. In this work, two separate solutions were proposed: one that can handle spill code, and another assuming that no spill code is needed. The former is much more complex and should be used only in the case that one must assume that the available registers are not sufficient for the program's live variables. Their solution makes several simplifying assumptions, such as assuming that live ranges of variables do not span across multiple basic blocks and that the scheduled instructions always take only one cycle to execute. In our model, we make no such assumptions in order to make the model generally usable for different programs and targets with multicycle function units. Especially important for performance is our model's capability to handle pipelined execution of operations in function units.
Similarly, Kästner and Langenbach [1999] propose utilizing ILP for integrated scheduling and register allocation. This work targets an irregular DSP and also uses approximations to produce good-enough schedules in a shorter compilation time. The compilation time problem of ILP-based instruction scheduling is also addressed in Wilken et al. [2000], which presents model size reductions by means of data dependence graph simplifications. These techniques are generic enough to be adapted to our model to reduce the compilation time.
The work in Kästner and Winkel [2001] targets IA-64, an architecture with instruction bundles that attempt to reduce the NOP waste of VLIW architectures. Because the bundle selection affects the size of the resulting instruction stream, that work attempts to optimize the program bit size as a secondary goal after schedule length. Scheduling and bundling are performed in separate stages; thus, the approach does not always lead to optimal solutions, but it allows the compilation to complete in a much shorter time. Our model takes into account the instruction templates (which describe the NOP slots, move slots, and immediate pieces in the instruction word) during the scheduling process as one optimized parameter. This enables including the bit size minimization as an optimization goal in the case of variable length instruction targets.
In addition to combining register allocation and instruction scheduling in the same ILP formulation, Bednarski and Kessler [2006] also integrate the instruction selection phase for implementing all major phases of VLIW code generation using the same mathematical model. The model is extended to support clustered VLIW architectures in Eriksson et al. [2008] and modulo scheduling in Kessler [2009, 2012]. These present one of the most complete published solutions to clustered VLIW code generation using ILP.
Constraint Programming (CP) is another combinatorial approach with a higher abstraction level than integer programming. In comparison to ILP, it appears to be a "cleaner" approach, as the models do not need to be formulated as linear equations and inequalities that might not be natural for the problem domain. A major difference from linear programming is that CP is meant for producing feasible solutions that satisfy a set of constraints, whereas integer programming gives one or more optimal solutions that are known to produce the minimal value for a desired objective function directly. CP has also been used for implementing code generators. A notable recent work is Unison [Lozano et al. 2012, 2013; Lozano 2014], a retargetable CP-based code generator that combines register allocation and instruction scheduling. Formulating the TTA scheduling problem as a CP problem is an interesting alternative that we plan to study more in the future.
The closest published work to ours is Guo et al. [2006]. It proposes an ILP-based compiler backend for Synchronous Transfer Architectures (STA). The STA approach is a simplification of the TTA that we target in our work [Cichon et al. 2004]. Interestingly, STA also supports software bypassing (the authors refer to it as direct data routing), which is included in their ILP model. However, STA is not transport programmed, but is more like a standard VLIW with exposed function unit ports. It has a separate opcode field in the instruction word (like standard VLIWs) to start the operation execution for each function unit. Thus, it does not present operand/result scheduling freedom to the compiler, which simplifies the scheduling and the instruction word, but misses the register file pressure reduction available from freely scheduled operand data transports [Jääskeläinen et al. 2015]. In addition, their ILP model limits the bypass distance to zero. In other words, direct data routing is done only in the case that the result of an operation is ready on the exact cycle that it is needed. This reduces the impact of bypassing, as is shown in Guzma et al. [2008].
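The effect of restricting the bypass distance to zero can be sketched in a few lines of Python. Here the bypass distance is taken to be the number of cycles between the cycle on which a result becomes ready and the cycle on which the consumer needs it. All names and the example transfers are hypothetical, and the unrestricted case deliberately ignores other limits (for example, a result being overwritten by a later operation of the same function unit), so this is an illustration of the restriction rather than a model of either scheduler.

```python
from typing import Iterable, Optional, Tuple


def bypass_distance(ready_cycle: int, needed_cycle: int) -> int:
    """Cycles between the result becoming ready and the consumer needing it."""
    return needed_cycle - ready_cycle


def can_bypass(ready_cycle: int, needed_cycle: int,
               max_distance: Optional[int] = None) -> bool:
    """Return True if the result may be routed directly to the consumer.

    max_distance=0 mimics a zero-distance restriction; None means the
    distance is not limited (an assumption of this sketch).
    """
    distance = bypass_distance(ready_cycle, needed_cycle)
    if distance < 0:
        return False  # the result is not ready yet when it is needed
    return max_distance is None or distance <= max_distance


def count_bypassable(transfers: Iterable[Tuple[int, int]],
                     max_distance: Optional[int] = None) -> int:
    """Count the (ready_cycle, needed_cycle) pairs that can be bypassed."""
    return sum(1 for ready, needed in transfers
               if can_bypass(ready, needed, max_distance))


# Hypothetical (ready_cycle, needed_cycle) pairs, used only for illustration.
transfers = [(3, 3), (4, 6), (7, 8), (9, 9)]
zero_distance_count = count_bypassable(transfers, max_distance=0)
unrestricted_count = count_bypassable(transfers)
```

In this made-up example, only the two transfers whose result is consumed on the cycle it becomes ready qualify under the zero-distance restriction, whereas all four qualify when the distance is not limited; the remaining transfers would have to go through the register file.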
Based on our related work search, the software-controlled datapath interconnection network and the related unique software optimizations in the TTA programming model bring additional scheduling challenges that have not been addressed in earlier published work. The model that we presented in this article is novel in its capability of utilizing reduced interconnection network TTA machines efficiently when starting from higher-level language software descriptions such as C, which is challenging for heuristic approaches.
FUTURE WORK
The sequential program input to the scheduling phase in the used compiler framework has already undergone some of the code-generation phases in order to exploit generic register allocators and instruction selectors, and to limit the model size. Clearly, the best-quality code would be generated if all the decisions of the code generation, that is, instruction selection, register allocation/partitioning, and instruction scheduling/bundling, were handled simultaneously by the model. However, our experience indicates that the additional antidependency removal benefits from integrated register allocation are probably not very high because the model already uses dead result elimination and software bypassing, which eliminate some of the antidependencies resulting from register reuse. On the other hand, integrating the variable-to-register-file assignment (operation/variable partitioning) is expected to bring more code quality benefits with clustered machines.
When using a code generator based on a mathematical model, one has to be very careful when expanding the controlled parts, as the compile time can explode very easily, making the scheduler usable only for ever smaller pieces of code. Further study of integrating the different phases of a TTA code generator into the model, their compile time impact, and the possible model simplifications that maintain schedule quality (such as those proposed in Wilken et al. [2000]) is left to future work. As a higher priority, we plan to integrate modulo scheduling into the model, as data-oriented processors such as TTAs greatly benefit from software pipelining. In addition, we are experimenting with alternative objective functions that better exploit the energy-saving optimizations of the TTA programming model.
CONCLUSIONS
Static processor architectures, especially so-called exposed datapath architectures such as TTA, are notoriously difficult compiler targets. Code generators based on heuristics produce schedules reasonably fast, but usually lead to suboptimal results and resource underutilization.
In this article, we proposed an ILP-based scheduler for TTA processors. The scheduler can exploit the important TTA-specific software optimizations efficiently and leads to much better-quality schedules. The model was verified by simulating the produced code and benchmarked against a heuristic scheduler, which showed performance improvements. In the best case, the cycle count was reduced to 52% of that of the heuristic scheduler. Dramatic reductions were also measured in register file accesses: at best, the ILP scheduler reduced the number of register file reads to 33% of the heuristic results and the number of register file writes to 18%.
The longer compilation time of the ILP scheduler (up to 21 min in the presented benchmarks) is tolerable when producing the final code for production, but a heuristic scheduler is more usable in the iterative "compiler-in-the-loop" design space exploration in which hundreds of architecture variations are compiled to find the best alternative. Interestingly, we found that the architectures that are more challenging for the heuristic scheduler are often better targets for the ILP scheduler due to the lower degree of freedom. For example, TTAs with a highly reduced connectivity network also have a reduced model size, making ILP-based scheduling especially interesting for aggressively optimized production architectures.
