Abstract
Introduction
Very Long Instruction Word (VLIW) architectures [9] exploit instruction level parallelism (ILP) without the need for run-time data dependence analysis, as this is performed at compile time, resulting in a simpler hardware organization when compared to a superscalar processor. This allows the inclusion of a larger number of functional units (FUs) on a single chip, increasing the opportunities for parallelism. Sophisticated compiling techniques have been developed in order to identify parallelism and schedule operations for ILP machines, such as software pipelining [4], a technique that allows the initiation of successive loop iterations before prior ones have completed. One class of software pipelining algorithms is modulo scheduling [16], an efficient scheme to optimize the use of machines with complex resource patterns.
* This work has been supported in part by research grants from Capes (Brazil), the British Council and the Ministry of Education of Spain under Acciones Integradas grants No. 1016 and 202b.
The ideal VLIW machine has a number of concurrent FUs connected to a register file able to perform two read and one write operation per functional unit in each cycle [3], thus requiring complex hardware organizations with a possible increase in access time. Furthermore, most software pipelining schemes assume that arithmetic operations are all register-register operations and that data is transferred between registers and memory using load and store instructions. The time span from the reservation of a register to hold a value up to the last cycle before the value is used is called a lifetime. In a software pipelining scheme multiple iterations can be initiated before previous ones have completed, which means that lifetimes from the same operation may coexist, thus requiring distinct storage locations and increasing register pressure [14]. The use of a conventional register file (RF) is not a straightforward solution, even for a modest number of functional units, and is unacceptable in terms of scalability [8]. Early designs proposed alternative register file organizations to deal with the problem, among them the Polycyclic architecture [16] and the Cydra machine [17]. Alternative architecture models comprising clusters of FUs and small private register files have also been proposed [3, 12]. Although this approach indeed reduces the hardware complexity of individual register files, it imposes further constraints on both the scheduler and the register allocator, which may result in performance degradation if not properly handled.
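The pressure created by these overlapping lifetimes can be quantified with a standard modulo-scheduling relation; the small sketch below is our own illustration in C (not taken from the cited references) and assumes a lifetime of length L cycles in a kernel with initiation interval II.

    /* Number of instances of a lifetime that are simultaneously alive in a modulo
     * schedule: a new instance starts every II cycles and each lives for L cycles,
     * so roughly ceil(L / II) storage locations are needed for this value alone. */
    int live_instances(int lifetime_length, int ii) {
        return (lifetime_length + ii - 1) / ii;   /* ceil(L / II) */
    }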
Our approach to the problem consists in the design of a scalable VLIW architecture comprising clusters of functional units and private register files implemented as queue structures (QRF), which in turn may also be used as a mechanism for inter-cluster communication [6]. Register files organized by means of queues are believed to be less complex than conventional organizations [1, 10, 11]. However, this hardware simplification imposes new constraints on the register allocator, requiring new techniques to efficiently exploit such an organization. We have taken advantage of the regular pattern of production and consumption of lifetimes resulting from modulo schedules to deduce and prove a Compatibility Test [8] to decide whether values produced by distinct operations are Q-Compatible, which means they can be stored in the same queue. The basic idea is that in a modulo schedule two or more lifetimes can share a common storage queue as long as their production order exactly matches their consumption order.
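To make the idea concrete, the following sketch simulates the steady-state schedule with a FIFO: each lifetime instance is enqueued at its production cycle and dequeued at its consumption cycle, and sharing is legal only if every read pops exactly the value it expects. This is an illustration of ours, not the Compatibility Test proved in [8]; the example lifetimes, the tie-breaking rule for same-cycle events and the number of simulated iterations are assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int def; int use; } Lifetime;   /* cycles relative to the start of an iteration */
    typedef struct { int cycle; int push; int tag; } Event;

    static int cmp_event(const void *a, const void *b) {
        const Event *x = a, *y = b;
        if (x->cycle != y->cycle) return x->cycle - y->cycle;
        return y->push - x->push;          /* assumption: writes precede reads in the same cycle */
    }

    /* Returns 1 if the n lifetimes can share a single queue over `iters` overlapped iterations. */
    int q_compatible(const Lifetime *lt, int n, int ii, int iters) {
        int nev = 2 * n * iters, i, k, e, head = 0, tail = 0, ok = 1;
        Event *ev = malloc(nev * sizeof(Event));
        int *fifo = malloc((size_t)n * iters * sizeof(int));

        e = 0;
        for (k = 0; k < iters; k++)        /* one instance of each lifetime per iteration */
            for (i = 0; i < n; i++) {
                ev[e++] = (Event){ lt[i].def + k * ii, 1, i * iters + k };
                ev[e++] = (Event){ lt[i].use + k * ii, 0, i * iters + k };
            }
        qsort(ev, nev, sizeof(Event), cmp_event);

        for (e = 0; e < nev && ok; e++) {
            if (ev[e].push) fifo[tail++] = ev[e].tag;                  /* value produced        */
            else ok = (head < tail && fifo[head++] == ev[e].tag);      /* FIFO read must match  */
        }
        free(ev); free(fifo);
        return ok;
    }

    int main(void) {
        Lifetime lt[] = { {0, 5}, {2, 7} };   /* hypothetical lifetimes sharing a queue, II = 4 */
        printf("%s\n", q_compatible(lt, 2, 4, 8) ? "Q-compatible" : "not Q-compatible");
        return 0;
    }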
The following sections of this paper present and discuss some of our latest findings towards the design of a clustered VLIW architecture model using queue register files. We believe this approach compares favourably against conventional ones on a number of aspects, including the required silicon area, register name space, register allocation, code generation and facilities to implement a scalable clustered machine [7] .
Dealing with Simultaneous Writes
The architecture model being developed assumes that values produced by an operation are stored in a register file by means of a write operation, eventually to be consumed by one or more other operations by means of read operations. As seen in the data dependence graph (ddg) in Fig. 1(a), a value produced by a given operation may be consumed by more than one operation. If a conventional register file is used, just one write operation is necessary, no matter how many times the value will be read (Fig. 1(b)). However, in our QRF model a value can be read from a queue only once, being destroyed afterwards. This implies that if a value is consumed by more than one operation it must be stored in distinct queues (Fig. 1(c)), requiring simultaneous write operations. Among other problems this would complicate the instruction format and the access to queues.
To deal with this problem we propose the introduction of a copy operation, which should be executed by a dedicated functional unit, as shown in Fig. 2. In terms of hardware cost it requires only an extra FU and the corresponding register file ports, which should be simple to implement.
Figure 2. Including a Copy Operation
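A minimal sketch of the kind of ddg rewrite involved is given below; it is our own illustration, and the exact arrangement of copy operations adopted in Fig. 2 may differ. A value with several consumers is routed through a chain of COPY nodes so that the producer writes a single queue and every queue is read exactly once; the edge-list representation and the node numbering are assumptions made for the example.

    #include <stdio.h>

    #define MAX_EDGES 128
    typedef struct { int src, dst; } Edge;        /* value flows from op src to op dst */

    /* Rewrites the edges of one producer in place and returns the new edge count.
     * next_op is the id given to the first COPY node created. */
    int insert_copies(Edge *e, int nedges, int producer, int next_op) {
        int idx[MAX_EDGES], k = 0, i, n = nedges, feed = producer;
        for (i = 0; i < nedges; i++)              /* collect the consumers of `producer` */
            if (e[i].src == producer) idx[k++] = i;
        if (k < 2) return n;                      /* single consumer: nothing to do      */
        for (i = 0; i + 1 < k; i++) {             /* one COPY per extra consumer         */
            int copy = next_op++;
            e[n++] = (Edge){ feed, copy };        /* current source feeds the COPY       */
            e[idx[i]].src = copy;                 /* the COPY feeds consumer i ...       */
            feed = copy;                          /* ... and forwards the value onward   */
        }
        e[idx[k - 1]].src = feed;                 /* last consumer reads the final COPY  */
        return n;
    }

    int main(void) {
        /* op 0 produces a value consumed by ops 1, 2 and 3, as in Fig. 1(a) */
        Edge ddg[MAX_EDGES] = { {0, 1}, {0, 2}, {0, 3} };
        int i, n = insert_copies(ddg, 3, 0, 4);
        for (i = 0; i < n; i++)
            printf("op%d -> op%d\n", ddg[i].src, ddg[i].dst);
        return 0;
    }

In this particular arrangement only the dedicated copy FU ever needs to write two destination queues; the producer and all ordinary consumers keep a single queue access each.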
Extra delays may increase the loop execution time due to the alterations made in the ddg to include copy operations, so we performed some experiments and compared the results against the ones obtained for a basic configuration not using copy operations [7]. The experimental framework consists of machine models of 4, 6 and 12 FUs, a total of 1258 innermost loops from the Perfect Club Benchmark [2], a version of Rau's Iterative Modulo Scheduling (IMS) algorithm [15], and register allocators for both conventional and queue register files, as described in detail in [7].
We found that using copy operations does not significantly increase the number of queues required to schedule a given fraction of the benchmark loops, especially for loops requiring 16 or 32 queues. It should be noticed that copy operations do not change the machine configuration required to schedule most of the loops of the benchmark, which consists of 32 queues, as seen in Fig. 3.
In a modulo scheduled loop, code execution at full performance occurs in the kernel stage, which accounts for the largest share of the total execution time. Our framework was able to schedule around 95% of the loops with the same II after the insertion of copy operations, which means that there is no performance degradation in the execution of the kernel for most of the cases. The remaining loops required a number of slots to allocate copy operations that could not be found within the original II, requiring an increase in its value (tolerable in most of the cases). The stage count is related to the number of loop iterations simultaneously in execution. A higher stage count value results in longer prologue and epilogue phases, the less efficient stages surrounding the kernel execution. We found that inserting copy operations in the ddg does not change the value of the stage count for most of the loops [7]. To summarize, the use of copy operations allows us to solve a particular problem arising from the use of a QRF, at the cost of extra FUs and register file ports, and a small increase in execution time for 5% of the loops. An interesting side effect observed is a small reduction in the required number of queues and queue positions for the most demanding loops.
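For reference, the usual modulo-scheduling cost model (a standard relation we assume here rather than quote from [7]) makes this trade-off explicit: with N iterations, initiation interval II and stage count SC, the loop takes roughly (N + SC - 1) x II cycles, so copy operations only cost time when they enlarge II or SC.

    /* Approximate execution time of a modulo-scheduled loop: prologue, kernel and
     * epilogue together span (n_iters + stage_count - 1) kernel slots of II cycles each. */
    long modulo_sched_cycles(long n_iters, long ii, long stage_count) {
        return (n_iters + stage_count - 1) * ii;
    }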
Increasing Parallelism Exploitation
One of the main objectives of this research work is to design a scalable VLIW machine. The original code of loop bodies does not always present enough operations to take full advantage of a highly parallel machine, requiring some actions in order to avoid sub-utilization [13] . Loop unrolling [5] is a well known compiler optimization that replicates the body of a loop a given number of times, called the unroll factor, resulting in a larger number of available operations for parallel execution. However, unrolling can also generate side effects that may compromise the benefits achieved. In this work we are particularly interested in a possible increase in register pressure.
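As an illustration of the transformation (a generic example of ours, not one of the benchmark loops), the loop below is unrolled with an unroll factor of 4, exposing four independent copies of the body to the scheduler; a short remainder loop handles trip counts that are not a multiple of the unroll factor.

    void saxpy_unrolled(float *x, float *y, float a, int n) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* unrolled body: four copies of y[i] += a*x[i] */
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)                 /* remainder iterations */
            y[i] += a * x[i];
    }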
We performed some experiments to measure the effectiveness of using loop unrolling with this architecture model, using the experimental framework described in [7]. A parameter called IIspeedup was used to compare the execution time of the kernel code with and without unrolling.
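One plausible definition of the metric (an assumption of ours; [7] may define it differently) is that the unrolled kernel completes U original iterations every II_unrolled cycles, so the speedup per original iteration is (U x II_base) / II_unrolled:

    double ii_speedup(int unroll_factor, int ii_base, int ii_unrolled) {
        /* kernel time per original iteration, before and after unrolling */
        return (double)(unroll_factor * ii_base) / (double)ii_unrolled;
    }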
The results of these experiments are presented in Fig. 4.
Figure 4. Initiation Interval Speedup
As expected, loop unrolling produces a moderate increase in the required number of queues, although 32 queues are still enough to schedule over 90% of the loops for any of the machine configurations considered [7]. An increase in the number of queues is less likely for the large loops because in general they do not require unrolling to exploit the available machine resources efficiently.
A machine with a large number of FUs connected to a single register file would require a disproportionately large multiported register file, so we partition the machine into clusters, each of them comprising a few FUs and a small register file. The design and implementation of individual clusters should not be an issue in such a model; however, the assignment of operations to clusters must be properly handled to conform with the inter-cluster communication topology and to minimize the II. In the first set of experiments we defined a cluster configuration comprising 3 FUs, namely 1 L/S (Load/Store), 1 ADD (Adder) and 1 MUL (Multiplier), plus an extra FU to support copy operations, as shown in Fig. 5(a). All of them connect to a private QRF. We assume that clusters are interconnected by a bidirectional communication ring, implemented by means of queue structures (Fig. 5(b)). These communication queues are used to allocate registers as if they were a cluster-private QRF, but with the difference that a value written by a FU from a given cluster will be read by a FU belonging to another one.
Figure 5. Clustered Machine Organization
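The sketch below summarizes this organization as a data structure. It is an illustration of ours; the field names are assumptions, and the sizes given are taken from the experiments reported later rather than from a hardware specification.

    enum fu_kind { FU_LOAD_STORE, FU_ADD, FU_MUL, FU_COPY };

    typedef struct {
        enum fu_kind fus[4];     /* 1 L/S, 1 ADD and 1 MUL, plus the copy FU           */
        int private_queues;      /* queues in the cluster's private QRF (8 suffice)    */
    } Cluster;

    typedef struct {
        Cluster clusters[6];     /* up to 6 clusters in the experiments reported       */
        int n_clusters;
        int ring_queues_per_dir; /* queues of the bidirectional ring (8 per direction) */
    } ClusteredMachine;

    /* Clusters are connected in a ring: cluster a can exchange values only with its
     * immediate neighbours through the communication queues. */
    int are_adjacent(const ClusteredMachine *m, int a, int b) {
        int n = m->n_clusters;
        return b == (a + 1) % n || b == (a + n - 1) % n;
    }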
The partitioning process is carried out by adding some heuristics to the IMS algorithm in order to avoid communication conflicts. These conflicts may arise because we do not as yet consider the introduction of operations to transfer a value between indirectly connected clusters. This limitation can prevent an operation from being scheduled in any of the clusters, leading to a backtracking process to unschedule conflicting operations. Backtracking may result in a larger II value, reducing the performance that would be achieved if a single cluster machine had been used instead. Accordingly, one of the aims of these first experiments is to measure how effective this partitioning algorithm is at distributing operations among clusters for the same II that would be required using a single cluster machine. We performed experiments for machine configurations of 12, 15, and 18 FUs (4, 5, and 6 clusters respectively) plus the required FUs to support copy operations. Loop unrolling was performed in all the experiments.
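A minimal sketch of the adjacency constraint is given below; it is our own paraphrase of the heuristic, and the actual IMS extension also checks slot availability and queue capacity and drives the backtracking, all of which are omitted here. It reuses are_adjacent from the previous sketch and assumes cluster_of[op] is -1 while an operation is still unscheduled.

    /* An operation may be placed in cluster `candidate` only if every already-scheduled
     * ddg neighbour sits in the same cluster or in an adjacent one, since values cannot
     * yet be forwarded across non-adjacent clusters. */
    int cluster_is_legal(const ClusteredMachine *m, const int *cluster_of,
                         const int *neighbours, int n_neigh, int candidate) {
        int i;
        for (i = 0; i < n_neigh; i++) {
            int c = cluster_of[neighbours[i]];
            if (c < 0) continue;                        /* neighbour not scheduled yet        */
            if (c != candidate && !are_adjacent(m, c, candidate))
                return 0;                               /* would need a non-adjacent transfer */
        }
        return 1;
    }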
The data presented in Figure 6 show the fraction of loops scheduled for a clustered machine exhibiting the same II as the schedules for the corresponding single cluster machine. There is no increase in the II value for 95% of the loops when a 4 cluster machine (12 FUs) is used, and when it occurs it is typically of one cycle only. When a 5 cluster machine (15 FUs) is used it is possible to schedule 84% of the loops with the same II, decreasing to 52% when a 6 cluster machine (18 FUs) is used. The results indicate that the partitioning scheme adopted tends to produce poorer results as the number of clusters increases, which is mainly due to the inability to move data values between non-adjacent clusters. We have noticed that the stage count remains the same for a large fraction of the loops, suggesting that the performance of a clustered machine should not be significantly influenced by variations in the stage count [7]. In terms of machine resources we have found that a cluster configuration comprising 8 queues for the private QRF and another 16 queues to implement the communication ring (8 to be used in each direction) should suffice for any of the machine models analysed. Figure 7 shows the basic unit that could be used to implement a clustered machine. A small fraction of loops would require additional resources; however, we believe that further improvements in partitioning are possible. Of course, in a practical system spill code will occasionally be required to deal with the finite numbers of queues and queue positions.

We performed static and dynamic analyses of the average number of operations issued per cycle (IPC). The static issue, IPC_static, accounts for the number of instructions issued in the kernel phase for one iteration. The dynamic issue, IPC_dynamic, takes into account the total number of iterations performed on each loop body to estimate its execution time, also including the prologue and epilogue phases.
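Read this way, the two metrics can be written as below, reusing the execution-time estimate sketched earlier; this is our reading of the description above, and the precise definitions used in [7] may differ in detail.

    double ipc_static(long ops_per_iter, long ii) {
        return (double)ops_per_iter / (double)ii;        /* kernel issue rate */
    }

    double ipc_dynamic(long ops_per_iter, long n_iters, long ii, long stage_count) {
        long cycles = (n_iters + stage_count - 1) * ii;  /* prologue + kernel + epilogue */
        return (double)(ops_per_iter * n_iters) / (double)cycles;
    }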
One of the analyses considered all loops of the benchmark, as seen in Fig. 8. Those results are compromised by the fact that several loops are simply not able to take advantage of the extra machine resources available, as they are constrained by recurrence circuits in the loop body. We performed a second analysis in order to gain insight into how well this architecture model deals with programs whose execution is constrained by the number of available FUs (Fig. 9). The differences found between a single cluster machine and a clustered machine comprising either 15 or 18 FUs are mainly due to the partitioning algorithm, indicating again the need for a scheme allowing move operations to transfer data between non-adjacent clusters. The average instruction issue is higher for the static analysis mainly because it does not account for the less efficient prologue and epilogue phases, which are included in the dynamic analysis. For the most aggressive machine a larger improvement was observed on dynamic issue than on static issue.
This happens because a few large loops, accounting for a large share of the total execution time, can take full advantage of the additional functional units, an effect that is only seen in dynamic analyses. This is more noticeable for clustered machines because these large loops can be scheduled without the partitioning algorithm degrading performance, which further emphasizes their influence on the overall results.
Conclusion
This paper has presented results on a new partitioning strategy to be used with a clustered VLIW architecture model based on register files organized by means of queues. A scheme based on copy operations was proposed to deal with problems associated with values to be consumed more than once. Experimental results showed that it is able to solve the problem with no performance degradation in most of the cases. The use of loop unrolling resulted in dramatic improvements in the exploitation of parallelism, which was accomplished without a significant increase in the number of machine resources required. Some assumptions about the model were made, concerning the organization of the hardware clusters and the communication topology among them. A partitioning algorithm has been developed, allowing us to obtain satisfactory results for machines composed of 4 and 5 clusters. However, its efficiency is compromised when more clusters are used, thus requiring a more sophisticated scheme to transfer values between indirectly connected clusters. Such a scheme should make it possible for a clustered machine to achieve performance figures similar to the ones observed for a single cluster architecture. A detailed discussion of the related techniques adopted is beyond the scope of this paper. In order to improve this model we are currently working on this new partitioning technique, on strategies to deal with loop invariants, and also on refining some of the hardware design specifications, including a complexity model for the queue register file.
