We discuss the architecture and microarchitecture of a scalable, parametric vector accelerator for the TLM algorithm. Architecture-level experimentation demonstrates an order-of-magnitude complexity reduction for vector lengths of 16 32-bit single-precision elements. We envisage the proposed architecture replicated in a SoC environment, thus forming a multiprocessor system capable of tapping parallelism at the thread level as well as the data level.
1. Introduction
Prior attempts to implement the TLM algorithm [1] on general-purpose architectures have fallen into two major categories: shared-memory, cache-coherent multiprocessors [2, 3] and distributed processors [4], with shared-memory machines often demonstrating better performance.
The TLM is a highly parallel three-dimensional numerical algorithm whose innermost loop can be accelerated via vectorization, thus tapping parallelism at the data level (DLP). Furthermore, the algorithm can be statically 'sliced' (threaded) along the second outer loop and executed on the previously mentioned platforms, with different processors executing different iterations. Such parallelism is known as thread-level parallelism [5] and is currently being pursued by all major microprocessor vendors.
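The two levels of parallelism described above can be illustrated with a minimal sketch. It assumes a three-deep loop nest in which the "second outer loop" is the middle loop, and it strip-mines the innermost loop into chunks of at most VLMAX elements. The names strip_mine, static_slices and sweep, the value VLMAX = 16 and the kernel callback are hypothetical and serve only as illustration; they are not taken from the accelerator's toolchain.

```python
VLMAX = 16  # assumed maximum vector length, in elements


def strip_mine(n, vlmax=VLMAX):
    """Yield (start, length) chunks covering range(n), each at most vlmax long."""
    start = 0
    while start < n:
        length = min(vlmax, n - start)
        yield start, length
        start += length


def static_slices(n, threads):
    """Split range(n) into contiguous slices whose sizes differ by at most one."""
    base, extra = divmod(n, threads)
    slices = []
    lo = 0
    for t in range(threads):
        hi = lo + base + (1 if t < extra else 0)
        slices.append(range(lo, hi))
        lo = hi
    return slices


def sweep(nx, ny, nz, threads, kernel):
    """Visit every mesh node once.

    Each slice of the middle loop would run on its own processor (TLP);
    here the slices are simply visited one after another.
    kernel(i, j, k0, vl) stands in for one vector operation on vl elements (DLP).
    """
    for my_js in static_slices(ny, threads):
        for i in range(nx):
            for j in my_js:
                for k0, vl in strip_mine(nz):
                    kernel(i, j, k0, vl)
```

In this sketch the slicing is decided before execution starts, which mirrors the static threading described in the text; the final chunk of each innermost sweep is shorter than VLMAX whenever nz is not a multiple of it.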
Successful acceleration of such parallel codes depends very much on the algorithmic communication pattern, which dictates the level of data sharing across the multiple processors. In the case of the TLM, the volume of data transferred between individual nodes is very high, and in extreme cases the data transfer during the connect part of the algorithm can be much more computationally expensive than the numerical calculations during scattering.
The performance differential between shared-memory and distributed machines is often attributed to such data-sharing issues. Custom architectures for accelerating TLM codes have been proposed in the past by Stathard and Pomeroy [6]. Our work proposes a custom vector approach to accelerating the inner loop of TLM codes, quite unlike this earlier work. In our case, an
0-7803-8562-4/04/$20.00 ©2004 IEEE.
The programmer's model specifies a parametric number of vector registers (VRMAX), each consisting of a parametric number of 32-bit single-precision elements (VLMAX). There is a scalar register file consisting of a parametric number of scalar 32-bit elements (SRMAX), used for virtual address computation, immediate passing and vector splat operations. Additionally, there are two vector accumulators, each holding VLMAX single-precision elements, and finally the vector length register (VLEN), which specifies the number of bytes that will be affected by the currently executing vector opcode.
The Instruction Set Architecture (ISA) of the accelerator includes standard vector floating-point operations except division, vector Load/Stores, and a generalized permute instruction. A large number of sub-element manipulation instructions (including vector splat instructions) can be synthesized based on the three-operand permute infrastructure. The ISA is summarized in 
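The programmer-visible state described above can be modelled in a few lines. The following is a minimal sketch under stated assumptions, not the accelerator's actual definition: the default values of VRMAX, VLMAX and SRMAX are hypothetical (the text leaves them parametric), the permute is modelled at whole-element rather than sub-element granularity, and its semantics (each result element is selected from the concatenation of two source vectors by a control vector) are an assumption made for illustration.

```python
from dataclasses import dataclass, field

ELEM_BYTES = 4  # one 32-bit single-precision element


@dataclass
class VectorState:
    vrmax: int = 8    # hypothetical number of vector registers
    vlmax: int = 16   # hypothetical number of elements per vector register
    srmax: int = 8    # hypothetical number of scalar registers
    vlen: int = field(init=False)   # vector length register, in bytes
    vregs: list = field(init=False)
    sregs: list = field(init=False)
    accs: list = field(init=False)

    def __post_init__(self):
        self.vregs = [[0.0] * self.vlmax for _ in range(self.vrmax)]
        self.sregs = [0] * self.srmax
        self.accs = [[0.0] * self.vlmax for _ in range(2)]  # two accumulators
        self.vlen = self.vlmax * ELEM_BYTES

    def active_elements(self):
        """Number of whole elements covered by VLEN, capped at VLMAX."""
        return min(self.vlen // ELEM_BYTES, self.vlmax)


def permute(a, b, control):
    """Assumed semantics: result[i] = (a concatenated with b)[control[i]]."""
    pool = a + b
    return [pool[c] for c in control]


def splat(a, index):
    """Splat synthesized from permute: every lane selects the same element."""
    return permute(a, a, [index] * len(a))
```

Under these assumptions a splat needs no dedicated datapath: it is a permute whose control vector repeats a single index, which is one way to read the claim that such instructions can be synthesized from the permute infrastructure.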
Methodology
We have applied a basic implementation of the SCN TLM algorithm [1] in which no external boundary conditions were used. In this particular case, a single output node was used as a diagnostic aid to verify correct operation. We used the accelerated scatter method of Naylor and Ait-Sadi as proposed in [9]. The non-vectorized (scalar) algorithm was profiled both in native mode (IA32/Linux) and on our simulated processor for consistency of results. Scalar code profiling revealed a scatter:connect complexity ratio of 63:37, averaged over all the studied configurations. Our simulation infrastructure is based around the SimpleScalar toolset [10], which provides a complete computer architecture modelling and performance evaluation environment. The compiler used was GCC 2.7.3 with optimizations (-O3).
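The 63:37 profile constrains how much any acceleration of a single phase can achieve. As a minimal sketch, assuming complexity is the dynamic instruction count and that each phase is reduced by its own independent factor, the overall reduction can be estimated as follows. The function name and the per-phase factors are hypothetical inputs and are not measured results.

```python
SCATTER_SHARE = 0.63  # scatter share of scalar complexity, from the 63:37 profile
CONNECT_SHARE = 0.37  # connect share of scalar complexity


def overall_reduction(scatter_factor, connect_factor):
    """Overall complexity reduction given a reduction factor for each phase."""
    remaining = SCATTER_SHARE / scatter_factor + CONNECT_SHARE / connect_factor
    return 1.0 / remaining


# Upper bound if only scatter were accelerated and connect left scalar:
bound = overall_reduction(float("inf"), 1.0)  # 1 / 0.37, about 2.7
```

Under these assumptions, accelerating scatter alone can never yield more than about a 2.7-fold reduction, so an order-of-magnitude reduction requires the connect phase to be reduced as well.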
Results
The reference problem chosen for benchmarking was a fixed mesh of 10^6 nodes. This number is convenient as it has the prime factorisation 2^6 × 5^6, which allows the aspect ratio of the problem space to be varied over a reasonable range whilst maintaining the same number of nodes. We measured the absolute complexity (dynamic instruction count) of the scalar code for all configurations of interest. Then, the vectorized code was run and its complexity recorded for a maximum vector length of up to 15 single-precision elements. Vector length 
