Fast SIMD-parallel execution units are available in most modern microprocessors. They provide an internal parallelism degree in the range from 2 to 16 and can accelerate many data-parallel algorithms. In this paper the suitability of five different SIMD units (Intel's MMX and SSE, AMD's 3DNow!, Motorola's AltiVec and Sun's VIS) for the simulation of neural networks is compared. The appropriateness of the instruction sets for the fast implementation of typical neural network operations is analyzed and the results of an experimental performance study are presented.
Introduction
A few years ago, many researchers studied the parallel implementation of artificial neural networks. Their objective was to accelerate the compute-intensive training phase in particular, by parallelizing the learning algorithm and by mapping the neural network and the training data set onto the processors of a parallel computer architecture [1].
Today, modern microprocessors operate at clock frequencies of 1 GHz or higher and already achieve satisfactory performance for many neural network applications. However, for certain compute-intensive learning tasks or real-time pattern recognition tasks the power of a single processor is often still insufficient. If large parallel systems cannot be used (e.g. on autonomous mobile robots), the instruction level parallelism (ILP) and data level parallelism (DLP) offered by current processors must be exploited to accelerate the neural network simulation on a single-processor system. ILP is an implicit form of parallelism based on the simultaneous execution of instructions by the superscalar processor architecture. Hammami [2] has shown that for four different neural network applications the performance gain due to ILP is usually low, because internal data dependencies prevent the processor from executing instructions simultaneously. DLP was introduced in most microprocessor architectures several years ago, mainly to accelerate signal and multimedia applications. Here several 8-bit, 16-bit or 32-bit data elements are packed into a 64-bit or 128-bit register, and all arithmetic operations can be executed on corresponding elements of two registers in a SIMD-parallel mode. The user must explicitly program all data-parallel operations and is responsible for the attainable performance.
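The packing of several short elements into one wide register can be illustrated without any SIMD hardware. The following is a minimal sketch in portable C that emulates a lane-wise addition of four 16-bit elements inside a single 64-bit word. The function names and the lane order (element 0 in the least significant bits) are assumptions made for this illustration; they do not correspond to any instruction set discussed here.

```c
#include <stdint.h>

/* Pack four 16-bit elements into one 64-bit word; element 0 is least significant. */
uint64_t pack4x16(uint16_t e0, uint16_t e1, uint16_t e2, uint16_t e3)
{
    return (uint64_t)e0 | ((uint64_t)e1 << 16) |
           ((uint64_t)e2 << 32) | ((uint64_t)e3 << 48);
}

/* Extract element i (0..3) from a packed word. */
uint16_t lane4x16(uint64_t v, int i)
{
    return (uint16_t)(v >> (16 * i));
}

/* Lane-wise wrap-around addition of four 16-bit elements.
   The top bit of every lane is handled separately so that no carry
   can cross from one lane into the next. */
uint64_t add4x16(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8000800080008000ULL;   /* top bit of every lane */
    uint64_t low_sum = (a & ~top) + (b & ~top);   /* add the lower 15 bits per lane */
    return low_sum ^ ((a ^ b) & top);             /* add the top bits, dropping the carry out */
}
```

A real SIMD unit performs the same lane-wise operation with one instruction and, depending on the instruction chosen, with saturation instead of wrap-around.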
Neural network algorithms are similar to signal processing algorithms because they are also based on vector/matrix operations. Furthermore, encoding all neural network variables in 16-bit words is sufficient for training [3]. Thus, DLP seems to be suitable for the fast simulation of neural network algorithms. However, no detailed analysis of the achievable performance gain has been published so far. Only Gaborit et al. [4] have shown that Intel's MMX can accelerate distance calculations in a neural network application, but they do not consider training.
In this paper a detailed analysis of the suitability of five different SIMD-parallel execution units for the fast simulation of neural networks is provided. First, the most important neural operations that must be accelerated by exploiting DLP are presented in the next section. Section 3 compares the appropriateness of the instruction sets of all five selected SIMD units for implementing neural operations. The results of an experimental performance study are presented in Section 4. Suggestions for future improvements of SIMD execution units conclude this paper.
2 Data-parallel neural network simulation

3 Instruction sets of SIMD extensions

Five SIMD-parallel execution units have been selected for this experimental study. Table 1 illustrates the main differences: Intel's MMX [5] (also available in current AMD processors) and Sun's VIS [6] allow only operations on integer data, whereas Intel's SSE [7] and AMD's 3DNow! [8] support only the 32-bit floating point data format. SSE and 3DNow! provide a few additional integer instructions for improving MMX. Motorola's AltiVec [9] is the only SIMD extension that is designed as a general-purpose vector unit. It is suited for both integer and floating point calculations and offers the most powerful instruction set.

Table 2 lists all SIMD instructions that are required for the data-parallel implementation of the neural operations I to V on all five SIMD units. Whereas a unique instruction is available for each SIMD-parallel operation on 32-bit float data elements, a variety of arithmetic instructions (for different data sizes, for different types, with/without rounding, with/without saturation) exists in the case of integer operands.
According to the results of Holt [3], a precision of 16 bits should be selected for neural network implementations on integer SIMD units. Unfortunately, the frequently used SIMD-parallel multiplication of 16-bit numbers must be realized in a different way on all five units. Fig. 1a shows the correct fixed point multiplication scheme applicable in all cases. Here certain 16-bit result windows must be selected out of p simultaneously computed 32-bit products.
However, such a powerful instruction is not available on any multimedia unit. Instead it must be simulated by sequences of several instructions. MMX computes the upper 16 and the lower 16 bits of four 32-bit products separately with two different instructions. Two merge, two shift and one pack instruction are required for storing the four correct 16-bit results in a single register (see Fig. 1b). AltiVec multiplies corresponding 16-bit elements on even or on odd index positions of two 128-bit registers. By using five additional instructions (2 shift, 2 merge and 1 pack, see Fig. 1c) a 128-bit register containing

For the SIMD-parallel computation of vector-matrix products, a special sum of products instruction calculating $a_i b_i + a_{i+1} b_{i+1}$ and a vector reduction (for the final addition of the partial sums) are useful. Unfortunately, they are available only in some instruction sets and only for a few data types (see Table 2). Hardware support for saturation is important for implementing correct SIMD-parallel integer additions. Rounding is also advantageous because it increases the precision during iterative neural network training, but it is supported only by certain AltiVec and 3DNow! instructions.

On all SIMD units, further instructions are available for reordering the elements in a register. The pack instruction is required for selecting 16-bit windows out of one or two registers that contain several 32-bit products or sums (compare the multiplication example in Fig. 1). Only Sun's VIS allows the selection of 16-bit windows at arbitrary positions (however without rounding). All instruction sets offer a merge instruction (also called unpack) that mixes the elements of two registers into a new register according to a fixed scheme. For MMX it is mandatory for concatenating the upper and the lower halves of the products (compare Fig. 1b). 3DNow! and SSE provide generalized shuffle instructions, and only AltiVec allows arbitrary permutations.
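The interplay of sum of products, vector reduction, window selection, rounding and saturation can be made concrete with a scalar model. The sketch below computes a 16-bit fixed point dot product in portable C: pairs of products are summed into two 32-bit partial sums, the partial sums are reduced, and a rounded and saturated 16-bit window is selected from the 32-bit result. All function names are hypothetical, the window position `shift` is a free parameter, and the code assumes that the partial sums do not overflow 32 bits. It models the arithmetic only, not the instruction sequence of any particular SIMD unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow a 32-bit value to 16 bits with signed saturation (the "pack" step). */
static int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Saturating 16-bit addition: clamps instead of wrapping around. */
int16_t add_sat16(int16_t a, int16_t b)
{
    return saturate16((int32_t)a + (int32_t)b);
}

/* Select the 16-bit window that starts at bit `shift` (1..16) of a 32-bit
   value, with rounding to nearest and saturation. */
int16_t window16(int32_t x, int shift)
{
    int64_t scale = (int64_t)1 << shift;
    int64_t rounded = (int64_t)x + scale / 2;
    int64_t q = rounded / scale;
    if (rounded % scale < 0) --q;           /* floor division for negative values */
    return saturate16((int32_t)q);
}

/* Sum of products on four 16-bit elements: yields the two 32-bit values
   a0*b0 + a1*b1 and a2*b2 + a3*b3. */
static void sum_of_products4(const int16_t *a, const int16_t *b, int32_t r[2])
{
    r[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    r[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Fixed point dot product of two vectors whose length n is a multiple of 4.
   Assumption: the 32-bit partial sums do not overflow. */
int16_t dot_fixed16(const int16_t *a, const int16_t *b, size_t n, int shift)
{
    assert(n % 4 == 0);
    int32_t partial[2] = { 0, 0 };          /* two 32-bit lanes of partial sums */
    for (size_t i = 0; i < n; i += 4) {
        int32_t s[2];
        sum_of_products4(a + i, b + i, s);
        partial[0] += s[0];
        partial[1] += s[1];
    }
    int32_t total = partial[0] + partial[1]; /* vector reduction */
    return window16(total, shift);
}
```

With 8 fractional bits (an assumed format in which 256 represents 1.0), calling `dot_fixed16` with `shift = 8` returns the result in the same format as the operands.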
Replication of scalar operands is required in operations III, IV and V and must be implemented either by an appropriate sequence of merge instructions or by a single permutation instruction.
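The merge-based variant can be modeled as follows. This is a sketch under stated assumptions: a register is represented as an array of four 16-bit elements, the merge operations interleave the low or high halves of two registers in the way an unpack instruction typically does, and all function names are hypothetical. Two merges of a register with itself are enough to replicate any one of the four elements.

```c
#include <stdint.h>
#include <string.h>

/* Interleave the two low 16-bit elements of x and y: {x0, y0, x1, y1}. */
void merge_low16(const int16_t x[4], const int16_t y[4], int16_t r[4])
{
    int16_t t[4] = { x[0], y[0], x[1], y[1] };
    memcpy(r, t, sizeof t);
}

/* Interleave the two high 16-bit elements of x and y: {x2, y2, x3, y3}. */
void merge_high16(const int16_t x[4], const int16_t y[4], int16_t r[4])
{
    int16_t t[4] = { x[2], y[2], x[3], y[3] };
    memcpy(r, t, sizeof t);
}

/* Interleave the low 32-bit halves of x and y: {x0, x1, y0, y1}. */
void merge_low32(const int16_t x[4], const int16_t y[4], int16_t r[4])
{
    int16_t t[4] = { x[0], x[1], y[0], y[1] };
    memcpy(r, t, sizeof t);
}

/* Interleave the high 32-bit halves of x and y: {x2, x3, y2, y3}. */
void merge_high32(const int16_t x[4], const int16_t y[4], int16_t r[4])
{
    int16_t t[4] = { x[2], x[3], y[2], y[3] };
    memcpy(r, t, sizeof t);
}

/* Replicate element `index` (0..3) of v into all four positions of r
   by merging the register with itself twice. */
void replicate(const int16_t v[4], int index, int16_t r[4])
{
    int16_t t[4];
    /* First merge: {v0,v0,v1,v1} or {v2,v2,v3,v3}. */
    if (index < 2) merge_low16(v, v, t);
    else           merge_high16(v, v, t);
    /* Second merge: keep the pair that holds the wanted element. */
    if (index % 2 == 0) merge_low32(t, t, r);
    else                merge_high32(t, t, r);
}
```

A unit with a permutation or shuffle instruction achieves the same effect in a single step, which is why replication is cheaper there.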
streyDLP: submitted to Parco2001 on July 31, 2001

4 Performance analysis

All neural operations I to V described in Section 2 were implemented on the five selected architectures. Either 32-bit floating point (SSE, 3DNow!), 16-bit fixed point (mapped to 16-bit integer for MMX, VIS) or both (AltiVec) were used as parallel data types for all neural network variables. For Intel and AMD processors, the routines were coded in assembly language. A C language interface was available only for the SIMD extensions of Sun and Motorola. For reference, all operations were also implemented in C using the float data type. The GNU C compiler and the GNU assembler were used for Intel and AMD processors, the Sun Workshop 5.0 C Compiler for Sun and the Metrowerks CodeWarrior for Motorola's PowerPC. Compiler optimizations were enabled. Special care was taken to align all data elements correctly because the penalty for misaligned data is high on most SIMD units. The hardware platforms were PCs with either a 500 MHz Pentium III or Athlon, Sun workstations with a 400 MHz Ultra II and a Macintosh containing a 500 MHz G4 PowerPC processor. The execution time of all five neural operations was measured on the SIMD units and on the floating point units of all processors. Then the speedup of the SIMD code compared to the reference float code was calculated for each operation.

The measured speedup for many operations was surprisingly high (up to 5.6 times faster than the float implementation for small matrices). This anomaly results on the one hand from several powerful SIMD instructions (such as sum of products or vector reduction) that replace more than p scalar instructions.
On the other hand, some SIMD instructions (especially SIMD multiplications) provide shorter latencies than their corresponding scalar counterparts. For some operations (especially III and V) the SIMD performance remains below the theoretical speedup on certain SIMD units. This effect is due either to the high overhead for the required data reordering (especially replication of scalar operands) or to the lack of appropriate instructions.
To study not only the SIMD performance of single operations but also that of a complete neural network, a Radial Basis Function (RBF) network was implemented on all SIMD units and processor cores. The k-m-n RBF network represents a typical artificial neural network model that can be used for many approximation and classification tasks. It consists of k input nodes, m RBF neurons with radially symmetric Gaussian output functions in the first layer and n simple linear neurons in the second layer, and it is trained by gradient descent.
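For orientation, the mapping computed by such a k-m-n network can be written down explicitly. The notation below is an assumption made for this illustration and is not taken from the text: input vector x with k components, center c_j and width sigma_j of RBF neuron j, and weight w_{lj} from RBF neuron j to output neuron l. The exact parameterization of the Gaussian used in the implementation is not stated, so the common form with 2 sigma_j^2 in the denominator is shown.

```latex
\begin{align}
  h_j(\mathbf{x}) &= \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{c}_j \rVert^{2}}{2\sigma_j^{2}}\right),
    & j &= 1,\dots,m \\
  y_l(\mathbf{x}) &= \sum_{j=1}^{m} w_{lj}\, h_j(\mathbf{x}),
    & l &= 1,\dots,n
\end{align}
```

The first equation is dominated by a distance calculation over k elements per neuron, the second by a matrix-vector product, which is why both map well onto vector operations.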
All neural operations I to V described in Section 2 are contained in the underlying algorithm. The recognition time and the training time for the presentation of a single input vector to RBF networks of three different sizes were measured, and the speedup of the SIMD-parallel code relative to the reference float code was calculated (see Table 3).
Recognition with RBF networks is mainly based on operations I and II, which can be implemented very efficiently on most SIMD units (see above). It is therefore not surprising that the speedup for recognition is higher (up to 8.6) than the speedup for the RBF training phase (up to 6.6), which requires all five neural operations. When the network size is varied, two opposing effects can be observed. On the one hand, the speedup of the integer implementation (e.g. on MMX or AltiVec) can increase if the network is enlarged. This effect results from the relation between the size of the weight matrices and the size of the L1 or L2 caches. Whereas larger 16-bit integer matrices still fit in the caches, this is not always the case for the larger reference 32-bit float matrices. On the other hand, the floating point implementations on SSE and 3DNow! show the reverse effect. Here, for larger matrices, the fast SIMD units can no longer be supplied with 32-bit data elements at sufficiently high speed.
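The cache argument is simple arithmetic and can be checked with a few lines of C. The helper names, the matrix dimensions and the cache capacity used in the usage note are hypothetical values chosen for illustration; they are not the network sizes or cache sizes of the measured systems.

```c
#include <stddef.h>
#include <stdint.h>

/* Memory footprint in bytes of a rows x cols matrix with the given element size. */
size_t matrix_bytes(size_t rows, size_t cols, size_t elem_size)
{
    return rows * cols * elem_size;
}

/* Returns 1 if a block of the given size fits into a cache of cache_bytes, else 0. */
int fits_in_cache(size_t bytes, size_t cache_bytes)
{
    return bytes <= cache_bytes;
}
```

For example, a hypothetical 128 x 128 weight matrix occupies `matrix_bytes(128, 128, sizeof(int16_t))` = 32768 bytes in 16-bit fixed point but `matrix_bytes(128, 128, sizeof(float))` = 65536 bytes as 32-bit float, so with an assumed cache capacity of 32 KiB only the 16-bit version would fit.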
Conclusions
The instruction sets of the analyzed SIMD units are not optimal for the implementation of neural networks. Several instructions are missing for an efficient implementation of all important neural operations, e.g.: vector reduction and a general multiply&add for all data types, fast replication of a specific element within a SIMD register or special load/store instructions for the handling of vector and matrix sizes that are not a multiple of the parallelism degree.
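Without such load/store instructions, sizes that are not a multiple of the parallelism degree are usually handled in software, either by padding the data or by a scalar remainder loop. The following sketch shows both approaches for a vector addition. The parallelism degree P = 4 and all function names are assumptions, and the inner loop merely stands in for one SIMD instruction operating on P elements.

```c
#include <stddef.h>

static const size_t P = 4;   /* assumed parallelism degree */

/* Smallest multiple of p that is not less than n (length after padding). */
size_t padded_length(size_t n, size_t p)
{
    return (n + p - 1) / p * p;
}

/* Element-wise addition r = a + b for arbitrary n. */
void vec_add(const float *a, const float *b, float *r, size_t n)
{
    size_t main_end = n - n % P;          /* largest multiple of P below or equal to n */
    size_t i = 0;
    for (; i < main_end; i += P) {
        /* Stands in for one SIMD addition on P elements. */
        for (size_t lane = 0; lane < P; ++lane)
            r[i + lane] = a[i + lane] + b[i + lane];
    }
    for (; i < n; ++i)                    /* scalar tail for the remaining elements */
        r[i] = a[i] + b[i];
}
```

Padding avoids the scalar tail at the cost of extra memory and of initializing the padding elements so that they do not disturb reductions.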
Furthermore, for integer SIMD units a special 16-bit × 16-bit → 16-bit parallel multiply instruction that supports an arbitrary selection of saturated and rounded result windows out of 32-bit products would be advantageous. Nevertheless, the overall SIMD performance turned out to be fairly good. A speedup of up to 9.8 for single neural operations and a total speedup of up to 6.6 for neural network training could be achieved. For very large neural networks the simulation can be accelerated further by combining cluster or SMP computing with local SIMD computing in each processor node.
