Recent hardware developments have made shared memory an efficient parallel programming paradigm. The main goals of the work reported here are to speed up execution and to reduce the development time of parallel neural network implementations. To allow for such implementations, a library has been defined as a bridge between neural network parallelism and general purpose MIMD computer parallelism.
I. Introduction
Artificial neural networks are time consuming, especially during the learning phase. This cost, together with the weak computational performance of a computer compared with a human brain [4], makes it difficult to test complex, large connectionist models, such as those inspired by biological reality, those with a high dimensional input space or those with a large number of neurons. Another property of neural networks is that they contain a large amount of natural parallelism. Consequently, it seems natural to use parallel hardware to implement connectionist models. A parallel implementation can deeply exploit the inherent parallelism of neural networks (a distributed set of neurons computing their activity simultaneously) and their synchronism (a neural activity is computed from the outputs of all the connected neurons), and thus speed up network execution. Furthermore, parallel machines are nowadays more widespread and easier to use, so implementing artificial neural networks on them to benefit from their natural parallelism has become a serious option [7].
Numerous attempts have already been made in that direction [SI. Because neural network parallelism is very different from the parallelism of modern general purpose parallel computers, it is difficult to map artificial neural network models directly onto a parallel machine (see figure 1). Consequently, most parallel implementations address only one specific neural model. Beyond dedicated implementations, this paper proposes a general approach which consists of a library for adapting neural network parallelism to general purpose computer parallelism.
0-7803-5529-6/99/$10.00 ©1999 IEEE
Making a set of programming tools available to neural network developers would allow them to use parallel hardware resources efficiently without the need for specific knowledge. These tools must offer sufficient flexibility so that a wide range of neural networks with diverse interconnection structures and computational needs can be processed efficiently [10].
[Figure 1: neuronal network parallelism and MIMD parallelism]
Hardware and biological parallelisms
A. Neuronal parallelism
The brain is often described as a massively parallel and distributed machine. It is composed of billions of elementary processors, the neurons, which send their activity through unidirectional channels, the connections.
Artificial neural networks have been built on the same principles, according to the formal neuron model [5], a simple computing automaton with a set of weighted inputs and one output. Simple distributed activities, together with the principle of spreading information within the network, characterise this kind of structure as fine grain parallelism with a message passing communication paradigm.
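As a minimal sketch of the formal neuron described above, the function below computes a weighted sum of the inputs and applies a hard threshold. The name formal_neuron, the threshold parameter theta and the binary output are assumptions made for illustration; they are not taken from the library.

```c
#include <stddef.h>

/* Formal neuron: weighted sum of the inputs followed by a threshold.
 * The hard threshold at `theta` is an illustrative assumption. */
double formal_neuron(const double *weights, const double *inputs,
                     size_t n, double theta)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += weights[i] * inputs[i];   /* one weighted input per connection */
    return sum >= theta ? 1.0 : 0.0;     /* single output */
}
```

With two unit weights and a threshold of 1.5, this neuron behaves like a logical AND of two binary inputs, which shows how little computation each neuron performs between two communications.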
B. Hardware parallelism
Parallel computers are classified into two main classes: SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple Data) machines [1]. SIMD machines use a great number (up to 65536) of specific processors that execute small computations and many data exchanges. Taking inspiration from natural neural networks leads to a mapping of one neuron per processor, so SIMD machines seem well suited to neural computation. Some tools are available, like Cupit [8], but the design and production of these machines were stopped several years ago for technical and financial reasons.
MIMD parallel machines comprise up to several hundred very powerful processors, such as those of workstations. This kind of computer supports only coarse and medium grain parallel applications. Originally, they were distributed memory computers, and communication between processors used the message passing paradigm.
Although this communication protocol is the same as that of neural networks, the parallel properties differ: unlike these machines, neurons exchange short messages both in large numbers (high connectivity) and often (a simple computation at each time step). It is thus hard to obtain good performance if the application needs frequent communication.
To use this parallel hardware technology, the connectionist community had to adapt its models. Commonly, models are modified to reduce communication between processors: communication between neurons is reduced and the topology of the networks is changed accordingly. Therefore, with a distributed memory MIMD computer, a connectionist programmer must know parallel programming and parallel architecture to obtain an efficient implementation. Moreover, a specific solution must be developed for each network model to be parallelised [2].
In the past few years, a new kind of MIMD parallel computer has appeared: Distributed Shared Memory (DSM) computers [9]. Currently, DSM technology scales up to 128 processors. It supports various parallel paradigms, like message passing and memory sharing (see figure 2), and sharing memory is less time consuming than sending messages when shared memory is well managed.
IV. Our approach
The objective of a parallel implementation is usually to increase the speed of program execution and to deal with larger problems. But this is not our only aim. Specific objectives of our work are linked to the fact that it is intended to provide help to connectionist developers, who are rarely specialists in parallel architectures.
As a consequence, the mapping from fine grain neuronal parallelism to coarse grain architectural parallelism must be transparent for the user. This transparency can be obtained if only a few simple functions are offered to the user (which is why we chose to develop a library and not a complete new language with its own syntax, like Parcel [12]). To make the library easier to use, we also require that the same code using it be compilable and efficient on both sequential and MIMD parallel shared memory machines.
Neural networks are characterised by fine grain parallelism and message passing communication, whereas a DSM MIMD machine offers coarse grain parallelism and shared memory. To allow neural networks to be implemented on such a machine, we chose to develop a library as an interface between these paradigms. As "C" is the language commonly used by the connectionist community, our library is implemented in this language.
Another objective of our work is to facilitate the implementation of biologically inspired neural networks, including such an interesting distributed characteristic as topology. To implement biologically inspired neural networks, developers have to manage many synchronisation problems. As the brain can be seen as asynchronous, each neuron is activated when it receives a signal through its connections. To simulate this behaviour, it is necessary to synchronise all the neurons and to manage the updating of their outputs.
We propose to develop neural networks with two levels of parallelism. First, the library allows connectionist programmers to develop their model at the fine grain of the biological inspiration, by building the neurons. Second, networks built with our library can be executed on general purpose MIMD shared memory computers.
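The mapping between the two levels can be pictured as assigning groups of neurons to processes. The sketch below shows one possible static block distribution; the paper does not describe the library's actual policy, so the function block_range, the type neuron_range and the distribution itself are assumptions made for illustration only.

```c
/* Range [first, first + count) of neuron indices handled by one process. */
typedef struct {
    int first;
    int count;
} neuron_range;

/* Split nb_neuron neurons over nb_pc processes as evenly as possible.
 * Hypothetical helper: the first (nb_neuron % nb_pc) processes get one
 * extra neuron. Requires nb_pc > 0 and 0 <= rank < nb_pc. */
neuron_range block_range(int nb_neuron, int nb_pc, int rank)
{
    int base = nb_neuron / nb_pc;
    int extra = nb_neuron % nb_pc;
    neuron_range r;
    r.count = base + (rank < extra ? 1 : 0);
    r.first = rank * base + (rank < extra ? rank : extra);
    return r;
}
```

With 10 neurons and 4 processes, this gives blocks of 3, 3, 2 and 2 neurons. Each process then runs many fine grain neurons sequentially, which is how coarse grain hardware can host a fine grain model.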
V. Library Overview
Our library allows networks to be built using the smallest grain of the connectionist formalism, the neuron. A neural network is a set of autonomous and synchronous neurons communicating by message passing through previously declared connections. A network is implicitly created by defining a set of neurons and their connections.
A. Neurons
A neuron is an autonomous entity which is activated when it receives a signal from at least one of the neurons to which it is connected, its inputs. Its own activation is sent through its output. This activation is determined from its last activation, the signals it receives from its inputs and the state of its environment.
In the same way, with our library, each neuron has a single output and receives its inputs by connecting itself to the output of any other neuron. A neuron has no limit on its number of connections, and each connection is unidirectional. Through these links, for each cycle, it receives the value of the output of the origin neuron evaluated at the end of the previous cycle.
In the simulation, time is divided into cycles. At the end of each cycle, outputs are updated. Within a cycle, a neuron can change its output and every neuron can read this variable, but the value read is the one defined at the preceding cycle.
Designed with our tools, a neuron is a function computing its output from its local variables, its input variables and the global variables of the program (see figure 3). The code of each type of neuron is built like a sequential function. This code describes, for each cycle of the execution, the tasks executed by the neuron. To simplify the implementation, the code that the user has to program is divided into three subfunctions that we call the characteristic functions of the neuron.
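The cycle semantics described above can be sketched with two buffers per output: one that readers see during the cycle and one that receives new values. This is only an illustration of the behaviour; the paper does not say how the library stores outputs, and the names outputs, read_output, write_output and end_cycle are hypothetical.

```c
#include <string.h>

#define N_NEURONS 3

typedef struct {
    double current[N_NEURONS];  /* values visible to readers during the cycle */
    double next[N_NEURONS];     /* values written during the cycle */
} outputs;

/* Reading always returns the value fixed at the end of the previous cycle. */
double read_output(const outputs *o, int neuron)
{
    return o->current[neuron];
}

/* Writing only touches the pending buffer, so readers are not disturbed. */
void write_output(outputs *o, int neuron, double value)
{
    o->next[neuron] = value;
}

/* End of cycle: all pending values become visible at once. */
void end_cycle(outputs *o)
{
    memcpy(o->current, o->next, sizeof o->current);
}
```

Because reads and writes never touch the same buffer within a cycle, neurons can be evaluated in any order, or in parallel, and still produce the same result.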
B. Definition of some library functions
The following functions, provided by the library, are used to initialize and run the network:
makenet(nb-pc, nb-neuron)
This function declares the number of processes nb-pc and the number of neurons nb-neuron used to run the network.
neuron(init-fct, iter-fct, term-fct, registration)
This function defines a neuron. init-fct, iter-fct and term-fct are the characteristic functions described above; registration is the neuron identity.
executenet()
This function executes the network composed of the neurons defined with the neuron() function.
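To show how the three calls fit together, here is a sequential stand-in written for this text. It is not the library's code: the names make_net and execute_net, the function pointer signatures, the convention that an iteration function returns 0 when its neuron dies, and the demo neuron are all assumptions.

```c
#include <stddef.h>

#define MAX_NEURONS 16

/* Assumed signatures: each characteristic function gets the neuron identity. */
typedef void (*init_fct)(int registration);
typedef int  (*iter_fct)(int registration);  /* assumed: 0 means the neuron dies */
typedef void (*term_fct)(int registration);

typedef struct {
    init_fct init;
    iter_fct iter;
    term_fct term;
    int alive;
} neuron_slot;

static neuron_slot net[MAX_NEURONS];
static int net_size = 0;

/* Stand-in for makenet: the process count is ignored in this sequential mock. */
void make_net(int nb_pc, int nb_neuron)
{
    (void)nb_pc;
    net_size = nb_neuron > MAX_NEURONS ? MAX_NEURONS : nb_neuron;
    for (int r = 0; r < net_size; r++) {
        net[r].init = NULL;
        net[r].iter = NULL;
        net[r].term = NULL;
        net[r].alive = 0;
    }
}

/* Stand-in for neuron(): registers the characteristic functions of one neuron. */
void neuron(init_fct init, iter_fct iter, term_fct term, int registration)
{
    if (registration < 0 || registration >= net_size)
        return;
    net[registration].init = init;
    net[registration].iter = iter;
    net[registration].term = term;
}

/* Stand-in for executenet: runs cycles until every neuron is dead and
 * returns the number of cycles executed. */
int execute_net(void)
{
    int cycles = 0;
    int alive = 0;
    for (int r = 0; r < net_size; r++) {
        if (net[r].init) net[r].init(r);
        net[r].alive = net[r].iter != NULL;
        alive += net[r].alive;
    }
    while (alive > 0) {
        alive = 0;
        for (int r = 0; r < net_size; r++) {
            if (!net[r].alive) continue;
            net[r].alive = net[r].iter(r) != 0;
            alive += net[r].alive;
        }
        cycles++;
    }
    for (int r = 0; r < net_size; r++)
        if (net[r].term) net[r].term(r);
    return cycles;
}

/* Demo neuron type (hypothetical): counts three cycles, then dies. */
static int counters[MAX_NEURONS];
void demo_init(int r) { counters[r] = 0; }
int  demo_iter(int r) { return ++counters[r] < 3; }
void demo_term(int r) { (void)r; }
```

A program would call make_net(1, 2), register both neurons with neuron(demo_init, demo_iter, demo_term, r) for r = 0 and 1, and then call execute_net(). The user code contains no parallel construct, which is the transparency the library aims for.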
Other functions, provided by the library, allow the programmer to build the code of any type of neuron used in the network:
itsoutput(myoutput, copyoutput)
This function is used to declare the output of the neuron, myoutput, and the function which copies this output, copyoutput.
connectin(target)
This function is used to create a connection between the neuron and the output of the neuron whose identity is target. It returns a channel.
We present here one example of implementation with our library: the learning phase of a Kohonen map [3]. In order to exploit the parallel properties of this kind of neural model, we design our network with two types of neurons. Each neuron represents one prototype. One particular neuron, called the master (neuron 0 in our implementation), has an additional role: for each cycle, it determines the winner and the neighborhood of the winner. Each cycle is divided into two steps (Tables I and II). Every other neuron is connected to the output of the master. This output contains the next example to process, the registration of the last winner and the neighborhood of the winner. The master is connected to all the others, which send their distance from the last example through their output.
Concerning implementation, we now present the most specific steps. First, to define, for each type of neuron, its four characteristic functions (init, iter, term and copy-output), we need to define the type of the local variable of the neuron (if it is structured). The definition of the type of the master is presented in figure 4. As the master defines one output, we need to write the function that copies it, as figure 5 shows.
The initialization function of each type of neuron is then built (see figure 6 for the master). Three important actions are distinguished: (1) the local variables are allocated and defined, (2) the output is defined and (3) the inputs of the master are declared, each link being a connection to one neuron (here link i corresponds to neuron i). When each neuron has created its connections, the topology of the network is built. Afterwards, the iteration function of each type of neuron is defined. It describes the work of the neuron for each cycle. Figure 7 illustrates several characteristic actions. In (1) the neuron recovers its local variables, according to which it chooses its action (2). In (3), the master recovers the input corresponding to its link i, for each i, and it uses this input in (4) (the other neurons just declare a float, their activation, as output). In (5), learning ends and the master dies. Finally, the termination functions are defined, like that of the master (figure 8). The main aim of these functions is to free memory. Saving the weights is also possible at this point.
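The computations behind this example can be sketched as plain C helpers, independently of the library. The function names, the squared Euclidean distance, the square neighborhood on a grid and the fixed learning rate are assumptions made for illustration; the figures of the paper, not reproduced here, hold the actual code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Neuron side: squared Euclidean distance between prototype w and example x.
 * This is the value a neuron would publish on its output. */
double distance2(const double *w, const double *x, size_t dim)
{
    double d = 0.0;
    for (size_t i = 0; i < dim; i++)
        d += (w[i] - x[i]) * (w[i] - x[i]);
    return d;
}

/* Master side: registration of the neuron with the smallest distance
 * (the first one wins in case of a tie). */
int find_winner(const double *dist, int nb_neuron)
{
    int winner = 0;
    for (int i = 1; i < nb_neuron; i++)
        if (dist[i] < dist[winner])
            winner = i;
    return winner;
}

/* Assumed square neighborhood on a grid of the given width: returns 1 if
 * neurons a and b are at most `radius` apart along both axes. */
int in_neighborhood(int a, int b, int width, int radius)
{
    int dx = abs(a % width - b % width);
    int dy = abs(a / width - b / width);
    return dx <= radius && dy <= radius;
}

/* Neuron side: move the prototype toward the example by a factor `rate`. */
void update_prototype(double *w, const double *x, size_t dim, double rate)
{
    for (size_t i = 0; i < dim; i++)
        w[i] += rate * (x[i] - w[i]);
}
```

In terms of the two steps of a cycle, each neuron would first publish distance2 of its prototype, then the master would call find_winner on the values read through its links, and each neuron inside the winner's neighborhood would call update_prototype.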
VII. Performances
We present here the performance of the Kohonen map built as described above. Figure 9 presents the results obtained with a 100x100 grid for 100,000 learning iterations. The reference is a sequential version written in "C" language, compiled and executed on the same machine. The cost of using the first version of the library is thus indicated for one processor. Using the library is advantageous beyond one processor. These benchmarks were executed on an Origin 2000 in multi-user mode. The result reveals two stages in terms of speedup. First, a good acceleration is due to the increase in the number of processors and in the cache memory: the size of the available cache memory increases with the number of processors. Then, speedup decreases, due to increasing communication and load balancing difficulties.
This first result is satisfactory but has yet to be improved. It was obtained with a relatively favourable example of network, close to biological models. Networks using the backpropagation learning algorithm obtain worse results, due to the sequential (layer by layer) nature of this algorithm.
VIII. Conclusion and perspectives
This library builds a bridge between the natural parallel semantics of neural networks and the parallel architecture of modern general purpose computers.
It facilitates the design of neural networks and allows their execution on shared memory MIMD parallel computers. The speedups obtained are interesting for a general purpose tool. We are currently working to decrease the sequential cost of our library and we are completing its dynamic version.
