In this article we present the implementation of an environment supporting Lévy's optimal reduction for the λ-calculus on parallel (or distributed) computing systems. In a similar approach to Lamping's, we base our work on a graph reduction technique, known as directed virtual reduction, which is actually a restriction of Danos-Regnier virtual reduction.
INTRODUCTION
Jean-Jacques Lévy formally characterized the meaning of the word optimal relative to a reduction strategy for λ-calculus, referring to it as the property that the strategy reaches the normal form (if it exists) and does not duplicate the work of reducing similar β-redexes [Lévy 1978 ].
This characterization was formalized in terms of families of redexes, that is, redexes with the same origin, possibly a virtual one in the sense that two families coming in a configuration producing a new redex originate a new family. Redexes belonging to different families cannot be successfully shared during reduction, whereas for two redexes in the same family, one could find an optimal strategy (i.e., reducing all of them in a single step).
Data structures suitable for an implementation of optimal reduction were presented much later [Lamping 1990]; the resulting reduction technique, introduced by J. Lamping and known as sharing reduction, relies on a set of graph rewriting rules.
In Gonthier et al. [1992] , Lamping's sharing reduction was proved a way to compute Girard's execution formula, which is an invariant of closed functional evaluation obtained from the "geometry of interaction" interpretation of λ-calculus [Girard 1989 ]. This result stirred research in the field of optimal reduction. Specifically in Danos and Regnier [1993] a graphical local calculus, namely virtual reduction (VR), was defined as a mechanism to perform optimal reduction by computing Girard's execution formula. Such a calculus was later refined in Danos et al. [1997] by the introduction of a new graph rewriting technique known as directed virtual reduction (DVR). The authors also defined a strategy to perform DVR, namely combustion, which simplifies the calculus and can simulate individual steps of sharing reduction.
In this article we describe a technique for the implementation of functional calculi. This technique exploits both the locality and asynchrony of computation which is typical in interaction nets [Lafont 1990 ] and derives from the fine decomposition of the λ-calculus β-rule obtained through the analysis provided by the geometry of interaction. Specifically, we present the implementation of a parallel environment for optimal lambda-calculus reduction (PELCR) which relies on DVR and on a new strategy to perform DVR that will be referred to as half-combustion (HC).
Let us stress that any interpreter of an ML-like functional language based on our technique ensures the execution of programs in a parallel (or distributed) environment in a way completely transparent to the user.
Actually, our work is related to a number of other results in the field of parallel implementation of functional programming via the geometry of interaction. However, it exhibits substantial differences from each. The work in Mackie [1997] discusses issues on the possibility of parallel implementations for Lafont's interaction nets. In that work Mackie faces problems of load balancing and fine-grain parallelism, and the proposed solution is a static analysis of the initial interaction net which aims at setting-up a favorable initial distribution of the nodes among processors. This work is related to optimal reduction, since optimal rules (e.g., in Gonthier et al. [1992] ) define an interaction system [Lafont 1990 ]. However, contrary to our work, it does not focus on optimal reduction. Another fundamental difference between our work and Mackie's study is that our approach is dynamic: Load distribution is decided at runtime and the message passing overhead is controlled dynamically as well. Close to our work are also the results in Pinto [2001b] about parallel implementation models for the lambda calculus, and in Pinto [2001a] about the parallel evaluation of interaction nets via the MPINE machine. The main differences with our proposal are that our parallel implementation completely avoids synchronized access to shared data among multiple threads, as instead imposed by MPINE, and allows runtime distribution of computational load associated with reduction steps in a way completely independent of the initial setup of the parallel reduction (i.e., the distribution of computational load for the reduction of a straight path in a virtual net does not need to be entirely evaluated by a prespecified thread). This is not achieved by the parallel model in Pinto [2001b] . In other words, compared to both these results, our implementation is more able to exploit parallelism while performing the reduction. 
Also, our method embeds a set of other optimizations that further improve runtime behavior, for example, memory performance at both the application level and the operating system level.
With respect to memory usage, we also note that the ability of PELCR to effectively distribute the computational load is a support for achieving memory tractability for all those terms where the management of control operators reaches an exponential size in the number of β-reductions [Asperti and Mairson 1998] , which might become intractable by sequential implementation.
Our parallel implementation has been developed in the C language using a standard interface, namely MPI, to support message-passing functionalities among the processes involved in the computation. These features make PELCR a highly portable software package that is easy to install on a wide set of (possibly heterogeneous) computing platforms.
We also report results of an experimental evaluation of our software package. The experimental data show that it is able to exploit any form of parallelism intrinsic to the reduction of λ-terms. As a result, we obtain up to 70-80% of the ideal speedup (that is, the ideal acceleration as compared to the sequential case) while performing λ-term reductions on last-generation multiprocessor systems. This in turn decreases the wall-clock time for the reduction from several tens of seconds to a few seconds. This points out PELCR's potential to cope with the response-time requirements of an interactive end-user, even for jobs that would require large computation time if executed in classical sequential fashion.
We analyze the problem (parallel implementation of functional calculi) from a pragmatic point-of-view, and the theory of (directed) virtual reduction is here considered mainly for how it can give rise to parallel dynamics. However, beyond recalling such a theory, we also propose a few optimization rules allowing an increase in the effectiveness of the DVR approach, which were taken into account while developing the implementation.
The remainder of the article is structured as follows. In Section 2 we recall DVR. The optimization rules for DVR are proposed in Section 3. In Section 4 the HC strategy for DVR is introduced. In Section 5 we describe our implementation. The experimental results are reported in Section 6.
FROM LAMBDA TERMS TO DIRECTED VIRTUAL REDUCTION
The preliminary step for our work is Danos and Regnier's construction of a confluent, local, and asynchronous reduction of λ-calculus, derived from a semantic setting based on a unique type of move (simple enough to be easily mechanized) [Danos and Regnier 1993]. Their graph reduction technique, namely virtual reduction (VR), can also be explained as an efficient way to compute Girard's execution formula. We recall that the execution formula associated with a term $T$ with free variables $\{x_1, \dots, x_n\}$ is given by the set of its border-to-border weighted straight paths in its dynamic graph $R_T$. Any node $p_i$ in the border is either associated with one free variable $x_i$ or, if $i = 0$, it represents the root of the term (see, e.g., Figure 1). A straight path in a directed graph is a path that never bounces back in the same edge. Specifically, the execution formula of $R_T$ collects the weights $W(\varphi_{ij})$ of these paths, where $W(\cdot)$ is a morphism from the involutive category of paths $P(R_T)$ to the monoid of the geometry of interaction, so that for any straight path $\varphi_{ij}$ from $p_i$ to $p_j$, $W(\varphi_{ij})$ is an element of that monoid.
The one and only reduction rule of VR is the composition of two edges in the graph, as described in Figure 2 . Whenever two edges of the virtual net are composable (i.e., the product of their weights is nonnull), VR derives from them a new edge. The original edges are then marked by the rest of the composition, here denoted by weight within brackets.
The algebraic mechanism corresponding to the rest is called the bar; it was introduced in Danos and Regnier [1993] to ensure the preservation of Girard's execution formula. Note that VR induces bars of bars by definition; this is shown in Figure 3 . DVR, presented in Danos et al. [1997] , was designed in order to avoid bars of bars, thus allowing any implementation to use simple data structures for representing edges.
In the remainder of this section we provide some additional details on the geometry of interaction (Section 2.1) and on VR (Section 2.2); then we will give a full presentation of DVR (Section 2.3). Such a background will be exploited for the description of some optimization rules for DVR that we propose in Section 3 and will also form the basis for understanding the parallel implementation. To ease comprehension of such a reduction technique and to make a more direct connection with Lamping's graphs, this section finally presents an encoding of such graphs into directed virtual nets (Section 2.4).
Algebraic Setting for the Geometry of Interaction
The basic step of the geometry of interaction is the introduction of a suitable algebraic structure in view of modeling the dynamics of the reduction. Starting from the first work on this subject [Girard 1989], where the algebraic setting was given in terms of C*-algebras, several different structures have been studied with different purposes. One purpose has been to provide a bridge between operational and denotational semantics. In this context we recall the work in Abramsky et al. [1994] on full abstraction, which also gave rise to the study of game semantics for programming languages, and the work in Mascari and Pedicini [1998] on partially additive categories, recently extended by Haghverdi and Scott [2004] to cope with the traced monoidal categories originally proposed by Joyal et al. [1996]. Another purpose has been constructing operational semantics for classical logic in Laurent [2001], subsequently addressed by Führmann and Pym [2004], also using traced monoidal categories.
In this work we consider the algebraic structure for the geometry of interaction proposed in Danos and Regnier [1993], which, as we already said, has been shown to provide the algebraic abstraction of the optimal reduction for lambda calculus. This structure can be thought of as the set of partial one-to-one maps $u$ with composition. The structure is then enriched with partial inverses $u^*$, the codomain operation $\langle u \rangle$, and the complementary of the codomain $[u]$. The axioms for such a structure are formally introduced next.
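Definition 2.1. An inverse monoid (see Petrich [1984]), or an im for short, is a monoid with a unary function, the star, denoted by $(\cdot)^*$, with
$$(uv)^* = v^*u^*, \qquad (u^*)^* = u, \qquad uu^*u = u, \qquad uu^*vv^* = vv^*uu^*.$$
We denote by $\langle u \rangle$ the idempotent $uu^*$. With this notation the last equation becomes $\langle u \rangle\langle v \rangle = \langle v \rangle\langle u \rangle$ and the one before becomes $\langle u \rangle u = u$.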
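As an illustration of the concrete model mentioned above (partial one-to-one maps with composition), the following minimal Python sketch represents such maps as dictionaries on a small finite set and checks a few inverse-monoid identities on sample values. The helper names `compose`, `star`, and `idem` are hypothetical and are not taken from the paper or from PELCR.

```python
# Partial one-to-one maps on a finite set, stored as dicts (illustrative model only).

def compose(u, v):
    """Return the map 'apply v first, then u', defined only where both steps are."""
    return {x: u[v[x]] for x in v if v[x] in u}

def star(u):
    """Partial inverse: swap arguments and values (u is assumed injective)."""
    return {y: x for x, y in u.items()}

def idem(u):
    """The idempotent u u*, i.e. the identity restricted to the codomain of u."""
    return compose(u, star(u))

u = {0: 1, 1: 2}          # defined on {0, 1}
v = {1: 1, 2: 3, 3: 0}    # defined on {1, 2, 3}

# u u* u = u
assert compose(idem(u), u) == u
# idempotents commute
assert compose(idem(u), idem(v)) == compose(idem(v), idem(u))
# the star reverses products
assert star(compose(u, v)) == compose(star(v), star(u))
```

The checks only exercise the identities on two sample maps; they are not a proof that the model satisfies the axioms in general.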
Definition 2.2. A bar inverse monoid, or a bim for short, is an im with a zero, denoted by 0, and a unary function, the bar, denoted by [.] , with
[1] = 0 and [0] = 1,
Bim's axioms entail:
( 
Now we give the construction of the free bim generated by a given im. Let S be an im, and Z[S] denote the free contracted algebra over S with coefficients in Z (the ring of integers). In other words, Z[S] is the algebra of maps from S to Z with finitely many nonzero values. Thus, Z[S] is the algebra of linear combinations over S with coefficients in Z.
For any such linear combination s = n i s i , define u, v ∈ S n , so that u v, u and [u] belong to S n and so to [S] ω ; for the same reason bim's axioms are satisfied.
Virtual Reduction
Definition 2.6. The monoid $L^*$ of the geometry of interaction is the free monoid with a morphism $!(\cdot)$, an involution $(\cdot)^*$ and a zero, generated by $p$, $q$, and a family $W = (w_i)_i$ of exponential generators such that for any $u \in L^*$
$$x^* y = \delta_{xy} \quad \text{for any generators } x, y, \qquad (8)$$
$$!(u)\, w_i = w_i\, !^{e_i}(u), \qquad (9)$$
where $\delta_{xy}$ is the Kronecker operator, $e_i$ is an integer associated with $w_i$ (called the lift of $w_i$), and $i$ is called the name of $w_i$ (we will often write $w_{i,e(i)}$ to explicitly note the lift of the generator). Eqs. (8) are called annihilation equations and Eqs. (9) are called swapping equations.
Orienting Eqs. (8-9) from left to right, one gets a rewriting system which is terminating and confluent. The nonzero normal forms, known as stable forms, are the terms $ab^*$, where $a$ and $b$ are positive (i.e., written without $*$s). The fact that all nonzero terms are equal to such an $ab^*$ form is referred to as the "$ab^*$ property." From this, one easily gets that the word problem is decidable and that $L^*$ is an inverse monoid.
Every computation from now on will take place in the bar closure of $L^*$ in $\mathbb{Z}[L^*]$, which we denote by $[L^*]^\omega$. Given that this is a bim, the results in Danos and Regnier [1993], which were stated and proved for any bar inverse monoid, are applicable with no further ado. Note that equalities in $[L^*]^\omega$ and in $\mathbb{Z}[L^*]$ are also decidable by rewriting to stable form. Most of the time, we will simply write φ for W(φ) to ease reading of definitions and proofs.
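The rewriting to stable form can be sketched on a simplified fragment. The Python sketch below is only an illustration under explicit assumptions: it handles words over plain generators (such as p and q) and their starred versions, it reads the annihilation rule as "a starred generator followed by an unstarred one rewrites to 1 if they are equal and to 0 otherwise", and it ignores the morphism !(.) and the swapping equations entirely. The function name `stable_form` and the word encoding are hypothetical.

```python
def stable_form(word):
    """Normalize a word given as a list of (generator, starred) pairs.

    Returns the normal form as a list in which all unstarred letters precede
    all starred ones, or None when the word rewrites to zero.
    """
    stack = []
    for gen, starred in word:
        if not starred and stack and stack[-1][1]:
            # Redex x* y: annihilate if x == y, otherwise the product is zero.
            if stack[-1][0] != gen:
                return None
            stack.pop()
        else:
            stack.append((gen, starred))
    return stack

# q* p* p q q  rewrites to  q
print(stable_form([("q", True), ("p", True), ("p", False), ("q", False), ("q", False)]))
# p* q rewrites to zero
print(stable_form([("p", True), ("q", False)]))
# p q* is already stable
print(stable_form([("p", False), ("q", True)]))
```

A single left-to-right pass with a stack suffices in this fragment because the only redexes are adjacent pairs made of a starred letter followed by an unstarred one.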
We will say that α coincides with β, or equivalently that α and β are coincident, if they have the same target node. An edge β is called a counter-edge of α along τ if $\beta \neq \alpha$ and τ is a directed path from α's target to β's target not ending with β, such that $\alpha^*\tau^*\beta \neq 0$ (see Figure 4).
Two coincident counter-edges α and β are said to be composable (i.e., $\alpha^*\beta \neq 0$, or equivalently they are reciprocally counter-edges along the empty path; this is the case of Figure 4 when τ is absent).
Definition 2.9. A straight path is a path that contains no subpath of the form $\varphi\varphi^*$, namely, one that never bounces back in the same edge.
A weighted directed graph is said to be split if any three coincident paths of length one $\varphi_1$, $\varphi_2$, and $\varphi_3$ are such that $\langle\varphi_1\rangle\langle\varphi_2\rangle\langle\varphi_3\rangle = 0$; it is said to be square-free if for any straight path $\varphi$, $\varphi\varphi = 0$.
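The "never bounces back" condition can be checked mechanically. In the minimal Python sketch below, a path is encoded (as an assumption made only for this example) as a list of steps, each step being an edge name together with a direction, +1 when the edge is traversed along its orientation and -1 when it is traversed against it. The function name `is_straight` is hypothetical.

```python
def is_straight(path):
    """Return True if no edge is immediately followed by its own reversal."""
    for (edge1, dir1), (edge2, dir2) in zip(path, path[1:]):
        if edge1 == edge2 and dir1 == -dir2:
            return False
    return True

# Along edge a, then against edge b: straight.
print(is_straight([("a", 1), ("b", -1)]))
# Along edge a, then immediately back against a: not straight.
print(is_straight([("a", 1), ("a", -1)]))
```

Checking consecutive steps is enough, since any subpath consisting of a path followed by its reversal contains an edge immediately followed by its own reversal at the turning point.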
Definition 2.10. A weighted directed graph is said to be a virtual net if it is split and square-free.
Splitness can be rephrased as follows: any three paths $\varphi_1$, $\varphi_2$, and $\varphi_3$, none of which is a prefix of another, are such that $\langle\varphi_1\rangle\langle\varphi_2\rangle\langle\varphi_3\rangle = 0$.
Directed Virtual Reduction
Definition 2.11. A directed virtual net $R$ is an acyclic virtual net such that for each edge α:
(A) the weight of α has the form $W(\alpha) = [b_1, \dots, b_n]a$, where $a, b_1, \dots, b_n$ are positive monomials of $L^*$. We will denote by $\alpha^+$ the weight of α without its filter $[b_1, \dots, b_n]$, that is, the monomial $a$.
(B) For any $i \neq j$ and for any two counter-edges $\beta_1$, $\beta_2$ of α along $\tau_1$, $\tau_2$:
Given two coincident counter-edges α and β with weights $[b_1, \dots, b_n]a$ and $[a_1, \dots, a_m]b$, DVR originates a new node, and two new edges linking that node to the sources of α and β. These new edges have, respectively, weights $b'$ and $a'$, where $a'b'^*$ is the stable form of $b^*a$; this is shown in Figure 5. Overall, condition (B) in Definition 2.11 states that monomials in the filter of the weight of an edge are pairwise orthogonal, and that the property of monomial orthogonality is preserved by DVR. Note that the new edges produced by a reduction step have positive weights, so the resulting computation of the execution formula is better suited to implementation than VR, because bars are not propagated on residuals.
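A single DVR step can be sketched as follows, under simplifying assumptions that are not part of the paper: weights are positive words over plain generators (tuples of names), filters and exponential generators are ignored, and the rule is read as "the new weights a' and b' are such that a' b'* is the stable form of b* a". In this fragment, computing the stable form amounts to cancelling the common prefix of a and b. The names `dvr_weights` and `dvr_step` and the edge encoding `(source, target, weight)` are hypothetical.

```python
def dvr_weights(a, b):
    """Return (b_res, a_res) with a_res b_res* the stable form of b* a, or None if null."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    if k < len(a) and k < len(b):
        return None              # the words diverge on a letter: the product is zero
    return b[k:], a[k:]

def dvr_step(alpha, beta, new_node):
    """Compose two coincident edges; return the two new edges, or [] if not composable."""
    (src_a, tgt_a, a), (src_b, tgt_b, b) = alpha, beta
    assert tgt_a == tgt_b, "edges must be coincident"
    res = dvr_weights(a, b)
    if res is None:
        return []
    b_res, a_res = res
    # New node linked to the sources of alpha and beta, respectively.
    return [(new_node, src_a, b_res), (new_node, src_b, a_res)]

alpha = ("s1", "v", ("p", "q"))
beta = ("s2", "v", ("p",))
print(dvr_step(alpha, beta, "v_new"))
```

In the example, `a = pq` and `b = p`, so `b* a` reduces to `q`; the edge towards the source of α gets the empty word (weight 1) and the edge towards the source of β gets `q`, which keeps the two paths from the new node to `v` equal in weight.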
Definition 2.12. Given two composable edges α and β, the two edges $\alpha'$ and $\beta'$ generated by one step of DVR are called residuals of α and β, respectively. We will denote the pair of residuals by dvr(α, β).
LEMMA 2.13 (AUGMENTATION). Let $R$ be a directed virtual net and $\gamma_2$ be a counter-edge of $\gamma_1$ along τ in $R$. Then
Using the augmentation lemma, Danos et al. [1997] showed that directed virtual nets with DVR form a complete calculus. They also proved that DVR is sound with respect to Girard's execution formula.
PROPOSITION 2.14 (INVARIANCE). The execution formula is an invariant of DVR.
Translation of Sharing Graphs into Directed Virtual Nets
In order to solve the problem of the pairing of duplication operators, Gonthier et al. [1992] added to the sharing graphs a local-level structure. Each operator is decorated with an integer tag that specifies the level at which it lives. Furthermore, in order to manage these levels a set of control operators is required.
More precisely, sharing graphs are nonoriented graphs built from the indexed nodes represented in Figure 6. These nodes are called sharing operators and are divided into two groups. The first group includes the operators in Lamping's original work: application and abstraction. The second consists of a family of nodes of the same kind (so-called muxes, which act as the composition of binary fan nodes and control operators) according to the following definition.
Definition 2.15. A mux node, or multiplexer, is a node with an arbitrary number of premises, each having a name $n$ and a lift $l_n$; like other nodes, muxes have a level index $i$.
The translation of a sharing graph with muxes is defined by induction.
Definition 2.16. A sharing graph $M$ with root $x$ and context $y_1, \dots, y_n$ is translated into a directed virtual net in the following way:
where bullets indicate ports of a sharing graph.
-If $x$ is a link between two ports with no node:
-If $x$ is a port of an abstraction:
-If $x$ is a port of an application:
-If $y_k$ is a port of a mux with $m$ premises $y_1, \dots, y_m$, each one of lift $e_i$:

When the nodes introduced by the translation present the configuration described in the next definition, they are reduced by amalgamating edges as in Figure 7.
Definition 2.17. A node with $n$ coincident edges $\alpha_1, \dots, \alpha_n$ and an edge β whose source is the target of the $\alpha_i$'s is erased, and all the $\alpha_i$'s are replaced by edges $\alpha_i'$, where the source of $\alpha_i'$ is the source of $\alpha_i$, the target of $\alpha_i'$ is the target of β, and the weight of $\alpha_i'$ is $\alpha_i\beta$.
With the help of an example, we show how to change a λ-term into a directed virtual net. In Figure 9, starting from the syntactic graph of the λ-term representing the Church numeral 2 applied to the identity, that is, (λf x. f (f (x)))(λx.x), we obtain a sharing graph by adding the control operators expressed in the multiplexer syntax, and annotating each node with level indices.
Then edges are oriented, unfolded, and labeled with monomials in $L^*$ in accordance with the rules expressed by Definitions 2.16 and 2.17. The last step consists of grouping together arrows going in the same direction. The result of this operation is a directed virtual net; see Figure 9. Let us check that the obtained net is indeed a directed virtual net:
-It is obviously a directed graph with no circuits;
-square-freeness can be proved by induction on the translation of λ-terms; and
-splitness has to be verified for all triples of coincident edges; as an example we explicitly test the splitness condition for three coincident edges pw 2,1=pw 2,1 w 2,1 p= qpw 2,1 w 2,1 p= 0 because p q = 0; the rest of this verification is left to the reader.
As a final observation, once the original virtual net is reduced, the resulting lambda term in normal form can be obtained through a so-called read-back procedure acting on the reduced net, such as the one described for the case of sharing graphs in Asperti and Guerrini [1998, Chap. 11] .
OPTIMIZATION RULES FOR DVR
In this section, in order to make DVR more effective, we prove two properties which are also exploited while developing the implementation. These properties are immediate consequences of the orthogonality conditions satisfied by virtual nets and help to gain effectiveness in the computation and to increase the intrinsic parallelism.
Definition 3.1. Given a directed virtual net $R$, a total edge α is any edge with at most one counter-edge β, that is, β is the only edge such that $\alpha^*\tau_{\alpha\beta}^*\beta \neq 0$.
In this case we say that α is total with respect to β; if α has no counter-edge, it is called a ghost.
This relation is not symmetric: for any two coincident edges α and β such that α is total with respect to β, it is possible that β is not total with respect to α. In fact, suppose β = 1; then any γ coincident with β is composable with β (thus β cannot be total with respect to α), but α is not composable with γ, since otherwise the splitness condition would be contradicted; thus α is total with respect to β.

PROPOSITION 3.2. Given two composable edges α and β such that dvr(α, β) = (1, $\alpha'$), then α is total with respect to β.
PROOF. In order to get a contradiction, suppose that there exists a counteredge γ of α along the directed path τ . Suppose τ is the empty path, therefore γ coincides with α and α γ = 0 so that dvr(α, γ ) = (γ , α ); in this case we compute γ α β and have γ γ αα ββ = γ α γ α β = 0, since it is a stable form, and get a contradiction with the splitness condition.
If τ is not the empty path, we can apply the same argument to the residual γ of the reduction sequence along the directed path τ , and derive the property by Lemma 2.13 applied to α and γ . PROOF. Suppose γ β = 0 so this product has a stable form, let, say,β γ , then γ α β = γ γ αα ββ = γ α γ β β = γ α β γ β = 0 and we get a contradiction with the splitness condition. PROOF. The two residuals in the source of α have weight 1 so that they are composable and this is a contradiction to Proposition 3.3.
This corollary allows an optimization rule; indeed, the configuration produced by the DVR step dvr(α, β) = (1, $\alpha'$) acts as a compound operator: the edge with weight 1 is there just to say that all coincident edges have to be transferred to the source of the edge $\alpha'$, so we propose to transform this configuration by removing the edge β with weight 1 and using the edge $\alpha'$ to link the target of β and the target of α (see Figure 10).

Now we will prove another property of DVR, which states that when two edges $\alpha_1'$ and $\alpha_2'$ are residuals of the directed virtual reduction of $\alpha_1$ and $\alpha_2$ against the same edge β, they are coincident on the source of β (evident by definition of a DVR step), but not composable because of splitness.

PROPOSITION 3.5. Given an edge β composable with two coincident edges $\alpha_1$ and $\alpha_2$, we have that $\alpha_1'^*\alpha_2' = 0$, where $\alpha_1'$ is the residual of $\alpha_1$ and $\alpha_2'$ is the residual of $\alpha_2$.
PROOF. Suppose α 1 α 2 = 0, then we have α 1 α 1 α 2 α 2 and so α 1 α 2 = α 2 α 1 = 0.
By augmentation Lemma 2.13, we have
and by reduction we obtain α 1 β α 2 α 1 β α 2 and so this is a nonnull stable form, so that it is different from zero.

Property 3.5 allows the implementation of another optimization rule. Specifically, we know that every new node $v$ created after a DVR step is the source of only two edges, say $\beta_1$ and $\beta_2$; therefore all the edges coincident in $v$ can be separated into two sets: the residuals of a DVR step involving $\beta_1$ and the residuals of a DVR step involving $\beta_2$. Each edge in a set is orthogonal to any edge in the same set; therefore there is no need to perform DVR steps between edges belonging to the same set, since the composition will actually produce a null result.
HALF-COMBUSTION STRATEGY
In Danos et al. [1997], a strategy called combustion is presented in order to organize DVR in such a way that no filter must be kept. This strategy works on full directed virtual nets, which are directed virtual nets where each edge is either a ghost (see Definition 3.1) or has a positive weight.
Since a ghost edge is an edge for which no more compositions will occur, the sources of ghost edges never receive residual edges of ghost edges, thus let us define the (out-)valence of a node as the number of nonghost edges having that node as the source.
The combustion strategy of a full net starts from a node v of valence zero (i.e., with no future incoming edge, or equivalently having only ghost outgoing edges) and composes all the pairs of coincident counter-edges on v as an atomic action. Using the combustion strategy, we can give up filters because after the composition is performed, all these edges become ghost edges.
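The selection of nodes on which combustion may fire can be sketched as follows. The encoding is an assumption made for illustration only: each edge is a triple `(source, target, is_ghost)`, and the function names `out_valence` and `combustible_nodes` are hypothetical.

```python
def out_valence(node, edges):
    """Number of nonghost edges having `node` as their source."""
    return sum(1 for source, _target, is_ghost in edges
               if source == node and not is_ghost)

def combustible_nodes(nodes, edges):
    """Nodes of valence zero, i.e. candidates for a combustion step."""
    return [n for n in nodes if out_valence(n, edges) == 0]

nodes = ["v0", "v1", "v2"]
edges = [
    ("v1", "v0", False),  # nonghost edge leaving v1
    ("v2", "v0", True),   # ghost edge leaving v2
    ("v2", "v1", True),   # ghost edge leaving v2
]
print(combustible_nodes(nodes, edges))
```

In this small example `v0` has no outgoing edges and `v2` has only ghost outgoing edges, so both have valence zero, while `v1` must wait.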
From the point-of-view of a parallel implementation, the drawback of this strategy is that composition of coincident counter-edges can only be started when a node becomes of valence zero. More specifically, in the case where many processes are used to perform DVR (recall this is desirable anytime we want to fully exploit the computing power of parallel or distributed systems equipped with a large number of processors), we might incur the risk that at a given time instant, only a subset of these processes host nodes of valence zero. In such a case, all other processes cannot simultaneously proceed with DVR steps (i.e., they need to wait until some node they host becomes of valence zero), thus limiting the degree of parallelism while performing the reduction.
In what follows we define the HC strategy which, like combustion, does not require keeping filters and, in addition, allows the composition to be performed even on nodes having valence greater than zero, thus allowing a high degree of parallelism. Given that, as discussed in Pedicini [1999], the combustion strategy strictly simulates the full algorithm for Lamping's sharing reduction proposed in Gonthier et al. [1992], the HC strategy we propose is actually a way to perform such a reduction in a less synchronous fashion.
HC relies on the following notion of a semifull directed virtual net which is a generalization of the notion of a full directed virtual net.
Let us call a semifull directed virtual net a directed virtual net in which each edge either is weighted by a positive monomial (i.e., its weight has no filter) or all of its coincident counter-edges are weighted by a positive monomial (i.e., it can be composed exclusively with edges having a positive weight). An example of a node in a semifull directed virtual net is shown in Figure 11. In this example, the coincident counter-edges of edges with weight $[a_{i1}, \dots, a_{ij_i}]b_i$ are among those edges weighted with $a_1, a_2, \dots, a_m$.
Next we give the definition and provide the soundness of the HC strategy.
Definition 4.1. Given a composable edge α with positive weight in a semifull directed virtual net R, we have to consider two cases:
(1) If α has no nonpositive coincident counter-edge and a positive one β, then the half-combustion strategy (HC) performs the composition of β with α and possibly with every nonpositive edge composable with β; or
(2) if the set $\{\beta_1, \dots, \beta_n\}$ of nonpositive edges composable with α is nonempty, then HC performs all possible compositions of α with the $\beta_i$'s.
PROPOSITION 4.2. If $R'$ is obtained from the directed virtual net $R$ by the HC strategy and $R$ is semifull, then so is $R'$.
PROOF. Consider an edge α having positive weight $a$ as in Definition 4.1, and suppose we are in the second case of the definition, that is, all the composable edges with nonpositive weights coincident with α are the $\beta_i$'s with weights $[a_{i1}, \dots, a_{ij_i}]b_i$ for $i = 1, \dots, n$, as in Figure 12.
If we apply a step of the HC strategy by performing a DVR step between α and $\beta_i$ for $1 \le i \le n$, the set of coincident filtered edges is enlarged with α, but α is no longer composable with the $\beta_i$'s because of its filter. Also, all the generated edges $\alpha_i'$ and $\beta_i'$ have positive weights (by the definition of DVR). As a consequence, all coincident filtered edges (including α) are not composable with each other. Thus the obtained directed virtual net is semifull.
If we stay in the first case of Definition 4.1, performing the composition of β with all its nonpositive coincident counter-edges, we obtain the same configuration as in the previous case. Moreover, we compose α with (the residual of) β and so all the nonpositive edges incident in the node are not further composable. Note that the set of nonpositive edges composable with β can possibly be empty, in which case HC just composes α and β.
We recall that the translation presented in Section 2.4 associates with any λ-term a full directed virtual net [Danos and Regnier 1993; Gonthier et al. 1992] . As full nets are particular instances of semifull ones, HC actually represents a reduction mechanism for λ-calculus.
Beyond the exploitation of parallelism, another interesting property of HC is that we can separate the edges ending on a node into two sets. In other words, the strategy associates a mark with each edge: either incoming or combusted. When created, edges are marked as incoming. One step of reduction consists of picking an incoming edge α and performing all the compositions with coincident combusted edges. Then α is marked as combusted.
Note that an edge may be marked as combusted even when it has a positive weight, namely, if all the combusted edges coincident with α are not composable with α, particularly when the set of combusted edges coincident with α is empty as in case (1) of Definition 4.1 (i.e., its weight belongs to $L^*$ and this allows discarding the management of filters with benefits for any implementation). On the other hand, at any step any incoming edge has a positive weight. As an edge is marked combusted only after having been (successfully or not) composed with every coincident combusted edge, one easily sees that two combusted edges are never composable. This suggests that we can organize the computation such that the only meaning associated with filters is in regard to whether an edge belongs to the first or second set (thus, like in the combustion strategy, filters can be actually discarded). We have embedded this simplification among others in the parallel implementation we present in the next section.
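The incoming/combusted bookkeeping described above can be sketched for a single node. This is a minimal illustration under stated assumptions, not the PELCR code: weights are positive words over plain generators (tuples of names), composition is the prefix-cancelling rule for this simplified fragment, and the names `residual_weights`, `Node`, and `hc_step` are hypothetical.

```python
from collections import deque
from itertools import count

def residual_weights(a, b):
    """Cancel the common prefix of two positive words; None means a null product."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    if k < len(a) and k < len(b):
        return None
    return b[k:], a[k:]

class Node:
    def __init__(self):
        self.incoming = deque()   # (source, weight) pairs not yet processed
        self.combusted = []       # pairs already tried against all earlier edges

    def hc_step(self, fresh_ids):
        """Pick one incoming edge, compose it with every combusted edge,
        then mark it as combusted. Returns new edges (source, target, weight)."""
        src_a, a = self.incoming.popleft()
        new_edges = []
        for src_b, b in self.combusted:
            res = residual_weights(a, b)
            if res is None:
                continue          # unsuccessful composition, nothing is created
            b_res, a_res = res
            v = next(fresh_ids)   # identifier of the newly created node
            new_edges.append((v, src_a, b_res))
            new_edges.append((v, src_b, a_res))
        self.combusted.append((src_a, a))
        return new_edges

node = Node()
node.incoming.extend([("s1", ("p", "q")), ("s2", ("p",)), ("s3", ("q",))])
ids = count(100)
steps = [node.hc_step(ids) for _ in range(3)]
print(steps)
```

Only the second step produces edges here, because the edge from `s2` is composable with the already combusted edge from `s1`, while the edge from `s3` is orthogonal to both. No filter is stored at any point: membership in `combusted` plays that role.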
THE IMPLEMENTATION
This section is devoted to describing the implementation of PELCR and is organized as follows. We first provide the outline of the data structures we have used and the high-level description of the parallel program. Then we add details on each aspect and/or optimization characterizing the implementation. The material presented in this section mostly describes the implementation at an abstraction level independent of the specific language and the specific message-passing layer used to develop it. However, some details of the specific implementation relying on the C language and MPI are provided to better outline the relation between the theoretical framework (based on DVR and HC) and the implementation itself.
Data Structures and Code Organization
Each processor i of the architecture hosting PELCR runs a process P i which is an instance of the executable code associated with the parallel program. We assume there is a master process, which for the sake of clarity will be identified as P 0 . All other processes will be referred to as slave processes. Processes communicate exclusively by exchanging messages, and the communication channels among processes are assumed to be FIFO (this is not a limitation, as the most widely used message-passing layers, such as PVM or MPI, actually provide the FIFO property to communication channels). We call pending any message already stored in the communication channel which has not yet been received by the recipient process.
We associate with each node v an identifier, namely, id (v). Each edge e = (v 1 , v 2 ) is therefore associated with the pair of node identifiers (id (v 1 ), id (v 2 )), thus the weighted edge is represented by the triple (id (v 1 ), id (v 2 ), W (e)).
As discussed in Section 3, by Property 3.5 any edge e incident on a node v can be seen as belonging to one of two distinct sets, depending on which of the two edges having v as the source originated e through composition. We call the two sets of edges the left set and right set, and we associate with each edge e additional information, namely Side(e), indicating whether e belongs to the left or right set. This information allows us to reduce the number of edge compositions that must be performed during the computation under the HC strategy. Specifically, given two edges e and e′ incident on the same node v, if the two edges have the same side, no composition involving e and e′ needs to be performed at all, since we know a priori that it will produce a null result. On the other hand, if Side(e) ≠ Side(e′), composition must be performed to determine the result, which can be either null or nonnull.
In the general case, each process P i hosts only a subset of the nodes of the graph. Therefore, given an edge e = (v 1 , v 2 ), there is the possibility that v 1 and v 2 are hosted by distinct processes. Figure 14 shows an example. The interesting point in the example is that when process P i performs the composition between the edges e 1 and e 2 incident on node v according to HC, then a new node, namely v′, is originated together with two new edges, namely e 3 and e 4 , incident on nodes v 1 and v 2 , respectively. The new node v′ can be hosted by any process, and process P i is the one which establishes where v′ must be actually located; in our example, P i selects P j . We will come back to the selection issue in Section 5.3 when describing the load-balancing module that establishes how new nodes must be distributed among processes. Note that in the case where one of the newly produced edges should have weight 1, the "optimization of one" rule described in Section 3 (see Figure 10) avoids the actual creation of that edge. Also, the only edge really created has as the source a node already within the directed virtual net, thus no new node needs to be created and addressed to some process.
In our implementation id(v′) is a triple [t, P i , P j ], where P i is the process that creates the node v′, P j is the process hosting that node, and t is a timestamp value assigned by P i . The timestamp is managed by P i as follows: it is initialized to zero and, anytime P i originates a new node, it is increased by one. Process identifiers are treated as int values, while the timestamp is represented as a long.
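As an illustration, the identifier scheme just described could be sketched in C as follows. Only the field types (int for process identifiers, long for the timestamp) come from the text; the names node_id, id_generator, and new_node_id are hypothetical, and we assume the timestamp is assigned after the counter has been increased.

```c
#include <assert.h>

/* Node identifier [t, P_i, P_j]. Names are illustrative, not PELCR's. */
typedef struct {
    long timestamp; /* t: value of the creator's local counter */
    int  creator;   /* P_i: process that originated the node   */
    int  host;      /* P_j: process that hosts the node        */
} node_id;

/* Per-process state used to generate fresh identifiers. */
typedef struct {
    int  self;      /* identifier of the owning process        */
    long counter;   /* local timestamp, initialized to zero    */
} id_generator;

static void id_generator_init(id_generator *g, int self)
{
    g->self = self;
    g->counter = 0;
}

/* Originate an identifier for a new node to be hosted by `host`.
 * Assumption: the counter is increased first, then assigned. */
static node_id new_node_id(id_generator *g, int host)
{
    node_id id;
    assert(host >= 0);
    g->counter += 1;
    id.timestamp = g->counter;
    id.creator = g->self;
    id.host = host;
    return id;
}

static int node_id_equal(node_id a, node_id b)
{
    return a.timestamp == b.timestamp
        && a.creator == b.creator
        && a.host == b.host;
}
```

Since the timestamp is local to the creator, two identifiers generated by different processes may carry the same t; they remain distinct because the creator field differs.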
As for the weight W (e) of the edge e, representing a monomial in L (see Definition 2.6), we have used a string representation. The composition rules between two edges in Definition 2.6, which are specific to the geometry of interaction, have been implemented by relying on standard C libraries for string manipulation. Through such library functions, edge composition is implemented as an operation with linear cost in the length of the strings associated with the weights of the two edges being composed. The information Side(e), related to the side (left or right) of the edge e, is stored as an int.
When the new node v′ is originated by P i , the creation must be conveyed to P j . Furthermore, both P h and P k must be notified of the new edges e 3 and e 4 incident, respectively, on v 1 and v 2 . In our implementation we use message exchange only for notification of new edges, while we avoid explicitly notifying P j of the creation of the new node v′. Process P j will actually create the node v′ upon receipt of the first message reporting a new edge incident on v′. We will refer to this type of node creation as delayed creation. It reduces the number of notification messages exchanged among processes.
Applying the delayed creation technique to the example in Figure 14 means that node v 1 is created by P h only upon receipt of the message carrying information of the edge e 3 incident on v 1 (recall that this message is sent by P i ). Similarly, P k will create v 2 only upon receipt of the notification message for the edge e 4 (this message is also sent by P i ).
By the previous considerations, we get that any message exchanged between two processes carries the information of a new edge. Specifically, a message carrying the information associated with the edge e = (v x , v y ) has a payload consisting of id (v x ), id (v y ), W (e), and Side(e), where id (v x ) and id (v y ) are the identifiers of the source and target nodes, W (e) is the weight of e, and Side(e) is the edge side.
P i tracks the information related to local nodes in a list nodes i . Any element in nodes i has a compound structure, and the list is implemented as a classical C linked list with dynamic memory allocation for the entries. In the remainder of the article, we identify the structure in nodes i associated with a node v as nodes i (v). As a relevant field of the structure nodes i (v) we have another list, namely nodes i (v).combusted, containing the edges incident on the node v which have already been composed (i.e., combusted edges of the HC strategy).
The list nodes i (v).combusted is partitioned into two sublists containing edges having Side() equal to left and right, respectively, namely nodes i (v).combusted.left and nodes i (v).combusted.right.
A buffer incoming i associated with P i is used to store received messages. As explained earlier, any message stored in incoming i carries information related to a new edge which must be added to the virtual net and composed with already combusted edges (if any) incident on the same node. Such an edge is actually an incoming edge of the HC strategy. Therefore, the buffer incoming i represents a kind of work list for process P i , as, according to HC, any incoming edge associated with a message stored in incoming i requires P i to compose it with all the already combusted edges incident on the same node. Performing such a composition represents the work associated with the message carrying the edge.
In our implementation, when an edge e having the node v as the target is extracted from incoming i , the virtual address of the structure nodes i (v), if already existing, is retrieved efficiently through a hashing mechanism with chaining for handling collisions. Hence, composing the edge e requires a scan operation of the list nodes i (v).combusted related to the target node.
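A minimal sketch of this lookup, combined with the delayed creation technique, is given below. The table size, the hash function, and all type and function names are assumptions made for illustration; the text only states that hashing with chaining is used and that combusted edges are kept in left and right sublists. Because every node in the table is hosted locally, the sketch hashes and compares only the timestamp and creator fields of the identifier.

```c
#include <assert.h>
#include <stdlib.h>

#define NBUCKETS 1024                 /* hypothetical table size */

typedef struct { long timestamp; int creator; int host; } node_id;

typedef struct edge_rec {             /* one combusted edge */
    node_id source;
    struct edge_rec *next;
} edge_rec;

typedef struct node_rec {             /* plays the role of nodes_i(v) */
    node_id id;
    edge_rec *combusted_left;         /* nodes_i(v).combusted.left  */
    edge_rec *combusted_right;        /* nodes_i(v).combusted.right */
    struct node_rec *chain;           /* next entry in the same bucket */
} node_rec;

typedef struct {
    node_rec *bucket[NBUCKETS];
    size_t count;
} node_table;

/* Illustrative hash function; the one used by PELCR is not specified. */
static size_t hash_id(node_id id)
{
    unsigned long h = (unsigned long)id.timestamp * 31UL
                    + (unsigned long)id.creator;
    return (size_t)(h % NBUCKETS);
}

/* Return the structure for `id`, or NULL if the node does not exist yet. */
static node_rec *node_lookup(const node_table *t, node_id id)
{
    node_rec *n;
    assert(t != NULL);
    for (n = t->bucket[hash_id(id)]; n != NULL; n = n->chain)
        if (n->id.timestamp == id.timestamp && n->id.creator == id.creator)
            return n;
    return NULL;
}

/* Delayed creation: the node is created by its host upon receipt of the
 * first message reporting an edge incident on it. Returns NULL only if
 * memory allocation fails. */
static node_rec *node_get_or_create(node_table *t, node_id id)
{
    node_rec *n = node_lookup(t, id);
    size_t b;
    if (n != NULL)
        return n;
    n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    n->id = id;
    b = hash_id(id);
    n->chain = t->bucket[b];          /* push on the bucket's chain */
    t->bucket[b] = n;
    t->count++;
    return n;
}
```

With this layout, composing an incoming edge amounts to one table lookup followed by a scan of the sublist on the opposite side of the target node.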
For each process P i , except for the master process P 0 , both incoming i and nodes i are initially empty, meaning that initially there is no node of the directed net managed by P i , nor are there incoming edges for it. Instead, P 0 is such that its list nodes 0 is empty but its buffer incoming 0 contains a set of messages, one for each initial edge of the virtual net (recall that the initial edges are all incoming). Note that this does not mean P 0 is a bottleneck for the parallel execution, thanks to the load-balancing mechanism we have implemented (see Section 5.3).
In Figure 15 we show the high-level structure of the algorithm implemented by the software modules we have developed. Before entering the pseudocode description, we recall that the HC strategy is such that any incoming edge, of which process P i becomes aware by extracting the corresponding message from incoming i , must be immediately composed with the preexisting edges incident on the same node without additional delay. Furthermore, given a message m carrying the information of a new edge e = (v 1 , v 2 ), we denote as m.target the node identified by the information id (v 2 ) carried by m (recall that id (v 2 ) is the previously described triple), and as m.source the node identified by the information id (v 1 ) carried by the same message. Also e.target and e.source have similar meanings when referring to an edge e. Furthermore, we denote as e m the edge carried by m.
The procedure initialize() sets the initial values for all data structures. The procedure empty() checks whether the buffer storing received messages is empty. In the positive case, process P i has no work to be performed, thus it invokes the procedure check_termination() to check if the computation has actually ended, that is, no message will arrive (in Section 5.4 we will provide details on how the detection of the termination is implemented). In the negative case, it extracts a message from the buffer incoming i and performs the composition of the corresponding incoming edge.
Through the test in line 9, we exploit information about Side(e m ) to avoid unnecessary edge compositions. The pseudocode structure also points out that process P i checks for the presence of pending messages only when incoming i is empty (i.e., when P i has no more work to be performed unless new pending messages carry it). This behavior aims at reducing the communication overhead. Specifically, a procedure to check whether there are pending messages is typically realized by using probe functions supported by the underlying communication layer. P i invokes a probe function to test whether there is at least one pending message. If there is at least one such message, then a recv procedure is executed to receive that message and store it into incoming i . As pointed out in other contexts [Dickens et al. 1996], probe functions may be expensive, therefore they should be executed only when a further delay could actually produce negative effects on performance. In the general case, delaying the probe call until all messages stored in incoming i have been processed should not produce negative effects. This is the reason why, in the general case, we suggest performing the probe call only when incoming i becomes empty. However, we noted that, depending on the particular hardware/software architecture and on the adopted message-passing layer, excessive delays in receiving pending messages could impact negatively on the performance of the communication layer due to buffer saturation. This is the case we have observed for our MPI-based implementation. For this reason, we have slightly modified the general code structure in Figure 15 in order to avoid excessively infrequent probe calls (and message receipts).
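The effect of the side test on the amount of work can be illustrated with the following single-process sketch. It is not the pseudocode of Figure 15: message receipt, edge weights, and the actual composition are omitted, the loop simply stops when the work list is empty (where the real program would invoke check_termination()), and all names are hypothetical. The sketch only counts how many compositions are attempted when an incoming edge is composed exclusively with combusted edges of the opposite side.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16                  /* hypothetical bound on local nodes */

enum side { LEFT = 0, RIGHT = 1 };

/* An application message, reduced to what the sketch needs. */
typedef struct {
    int target;                       /* index of the local target node */
    enum side side;                   /* Side(e_m) */
} message;

typedef struct {
    long combusted[MAX_NODES][2];     /* combusted edges per node and side */
    long attempts;                    /* compositions attempted so far */
} process_state;

/* Handle one incoming edge: compose it with the combusted edges on the
 * opposite side only, then mark it as combusted. */
static void combust(process_state *p, message m)
{
    enum side other = (m.side == LEFT) ? RIGHT : LEFT;
    assert(m.target >= 0 && m.target < MAX_NODES);
    p->attempts += p->combusted[m.target][other];
    p->combusted[m.target][m.side] += 1;
}

/* Drain the work list; returns the number of attempted compositions. */
static long run(process_state *p, const message *incoming, size_t n)
{
    size_t head = 0;
    for (;;) {
        if (head == n)                /* empty(): no local work left */
            break;                    /* stands in for check_termination() */
        combust(p, incoming[head]);
        head++;
    }
    return p->attempts;
}
```

For instance, for five incoming edges targeting node 0 (left, left, right), node 1 (right), and node 0 again (right), the sketch attempts four compositions, whereas composing each incoming edge with every combusted edge on the same node would attempt six.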
Beyond the overhead due to probe calls, another important issue is the overhead related to send and receive operations. A solution to bound this overhead will be discussed in the following subsection. Then we will present the policy we have selected for balancing the load among processes and other aspects related to implementation.
Message Aggregation
The cost of sending and receiving a physical message, paid for in part by the sender and in part by the receiver, can be divided into two components: (i) an overhead that is independent of message size, namely oh; and (ii) a cost that varies with the size of the message, namely s × oh b , where s is the size (in bytes) of the message and oh b is the send/receive time per byte. Typically, oh includes the context switch to the kernel, buffer reservation time, the time to pack/unpack the message, and, in the case of distributed memory systems, the time to set-up the physical network path, whereas oh b takes into account any cost that scales with the size of the message.
Usually, oh is higher than oh b (as shown in Xu and Hwang [1996], up to two orders of magnitude), thus it is usually more efficient to deliver several information units (i.e., more than one application message) with a single physical message, so that a single pair of send/receive operations suffices to deliver a large amount of data to the recipient process. This reduces the static overhead oh paid for each information unit, thus enabling efficient parallel executions, especially in the case of fine-grain computations like DVR. As an example, if three application messages of size s constitute the payload of a single physical message, then the cost to send and receive these application messages is reduced from 3oh + 3s × oh b to oh + 3s × oh b .
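The cost model above can be written down directly. In the following sketch the function names are hypothetical and the numeric values used in the usage note are arbitrary; only the two formulas come from the text.

```c
#include <assert.h>

/* Cost of k application messages of s bytes each, sent one per
 * physical message: k * (oh + s * oh_b). */
static double cost_separate(int k, double s, double oh, double oh_b)
{
    assert(k >= 0);
    return k * (oh + s * oh_b);
}

/* Cost of the same k application messages aggregated into a single
 * physical message: oh + k * s * oh_b. */
static double cost_aggregated(int k, double s, double oh, double oh_b)
{
    assert(k >= 1);
    return oh + k * s * oh_b;
}
```

With k = 3, s = 100, oh = 50, and oh b = 0.5 (in arbitrary time units), the separate cost is 300 and the aggregated cost is 200; the saving is (k − 1) × oh = 100, independent of the message size.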
We present in what follows the optimization we have embedded in communication modules via the aggregation of application messages in a single physical message. Each process P i collects the application messages destined to the same remote process P j into an aggregation buffer out_buff i, j . Therefore, each process P i maintains a set of aggregation buffers, each associated with a remote process. Application messages are aggregated and infrequently sent to the destination process via a single physical message. The higher the number of application messages aggregated, the greater the reduction of static communication cost per application message; we call this positive effect aggregation gain (AG). However, the aforementioned simple model for communication cost ignores the effects of delaying application messages on the recipient process. More precisely, there exists the risk that the delay produces idle times on the remote processes which have already ended their work and are therefore waiting for messages carrying new work to be performed; we call this negative effect aggregation loss (AL). Therefore, establishing a well-suited value for the aggregation window (defined as the number of application messages sent via the same physical message) is not a simple task.
In our implementation, the module controlling the aggregation keeps an age estimate for each aggregation buffer out_buff i, j by periodically incrementing a local counter c i, j . The value of c i, j is initialized to zero and reset each time the application messages that are aggregated in the buffer are sent. At the end of the composition phase for an incoming edge extracted from the local work list incoming i , c i, j is increased by one if at least one message is currently stored in the aggregation buffer out_buff i, j . Therefore, one tick of the age counter is equal to the average combustion time of an incoming edge, and the counter value represents the age of the oldest message stored in the aggregation buffer.
The simplest way to use these counters is to send an aggregate when its associated counter reaches a fixed value, referred to as the maximum age for the aggregate, or when the work list of the process is empty. In the latter case there is no need to delay the aggregate any longer, as the probability of putting more application messages into it in a short amount of time is quite small, so the delay will not increase AG and will possibly produce an increase of AL. We will refer to this policy as fixed-age-based (FAB). Although this policy is simple to implement and does not require any monitoring for tuning the maximum age over which the aggregate must be sent, it may be ineffective whenever a bad selection of the maximum age value is made.
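A minimal sketch of the age counter and of the FAB sending rule follows. The structure and function names are hypothetical, and the buffer is reduced to a count of stored application messages; the actual payload handling is omitted.

```c
#include <assert.h>

/* State of one aggregation buffer, reduced to what FAB needs. */
typedef struct {
    int pending;   /* application messages currently in the buffer  */
    int age;       /* age counter c_ij                              */
    int max_age;   /* fixed maximum age of the aggregate            */
} agg_buffer;

/* Store one more application message in the buffer. */
static void agg_add(agg_buffer *b)
{
    b->pending++;
}

/* Called at the end of the composition phase of an incoming edge:
 * the age grows only if the buffer is not empty. */
static void agg_tick(agg_buffer *b)
{
    if (b->pending > 0)
        b->age++;
}

/* FAB rule: send when the maximum age is reached or when the local
 * work list is empty (never send an empty aggregate). */
static int agg_should_send(const agg_buffer *b, int worklist_empty)
{
    return b->pending > 0 && (b->age >= b->max_age || worklist_empty);
}

/* Send the aggregate: returns its size and resets the counter. */
static int agg_flush(agg_buffer *b)
{
    int n = b->pending;
    assert(n > 0);
    b->pending = 0;
    b->age = 0;
    return n;
}
```

In a process loop, agg_tick would be invoked for every aggregation buffer after each incoming edge has been handled, followed by agg_should_send to decide whether the corresponding physical message must be issued.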
To overcome this problem we have implemented a variable-age-based (VAB) policy which is an extension of FAB, having similarities with an aggregation technique presented in Chetlur et al. [1998] for communication modules supporting fine-grain parallel discrete event simulations. In VAB, anytime that the messages aggregated in out_buff i, j are sent, the message rate achieved by the aggregate is calculated. This rate is used to determine the ideal maximum age for the next aggregate. Dynamically changing the maximum age after which an aggregate must be sent allows the aggregation policy to adapt its behavior to that of the overlying application.
To implement VAB, P i is required to maintain an estimate est i, j of the expected arrival rate in each aggregation buffer out_buff i, j , that is, an estimate of the speed at which out_buff i, j is filled (the higher such a rate, the higher the AG for that buffer). This estimate is initialized to zero and then updated by using statistics related to a temporal window, namely, arrival rate samples observed in the recent past of the execution. If the arrival rate for the current aggregate in out_buff i, j is higher than est i, j , then the maximum age for the next aggregate in that buffer is increased by one, since the application is likely to start a period of bursty exchange of application messages from P i to P j . Therefore, a slight increase in the maximum age is likely to significantly increase the AG. If the arrival rate falls below est i, j , then the maximum age is decreased by one (provided it is greater than one). An upper limit on the maximum age can be imposed in order to avoid negative effects due to AL (i.e., in order to avoid excessive delay in delivery of the aggregate at the recipient process).
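The adaptation rule of VAB can be sketched as follows. Several details are assumptions, since the text does not fix them: the arrival rate of an aggregate is taken as its size divided by its age (an age of zero is treated as one), the estimate is the mean of the last VAB_WINDOW samples, and the comparison is made against the estimate before the new sample is added. All names are hypothetical.

```c
#include <assert.h>

#define VAB_WINDOW 4                  /* hypothetical window length */

typedef struct {
    double samples[VAB_WINDOW];       /* recent arrival-rate samples */
    int    nsamples;                  /* valid entries in samples[]  */
    int    next;                      /* slot of the next sample     */
    double est;                       /* est_ij, initially zero      */
    int    max_age;                   /* current maximum age         */
    int    max_age_limit;             /* upper limit bounding AL     */
} vab_state;

static void vab_init(vab_state *v, int initial_max_age, int limit)
{
    int k;
    assert(initial_max_age >= 1 && limit >= initial_max_age);
    for (k = 0; k < VAB_WINDOW; k++)
        v->samples[k] = 0.0;
    v->nsamples = 0;
    v->next = 0;
    v->est = 0.0;
    v->max_age = initial_max_age;
    v->max_age_limit = limit;
}

/* Called when an aggregate of `messages` application messages is sent
 * after `age` ticks; adapts the maximum age of the next aggregate. */
static void vab_on_send(vab_state *v, int messages, int age)
{
    double rate = (double)messages / (double)(age > 0 ? age : 1);
    double sum = 0.0;
    int k;

    if (rate > v->est) {
        if (v->max_age < v->max_age_limit)
            v->max_age++;             /* bursty phase: wait longer   */
    } else if (rate < v->est) {
        if (v->max_age > 1)
            v->max_age--;             /* slow phase: send earlier    */
    }

    /* Update the estimate with the new sample (moving average). */
    v->samples[v->next] = rate;
    v->next = (v->next + 1) % VAB_WINDOW;
    if (v->nsamples < VAB_WINDOW)
        v->nsamples++;
    for (k = 0; k < v->nsamples; k++)
        sum += v->samples[k];
    v->est = sum / v->nsamples;
}
```

The upper limit keeps the maximum age bounded even during long bursts, which corresponds to the safeguard against AL mentioned above.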
Load Balancing
Whenever the composition between two edges is performed by a process P i , a new node is originated and P i must select a process P j (possibly j = i) which will host the new node. In order to provide good load balance, we have implemented a selection strategy for the destination process which uses approximated state information related to the load condition on each process.
In our solution we identify the number of unprocessed application messages upm stored in the buffer incoming i as the state information related to the load condition on P i . The process P i tracks the values of upm related to itself and other processes into a vector UPM i of size n (where n is the number of processes). UPM i [i] records the current value of the number of unprocessed application messages of P i . Furthermore, UPM i [ j ] records the value of the number of unprocessed application messages of P j that are known by P i . These values are updated as follows. Whenever P i sends a physical message M to P j , the value of UPM i [i] is piggy-backed onto the message, denoted M .UPM. Whenever a physical message M sent by P j to P i is received, then UPM i [ j ] is updated from M .UPM (i.e., UPM i [ j ] ← M .UPM). The information on load conditions kept by the UPM vectors is approximated for two reasons:
- There exists the possibility that when P i receives M from P j , the current value of UPM j [ j ] is different from M .UPM; and
- the current value of UPM i [i] is not an exact representation of the current load of P i , as it does not count the application messages carried by pending physical messages. These application messages represent work to be performed which has not yet been incorporated into the buffer incoming i .
We note, however, that obtaining more accurate state information on the load condition of a process would require the exchange of additional physical messages, or, at worst, a synchronization among processes which could produce unacceptable negative effects on the performance. In any case, it is important to remark that the FIFO property for communication channels guarantees that each time a physical message M sent by P j to P i is received the piggy-backed value M .UPM refers to a more recent load condition as compared to the one indicated by the current value of UPM i [ j ] . With respect to the freshness of information maintained by UPM i , we have decided not to increment UPM i [ j ] when P i sends to P j a physical message M carrying a given amount x of application messages. This choice is towards maintaining the freshness of UPM i by exploiting information on the current real load of P j carried by the physical messages explicitly sent by P j to P i .
Based on the values stored in UPM i , we have implemented a selection policy for the destination process of a new node, which is a modified round-robin. It works as follows. P i keeps a counter rr i initialized to zero, which is updated (modulo n) each time a new node is produced by P i . The current value of rr i is the identifier of that process which should host the new node according to the round-robin policy. P i actually selects P rr i as the destination if UPM i [rr i ] < UPM i [i]. Otherwise, P i selects itself as the destination for the new node. In other words, each process distributes the load in round-robin fashion, unless at the time that the load distribution decision must be taken, the local load is lower than that of the remote process which should be selected. We have decided not to select the least loaded process identified by the information maintained in UPM i . This is so as to avoid load imbalance in executions with a very large number of processes in case the majority of these processes simultaneously selects that same process as the destination for new nodes; this would originate a peak of load on that process.
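The modified round-robin policy can be sketched as follows. The names are hypothetical, the bound MAX_PROCS is arbitrary, and we assume that the round-robin candidate is read before the counter is advanced; the text leaves this ordering unspecified.

```c
#include <assert.h>

#define MAX_PROCS 64                  /* hypothetical bound on n */

typedef struct {
    int self;                         /* i: identifier of this process   */
    int n;                            /* number of processes             */
    int rr;                           /* round-robin counter rr_i        */
    int upm[MAX_PROCS];               /* UPM_i: known unprocessed counts */
} balancer;

/* A physical message from process `from` carried the value m_upm. */
static void balancer_on_receive(balancer *b, int from, int m_upm)
{
    assert(from >= 0 && from < b->n);
    b->upm[from] = m_upm;             /* UPM_i[j] <- M.UPM */
}

/* Choose the process that will host a newly originated node. */
static int balancer_select(balancer *b)
{
    int candidate = b->rr;
    b->rr = (b->rr + 1) % b->n;       /* updated modulo n at each new node */
    if (b->upm[candidate] < b->upm[b->self])
        return candidate;             /* remote process is less loaded */
    return b->self;                   /* otherwise keep the node locally */
}
```

When the candidate is the process itself, the strict comparison fails and the node stays local, so no special case is needed.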
Termination Detection
The implementation of termination detection relies on the use of additional control messages. Specifically, anytime a process distinct from the master P 0 has no more work to be performed, it sends to P 0 a message carrying information about both the number of application messages received from other processes which have already been processed, and the number of application messages produced for other processes. Such a message will be referred to as a status message. P 0 detects that the computation is over when the information carried by status messages tells P 0 that each process has already processed all the application messages it has received. The termination is notified by the master process to slave processes through transmittal of termination messages. Looking at the structure of the pseudocode in Figure 15, it can be seen that a process P i executes the check_termination() procedure only when no work to be performed has been detected (i.e., when incoming i is empty). This shows that no synchronization is required (i.e., P i sends its status message without blocking to receive an acknowledgment; it will possibly receive the termination message during a future execution of the check_termination() procedure). The master P 0 checks for incoming status messages and possibly sends termination messages only when it has no more work to be performed. Note that the consistency of the information collected by P 0 through status messages is guaranteed by the FIFO property of communication channels.
On-the-Fly Storage Recovery
At the end of the computation we get that for all nodes of the final graph, the incident edges are ghosts. However, only some of those nodes belong to the normal form of the reduction. We recall that the nodes of the initial graph belonging to the normal form are those initially having only ghost incident edges; this set of nodes will be referred to as the border. Starting from the border, we can determine the whole normal form: It contains those nodes linked to the border by a directed path.
We have embedded in the implementation a technique to discard on-the-fly nodes that do not belong to the normal form. We have made this design choice for keeping the memory usage low with the twofold aim of: (i) increasing the efficiency of the underlying virtual memory system, and (ii) allowing efficient management of the data structures maintained at the application level. With respect to the latter issue, anytime an unprocessed application message m is extracted from the buffer incoming i (see line 5 of the pseudocode in Figure 15), process P i must access information associated with the node m.target, if it already exists. Such information is maintained in the structure nodes i (m.target) (see line 6 of the pseudocode in Figure 15). As pointed out in Section 5.1, to retrieve the virtual memory address for this structure, we use a hash table with chaining for handling collisions, which keeps an active entry for each node in the list nodes i . Discarding nodes of valence zero that do not belong to the normal form keeps the number of entries in the hash table low, thus allowing highly efficient access to the table at any time.
The on-the-fly storage recovery technique we have implemented tracks each instance that a node becomes of valence zero, and removes it if there is no directed path towards nodes of the border. This is implemented through a particular type of application message we call an EOT (end-of-transmission) message. These are used just to track the absence of those paths towards nodes of the border. Specifically, for each initial node v that does not belong to the border, we insert an EOT message. This message is used to tag that node as removable once the composition of its incident edges has been performed. If we ensure that EOT messages are processed only after all those messages carrying edges destined to v have been processed, then we detect upon processing the EOT messages that no new edge will have v as its target. This means that node v, as well as all edges pointing to it, can be deleted. Before removing this node, the EOT message is propagated to the source nodes of the edges pointing to v. We note that since each node is the source of two edges, we expect the arrival of two EOT messages (one from each side) before handling the removal of the node. Therefore, if for a node v we have no EOT message destined to it, or at most only one such message, it means that v has a directed path to the border, thus it belongs to the normal form. In this case no removal takes place.
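The counting rule for EOT messages can be illustrated with the following sketch, which works on a small in-memory graph instead of exchanging messages. It models only two facts stated above: a node is removed upon the second EOT message destined to it, and before removal the EOT is propagated to the source nodes of the edges pointing to it. How EOT messages are initially injected and how they are ordered with respect to edge messages are not modeled, and all names are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 8                   /* hypothetical graph size */
#define MAX_IN    4                   /* hypothetical in-degree bound */

typedef struct {
    int eot_seen;                     /* EOT messages received so far */
    int removed;                      /* nonzero once discarded */
    int nsources;                     /* edges pointing to this node */
    int sources[MAX_IN];              /* source node of each such edge */
} gc_node;

/* Deliver one EOT message to node v. */
static void deliver_eot(gc_node *g, int v)
{
    int k;
    assert(v >= 0 && v < MAX_NODES);
    if (g[v].removed)
        return;
    g[v].eot_seen++;
    if (g[v].eot_seen < 2)
        return;                       /* wait for the EOT of the other side */
    g[v].removed = 1;                 /* node and incident edges are dropped */
    for (k = 0; k < g[v].nsources; k++)
        deliver_eot(g, g[v].sources[k]);
}
```

In this sketch a node that receives at most one EOT message is never removed, which corresponds to the case of a node having a directed path to the border.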
The guarantee that EOT messages will only be processed after all messages carrying incoming edges destined to the same node have been processed is trivially achieved, thanks to the FIFO property of communication channels.
EXPERIMENTAL RESULTS
In this section, we report experimental results demonstrating the effectiveness of our implementation, and thus of both the HC strategy underlying it and the combination of all the optimizations for the runtime behavior we have presented and embedded within PELCR. As already mentioned, the implementation has been developed using the C language and MPI as the underlying message-passing layer. A major advantage of using such a standard interface for message-passing functionalities is that it makes the software highly portable. This allowed us to test the implementation on a wide set of platforms such as SMP machines with Linux, IBM mainframes like SuperPower 3 and SuperPower 4 with AIX, shared memory Sun Ultra Sparc machines, and Alpha Digital microchannel clusters.
Specifically, we report performance results obtained on an IBM server pSeries 690, with 32 Power4 CPUs (1.3 GHz) and 64 GB RAM, running IBM AIX 5.2 ML1+ as the operating system. We have selected the results obtained with this machine as representative especially because of its larger number of available processors as compared to the other architectures. This allows us to better observe whether and how the performance provided by PELCR scales while increasing the computing power.
Before describing the details related to the experimental results, we note that there exists an approach to reduce the computation time in optimal reduction systems based on an optimization known as safe operators [Asperti and Chroboczek 1997]. It allows merging many control operators into a compound one acting as the sequence, thus exhibiting the ability to strongly decrease the number of interactions. This approach has been the basis for an implementation of the so-called BOHM (Bologna Optimal Higher-order Machine), a sequential machine for optimal reduction that has been demonstrated to provide better performance compared to nonoptimal interpreters such as CAML and HASKELL [Asperti and Guerrini 1998]. Compared to this approach, we tackle the issue of increasing the speed of the reduction in an orthogonal way. Specifically, we do not use merging of operators to reduce the number of interactions, but rather exploit the computing capabilities of multiple processors within the architecture to keep the reduction time low via parallel computation of the reduction itself. This approach can be applied to any computation issued from the geometry of interaction. Note that we experimentally observe speedups in parallel evaluation of terms typeable in systems with implicit computational complexity ELL (or LLL) [Girard 1995]; these terms are evaluated such that safe operators optimization is not applicable. As a consequence, our approach is expected to speed up execution, even for cases in which safe operators cannot be effectively employed, or when optimal reduction does not affect the efficiency of computation, as in nonhigher-order terms.
The results we report in this section refer to the following two test cases: 2), 1, 4) )),
where the term Ite represents the iteration function taking as input a function step, a function base, and an integer n, iterating the application of step to base n times, and where the term Mult simply performs the multiplication of two integer values. Its normal form represents the iterated exponential 2 2 2 4 . For the precise definition of this ELL proof net from which the dynamic graph to be executed is obtained, we refer the reader to Pedicini [1996] .
For both of these cases, the shared result of the HC strategy has a number of nodes on the order of hundreds of thousands (for DD4, this number even reaches about one and a half million), therefore they are large enough case studies to stress the behavior of our implementation.
Before presenting the results, we provide details on the main parameters we have measured in the experiments.
Measured Parameters
A measure of success of any parallel implementation is how significantly it accelerates the computation. Typically, the acceleration is expressed by the so-called speedup, evaluated as the ratio between sequential execution time on a single processor and parallel execution time on multiple processors. This is a fundamental parameter to consider, not only because it expresses the amount of increase in execution speed while increasing the power of the underlying computing system, but also because the speedup curve provides indications on how the execution speed scales. Linear speedup means that the execution speed scales linearly with the computing power. This is an indication that the parallel implementation maintains the same effectiveness, independent of the number of used processors, thus the implementation itself does not suffer, for example, from an excessive increase in the communication overhead while increasing the number of processes involved. We also report the ratio between the observed speedup and the ideal speedup that can be achieved with a given degree of parallelism, that is, with a given number of used processors. (We recall that the ideal speedup on n processors is equal to n, which means we experience no overhead, but only gain by distributing the work to be performed on the n processors.) This parameter provides indications on the extent to which the parallel implementation can be considered effective, independent of the shape of the speedup curve. Specifically, if we have a linear speedup curve but a low ratio over the ideal speedup, it means that the parallel implementation, although not suffering particularly from the increase in overhead due to parallelization while increasing the number of processors, is nonetheless ineffective, for example, due to inadequate structuring of the parallel algorithm it implements.
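The two parameters just defined can be computed as in the following sketch, where the function names are hypothetical.

```c
#include <assert.h>

/* Speedup: sequential time on one processor over parallel time. */
static double speedup(double t_seq, double t_par)
{
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Ratio between the observed speedup and the ideal speedup n. */
static double ratio_over_ideal(double t_seq, double t_par, int n)
{
    assert(n >= 1);
    return speedup(t_seq, t_par) / n;
}
```

As a usage example, with the approximate figures reported below for DD4 (about 50 seconds on a single processor and about 2.2 seconds on 32 processors), the speedup is about 22.7 and the ratio over the ideal speedup is about 0.71.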
Beyond speedup, another parameter we report is the wall-clock time for the reduction. This parameter expresses the time cost for a given reduction and also how it varies while increasing the computing power of the underlying platform. It is a fundamental parameter to report, since it provides indications of whether the speedup curve has been evaluated over a representative interval concerning the number of used processors. Specifically, if wall-clock time values of a few seconds or less are achieved while increasing the number of processors, then an additional increase in computing power does not make sense for this specific reduction (this is because a response time of few seconds or less is typically considered satisfactory, even for the case of an interactive end-user, i.e., the case in which responsiveness is actually a critical issue to address), hence the speedup has been evaluated over an adequate interval relative to the degree of parallelism.
Wall-clock time and speedup express the effectiveness of the parallel implementation when evaluated globally. However, we are also interested in observing the effects of the specific optimizations we have proposed. We therefore also report data that allow us to evaluate the benefits of the VAB message aggregation technique discussed in Section 5.2 and the effectiveness of the load-balancing policy presented in Section 5.3. To evaluate how VAB impacts communication cost with an increasing number of processors, we report the product of the average number of application messages delivered through a single MPI message, that is, the average size of the aggregate (which we will refer to as the AAS), and the number of used processors. This product is representative of the system's capacity to send application messages at the time cost of sending a single MPI message. Specifically, when using n processors, the hosted processes can send MPI messages concurrently. Therefore, within the wall-clock time of a single send operation, we are able, on average, to send n MPI messages in parallel. As a consequence, if the product of the AAS and the number of processors increases, the time cost of sending each application message decreases, with a consequent reduction of the overall communication overhead on each processor. (For completeness, we also report the plot for AAS, so as to show its behavior while increasing the number of processors.)
Concerning load distribution, we report plots of the variation over time of the number of unprocessed application messages, namely upm, stored in the incoming buffer of different processes (recall that upm has been used in Section 5.3 as information on the current load of each processor, to determine the distribution of the new nodes dynamically originated during computation). This parameter is representative of the effectiveness of the load-balancing strategy we have adopted, since it indicates whether the work list that tracks the number of edges to be composed is approximately the same for all processes at any time during the execution.
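The two optimization-specific indicators described above can be computed from raw counters as in the following minimal sketch. The function names are hypothetical, and the imbalance index (maximum upm divided by mean upm) is only an illustrative summary of the upm plots that we assume here; it is not a quantity reported in this article.

```python
from statistics import mean


def aas(app_messages: int, mpi_messages: int) -> float:
    """Average aggregate size: application messages carried per MPI message."""
    if mpi_messages <= 0:
        raise ValueError("mpi_messages must be positive")
    return app_messages / mpi_messages


def sending_capacity(aas_value: float, processors: int) -> float:
    """Product of the AAS and the number of used processors."""
    return aas_value * processors


def upm_imbalance(upm_per_process: list) -> float:
    """Hypothetical balance index: max upm over mean upm.

    A value of 1.0 means every process holds the same number of
    unprocessed application messages at the sampled instant.
    """
    avg = mean(upm_per_process)
    return max(upm_per_process) / avg if avg > 0 else 1.0
```

A growing `sending_capacity` across runs with more processors corresponds to a lower per-message send cost, while an imbalance index close to 1.0 at each sampled instant corresponds to evenly loaded incoming buffers.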
Results for DD4
The experimental measures obtained for DD4 are reported in Figures 16, 17, and 18. By examining the plots in Figure 16, we observe that the speedup curve remains linear over the whole interval for the number of used processors (i.e., up to 32), and that the ratio of the measured speedup to the ideal speedup is constant, on the order of 70%. Combined, these two plots indicate that the parallel implementation is effective in terms of both the structuring of the parallel algorithm (which yields high values with respect to the ideal speedup) and its ability to remain efficient while the number of used processors is increased. The wall-clock time curve, also reported in Figure 16, demonstrates that the speedup plots provide reliable performance indications, in the sense that the speedup has been evaluated over an adequate interval for the number of used processors. Specifically, with 32 processors, the wall-clock time for the computation is on the order of 2.2 seconds, which is not only a definite reduction compared with the sequential execution time (i.e., an execution time of about 50 seconds on a single processor), but also a satisfactory response time for an interactive end-user. With respect to the latter point, we have performed the reduction of DD4 on the same computing platform using the BOHM sequential reduction machine (version 1.1) and have observed a wall-clock time on the order of 5 seconds. Considering that, for DD4, the safe operators optimization implemented by BOHM plays a major part in reducing the number of interactions, the ability of our implementation to achieve a wall-clock time roughly 60% shorter than BOHM's demonstrates that our parallel approach and design constitute a viable orthogonal alternative for enhancing the performance of the reduction system.
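As a quick consistency check on the figures quoted above, the following sketch recomputes the speedup and its ratio to the ideal speedup from the two rounded wall-clock times given in the text (about 50 seconds on one processor, about 2.2 seconds on 32). The helper names are ours, and because the inputs are rounded, the results are only approximate.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Ratio of single-processor time to parallel wall-clock time."""
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, processors: int) -> float:
    """Measured speedup divided by the ideal speedup (the processor count)."""
    return speedup(t_sequential, t_parallel) / processors


# Rounded values taken from the text for DD4.
T1_DD4 = 50.0    # seconds on a single processor (approximate)
T32_DD4 = 2.2    # seconds on 32 processors (approximate)

dd4_speedup = speedup(T1_DD4, T32_DD4)            # a little under 23
dd4_efficiency = efficiency(T1_DD4, T32_DD4, 32)  # close to 0.7
```

The resulting ratio of about 0.7 agrees with the roughly 70% of ideal speedup read from Figure 16.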
The plots related to the behavior of the VAB aggregation technique in Figure 17 and to the variation of upm over time in Figure 18 help in understanding why the implementation remains effective while the number of used processors is increased. The AAS curve in Figure 17 shows that the average number of application messages aggregated within each MPI message decreases from about 11.5 to 3 when moving from execution on 2 processors to execution on 32 processors. This is expected, since a larger number of used processors means that each process P_i involved in the parallel execution needs to manage an increased number of channels towards other processes. As a consequence, the application messages produced by P_i must be distributed over a larger number of aggregation buffers out_buff_{i,j}, which reduces the capacity to aggregate messages in each buffer within a given time interval. However, the curve for AAS multiplied by the number of used processors, also shown in Figure 17, gives a clear indication that the system's capacity for sending application messages at a given cost increases linearly with the number of processors, with a slope of about 0.8 (recall that the ideal case for the curve of AAS multiplied by the number of processors would be a slope equal to 1). Specifically, when moving from k to 2k processors, the system's capacity to send application messages at the same time cost increases by a factor of about 1.6. This is a clear indication that VAB allows the communication overhead to scale well with the size of the underlying computing system.
Concerning the variation of upm over time, Figure 18 reports the case of execution on four processors. As evidenced by the plots, the load of unprocessed application messages stored by each process P_i within its incoming_i buffer is well distributed over the four processors during the whole execution period, which supports the claim of effectiveness for the load-balancing mechanism described in Section 5.3.
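The reason why AAS decreases as channels are added can be illustrated with a toy model. The sketch below assumes, purely for illustration, that a single sender produces a fixed number of application messages during one aggregation window, spreads them round-robin over its outgoing buffers, and flushes each non-empty buffer as one MPI message at the end of the window. None of this is taken from the actual VAB policy of Section 5.2, and the model is not meant to reproduce the measured curves in Figure 17.

```python
from collections import defaultdict


def simulate_window(num_processes: int, produced: int, sender: int = 0) -> float:
    """Return the AAS observed by one sender over one aggregation window.

    Assumptions (hypothetical): round-robin destinations, one flush per
    non-empty buffer at the end of the window.
    """
    if num_processes < 2 or produced < 1:
        raise ValueError("need at least 2 processes and 1 produced message")
    destinations = [j for j in range(num_processes) if j != sender]
    out_buff = defaultdict(list)  # destination j -> buffered application messages
    for k in range(produced):
        j = destinations[k % len(destinations)]
        out_buff[j].append(k)
    mpi_messages = len(out_buff)  # one MPI message per non-empty buffer
    return produced / mpi_messages


# Same production rate, increasing number of channels.
aas_by_size = {n: simulate_window(n, produced=24) for n in (2, 4, 8, 32)}
```

With the production rate held fixed, the per-buffer aggregate shrinks as the number of destination buffers grows, which is the qualitative effect described for the AAS curve.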
Results for EXP3
In light of the results observed for DD4, the data obtained for EXP3 provide additional information on the runtime behavior of our implementation, including the effects of a particular application on the achievable performance. The main difference with respect to DD4 is that the speedup curve does not remain linear in the number of processors. Specifically, the plots in Figure 19 show that the speedup asymptotically tends to a constant value (on the order of 12), with a consequent decrease of the ratio over the ideal speedup. From the data related to the effects of VAB and to the variation of upm over time, it can be deduced that this behavior is not due to ineffectiveness of the parallel implementation (e.g., an increase in communication overhead while increasing the number of processors, or load imbalance). Specifically, the curve in Figure 20 for AAS multiplied by the number of used processors clearly shows that in this case, too, the implementation is able to control the communication overhead while the size of the underlying computing system is increased. More precisely, the linearity of this curve, with a slope of about 1, namely the ideal slope value, indicates that the implementation controls the communication overhead even more effectively than for DD4 (recall that the slope of the same curve for DD4 was 0.8, thus lower). Moreover, the plots for upm in Figure 21 show that the load remains balanced when moving from execution on a single processor to execution on multiple processors.
Actually, this behavior of the speedup curve is due to the fact that EXP3 exhibits a final computation phase in which only a very limited number (i.e., a few units) of unprocessed application messages are present. This can be clearly observed in the plots in Figure 21 related to the variation of upm over time (we also report the case of single-processor execution, since this is a reference case for assessing the length of this final phase). In other words, during this final phase the computation becomes inherently sequential (i.e., the few unprocessed application messages, once extracted from the incoming buffer and composed through HC, produce few new application messages carrying new edges for the virtual net). As a consequence, parallelism cannot be exploited during the final phase of EXP3, which is exactly the reason why the speedup asymptotically tends to a constant value. This type of problem had already been addressed by Mackie [1997] for the case of parallelism in the form of adequate assignment of the initial nodes of the interaction net over the used processors (recall that this solution is based on a static analysis of the initial interaction nets and differs from our proposal in that we dynamically control load distribution and other performance indexes at runtime). Specifically, for Mackie's approach as well, the benefits from exploiting multiple processors in the computing system are bounded by the phases of sequential computation (if any) that are intrinsic to the specific application.
However, as additional evidence that our implementation is definitely able to exploit parallelism whenever it is present within specific execution phases, we note that if the speedup results had been computed excluding the final sequential phase, they would be even better than those obtained for DD4 (according to Figure 21, this phase lasts about three seconds, which is the reason why the wall-clock time asymptotically tends to three seconds while increasing the number of used processors). Specifically, the speedup would be on the order of at least 75% of the ideal one over the whole interval for the number of used processors.
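The bound imposed by the final sequential phase can be sketched with a simple Amdahl-style model. Here the sequential phase length of about 3 seconds comes from the text, whereas the single-processor time of 36 seconds is only an assumption: it is the value implied by the two reported asymptotes (speedup tending to about 12, wall-clock time tending to about 3 seconds), not a measured figure. The model also assumes that the remaining work parallelizes ideally, which is an idealization.

```python
def wall_clock(t1: float, t_seq: float, processors: int) -> float:
    """Sequential phase plus the remaining work divided evenly over processors."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return t_seq + (t1 - t_seq) / processors


def bounded_speedup(t1: float, t_seq: float, processors: int) -> float:
    """Speedup under the model above; it can never exceed t1 / t_seq."""
    return t1 / wall_clock(t1, t_seq, processors)


T_SEQ_EXP3 = 3.0   # seconds, final sequential phase (from the text, approximate)
T1_EXP3 = 36.0     # seconds, ASSUMED: implied by the asymptotes, not reported

exp3_limit = T1_EXP3 / T_SEQ_EXP3  # asymptotic speedup under this model
exp3_curve = [bounded_speedup(T1_EXP3, T_SEQ_EXP3, n) for n in (1, 2, 4, 8, 16, 32)]
```

Under these assumptions the speedup grows with the number of processors but saturates at the ratio between the single-processor time and the length of the sequential phase, which matches the qualitative shape described for Figure 19.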
We could not report the wall-clock time achieved with the BOHM machine for EXP3, since this machine does not support the specification of ELL terms in its syntax. However, as a final observation, we note that for ELL terms the safe operators optimization has no effect, since Lamping's abstract algorithm used to reduce these terms does not use control operators. Hence, for this type of term, BOHM should achieve performance comparable to that of single-processor execution on our machine (i.e., a wall-clock time of the same order of magnitude).
CONCLUSIONS AND FUTURE WORK
The main contribution of this work lies in showing how a functional language based on λ-calculus can be made transparently executable in a parallel/distributed environment. This result has been achieved by exploiting the decomposition of beta reduction into a set of more elementary execution steps, each independent of the others, which renders the execution model extremely flexible and easily supported by a parallel/distributed environment. Specifically, to develop the PELCR software package we exploited the properties of locality and asynchrony of directed virtual reduction, namely, the formal system providing the previously mentioned decomposition. This package manages the distribution of the computational load due to the evaluation of a λ-term in a totally transparent way, by dynamically controlling and tuning any parameter that could potentially affect the efficiency of the runtime behavior.
We have used pure lambda calculus for the sake of simplicity and because we are mainly concerned with optimal reduction, but in a broader perspective it would be interesting to apply PELCR to an enriched lambda calculus. In this respect, an immediate way to apply PELCR to evaluating the terms of an extended language consists of embedding it in linear logic. Then, the corresponding proof net of linear logic can be executed in PELCR by using the corresponding dynamic graph. We applied this strategy for the benchmark EXP3, whose dynamic graph was executed by PELCR via the translation of ELL proof nets in the geometry of interaction.
One major future improvement to make our parallel environment even more effective would be to embed within PELCR an implementation of Asperti's safe operators. This task, however, is not trivial, mainly because the heuristic algorithm developed in Asperti and Chroboczek [1997] exploits global conditions on the state of the reduction whose evaluation comes for free when using a sequential reduction approach. Hence, their solution is inherently sequential and cannot be straightforwardly applied in a parallel/distributed reduction. Also, the feasibility of any parallel/distributed algorithm for safe operators still needs to be assessed.
Another improvement that could enlarge the applicability of PELCR consists of using additional syntactic sugar for the lambda terms to be interpreted by PELCR. This extension could be obtained following the approach introduced by Mackie [1994] and then expanded upon by Pinto [2001c]. In order to include an evaluation mechanism for integer-valued functions in the geometry of interaction, Pinto shows how to extend L with extra generators for integers. In fact, this technique can be summarized as the addition of generators in L with side effects consisting of the data to be manipulated and the executables to be evaluated when interactions occur. A possible direction for future work in this area would be to map these generators onto data structures and functions in standard programming libraries.
