ParFORM: recent development ∗
arXiv:cs/0510093v1 [cs.SC] 31 Oct 2005
M. Tentyukov a,†, J.A.M. Vermaseren b and H.M. Staudenmaier a,c
a Institut für Theoretische Teilchenphysik, Universität Karlsruhe, Germany
b NIKHEF, Amsterdam
c Interfak. Institut für Anwendungen der Informatik, Universität Karlsruhe, Germany
We report on the status of our project to parallelize the symbolic
manipulation program FORM. We now have parallel versions of FORM running
on cluster and SMP architectures. These versions can be used to run
arbitrary FORM programs in parallel.
1. General concept of the current version
FORM [1] is a program for symbolic manipulation of algebraic
expressions, specialized to handle very large expressions of millions
of terms in an efficient and reliable way. That is why it is widely
used in Quantum Field Theory, where calculations often require the
evaluation of several hundred (sometimes thousands of) Feynman diagrams.
For this purpose an improvement in computing efficiency is very
important. Parallelization is one of the most efficient ways to
increase performance, so the idea of parallelizing FORM is quite
natural.
ParFORM is the parallel version of FORM; it has been developed in
Karlsruhe since 1998 [2]. By now a number of real physics applications
have been carried out with the help of ParFORM [3].
Some internal mechanisms of FORM make it very well suited for
parallelization [2,4]. The concept of parallelization is indicated in
Fig. 1: distribute the input terms among the available processors, let
each of them perform local operations on its input terms, and generate
and sort the resulting output terms. At the end of a module the sorted
streams of terms from all processors have to be merged into one final
output stream again.
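
The scheme just described can be sketched in a few lines of Python. This is a toy model and not ParFORM code: the term representation and the function names substitute, combine, slave and master are invented for illustration, and the "slaves" simply run one after another in a single process. The example expression a*x + x^2 + b*x with the substitution id x = a + b is the one shown in Fig. 1.

```python
import heapq
from itertools import product

# A term is (coefficient, monomial); a monomial is a sorted tuple of symbols,
# e.g. a*x is (1, ("a", "x")) and x^2 is (1, ("x", "x")).

def substitute(term, symbol, replacement):
    """Apply `id symbol = replacement;` to one term, generating new terms."""
    coeff, monomial = term
    # Every occurrence of `symbol` may become any symbol of the replacement sum.
    choices = [replacement if s == symbol else (s,) for s in monomial]
    return [(coeff, tuple(sorted(pick))) for pick in product(*choices)]

def combine(sorted_terms):
    """Add the coefficients of equal monomials in an already sorted stream."""
    out = []
    for coeff, mono in sorted_terms:
        if out and out[-1][1] == mono:
            out[-1] = (out[-1][0] + coeff, mono)
        else:
            out.append((coeff, mono))
    return out

def slave(chunk):
    """Local work of one slave: generate new terms, then sort them."""
    generated = [t for term in chunk for t in substitute(term, "x", ("a", "b"))]
    return combine(sorted(generated, key=lambda t: t[1]))

def master(terms, n_slaves=2):
    """Distribute the terms, collect the sorted streams, merge them."""
    chunks = [terms[i::n_slaves] for i in range(n_slaves)]
    streams = [slave(chunk) for chunk in chunks]  # sequential stand-in for parallel slaves
    merged = heapq.merge(*streams, key=lambda t: t[1])  # final sorting on the master
    return combine(merged)

expr = [(1, ("a", "x")), (1, ("x", "x")), (1, ("b", "x"))]  # a*x + x^2 + b*x
print(master(expr))
```

Because every slave returns an already sorted stream, the master only has to merge the streams and add coefficients of equal monomials; this merge is the "final sorting" step of the module.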
A master process initializes the distribution of terms and finally
collects the results. The real, time-consuming calculations, however,
are performed by the slaves. Each process is an independent stream of
commands operating on independent data. The master communicates with
the slaves by means of a message passing library; different libraries
are possible, and we use MPI (footnote 1).
The master simply distributes and collects data, so with a small number
of processors the master is almost idle. In that case one can try to
force the master to participate in the real calculations, too. On the
other hand, with an increasing number of slaves the master spends more
and more time controlling them, which may lead to early speedup
saturation. Our estimates show that for more than four processors our
Master-Slave model is adequate. Since almost all real calculations are
performed by the slaves, we calculate speedups normalized to the time
spent by the program running on two processors, one master and one
slave.

[Figure 1 diagram: the master holds the FORM input "id x = a + b;" and
"l expr = a*x+x^2+b*x+..." and sends chunks I and II of the terms via
MPI to Slave I and Slave II. Each slave generates and sorts its terms
(for example, a*x + x^2 generates a^2 + a*b + a^2 + 2*a*b + b^2, which
sorts to 2*a^2 + 3*a*b + b^2). The master does the final sorting,
outputs the result, and goes to the next module.]
Figure 1. General concept of ParFORM: a Master-Slave structure for
parallelization.

∗ Supported by SFB-TR9
† On leave from BLTP, Joint Institute for Nuclear Research, Dubna, Russia
1 See http://www-unix.mcs.anl.gov/mpi/standard.html
Using a message passing library makes it possible to parallelize FORM
on both kinds of computer architecture, i.e., with shared memory (SMP,
footnote 2) and with distributed memory (clusters).
The results for the program BAICER (footnote 3) running on an SMP
machine, the SGI Altix 3700 server with 32x 1.3 GHz/3MB-SC Itanium2
CPUs, are shown in Fig. 2. The speedup is almost linear up to 12
processors. Beyond that the speedup becomes nonlinear but is still
considerable.
The second architecture is a cluster. The results of running BAICER on
the IWR (footnote 4) Xeon cluster [6] are shown in Fig. 3.
The speedup curve has a positive slope even for more than 8 processors,
but the value of this slope is rather small, so the speedup is
reasonable only for a few nodes. On cluster computers such as [6] it
could be better to involve the master processor in the real
calculations too, but this needs to be studied in detail.
2. ParFORM on SMP
The main disadvantage of the message passing approach is the
considerable overhead caused by large data transfers. On SMP computers
one can try to eliminate this overhead by using, e.g., threads [7].
With this approach, however, several points have to be taken into
account:
2 Symmetric MultiProcessor
3 All benchmarks mentioned in the paper were made by running the same
“standard” test example [4] obtained from a package BAICER developed by
P. Baikov following methods described in [5].
4 The Institute for Scientific Computing of the Forschungszentrum
Karlsruhe.
[Figure 2 plots: computing time (sec) and speedup versus number of
processors (4 to 32).]
Figure 2. Computing time and speedup for the test program BAICER on the
SGI Altix 3700 server with 32x Itanium2 processors (1.3 GHz), an
SMP-type machine.
Within a typical SMP machine, all the memory is uniformly available to
each processor, the so-called “Uniform Memory Access”. All memory
accesses go through the same shared memory bus. This works quite well
for a relatively small number of CPUs. As the number of CPUs increases,
the shared bus becomes a problem because of the collision rate between
multiple CPU requests on the single memory bus.
In order to avoid these scalability limits of
SMP architectures, the “Non-Uniform Mem-
ory Access” (NUMA) architecture was designed.
NUMA assumes that each processor has its own
local memory but it can also access memory
owned by other processors.
As a result, the concept of memory affinity has to be introduced:
memory may be situated at different “distances” from the processor. On
an SGI Altix, the ratio of remote to local memory access times varies
from 1.9 to 3.5, depending on the relative locations of the processor
and the memory.

[Figure 3 plots: computing time (sec) and speedup versus number of
processors (2 to 8).]
Figure 3. Computing time and speedup for the test program BAICER on the
cluster of dual Intel Xeon, 2.4 GHz, 4x-InfiniBand.

Usually this is not a problem, since nearly all CPU architectures use a
cache to exploit locality of reference in memory accesses. Because
nearly everything is in a cache, one may often safely ignore problems
resulting from differences in memory affinity. This is not so in the
case of FORM, as discussed below.
As a consequence of the frequent use of caches just described, the new
problem of cache coherency arises: if one of the processors modifies
some piece of data (i.e., performs a “write” operation), then the other
processors have access only to an out-of-date copy of these data stored
in their caches. The stale cache data therefore have to be invalidated.
Usually, NUMA computers use special-purpose hardware to maintain cache
coherence. Such systems are called “cache-coherent NUMA”, or ccNUMA
[8]. The worst case for such an approach is mutual cache invalidation,
when two (or more) processors are writing to the same memory region.
Let us now consider how ccNUMA can be used for a well-parallelizable
problem such as the multiplication of two matrices.
Take the simplest algorithm, which is well suited for a multithreaded
process: each thread reads one row of the first matrix and one column
of the second matrix, and sums up the products in a local or even
register variable. In this case the only operation on shared memory is
“read from memory”. Practically all arithmetic operations are performed
on local memory, and only at the end is the result written into the
global (shared) memory.
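
The access pattern of this algorithm can be made concrete with a short Python sketch. The function name matmul_threaded is invented for illustration, and for brevity one thread handles a whole row of the result rather than a single element. CPython threads do not execute bytecode in parallel, so the sketch only illustrates the memory access pattern (shared inputs are read, a local accumulator is updated, and each result element is written exactly once); it is not a performance demonstration.

```python
import threading

def matmul_threaded(A, B):
    """Multiply matrices A (n x m) and B (m x p), one thread per result row."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]      # shared result matrix

    def worker(i):
        row = A[i]                       # shared input: read only
        for j in range(p):
            acc = 0                      # local accumulator, private to this thread
            for k in range(m):
                acc += row[k] * B[k][j]  # shared input: read only
            C[i][j] = acc                # single write to shared memory per element

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C

print(matmul_threaded([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

No two threads ever write to the same element of C, and all intermediate sums stay in a local variable.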
[Figure 4 diagram: Thread 1 and Thread 2 each treat terms of
a*x ... +x^2, generating +a^2 +a*b +a^2 +a*b +a*b +b^2, which is then
sorted into +2*a^2 +3*a*b +b^2.]
Figure 4. Possible multithreaded approach of
FORM parallelization.
Unfortunately, the structure of FORM is quite different. Suppose that
each thread treats one term (Fig. 4). The thread then produces a lot of
new terms, which have to be stored in shared memory. This would lead to
permanent cache invalidation. It indicates that the internal structure
of FORM is not well suited for multi-threaded parallelization on
ccNUMA architectures.
Alternatively, one can exploit the multi-process Master-Slave
structure.
[Figure 5 diagram: three processes, PROCESS0 (Master), PROCESS1
(Slave 1) and PROCESS2 (Slave 2), each with its own DATA. The input for
Slave 1 and Slave 2 and their output pass through SHARED DATA buffers;
processes are marked WORKING or WAITING, and the master holds the
RESULT.]
Figure 5. Master-Slave approach on SMP without MPI.
We can use the Master-Slave model discussed above with multiple
processes, but now of course without MPI. For communication between the
master and the slaves we use shared memory, allocating the shared
memory buffers “close” to each slave (Fig. 5). We refer to this model
as “Shared Memory” (SM). As before, the master splits the data into
chunks and distributes them among the slaves by placing the data into
the shared memory buffers. The slaves manipulate these data in their
local memory, and the (pre-sorted) results are collected by the master.
Here we have explicit control over memory affinity, and the message
passing bottleneck is gone.
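
A minimal sketch of this data flow in Python is given below. It is not the ParFORM-SM implementation and rests on several assumptions: a POSIX system where the "fork" start method is available, anonymous mmap regions standing in for the shared memory segments, integers standing in for FORM terms, and squaring standing in for the local operations. The names write_ints, read_ints, sm_slave and sm_master are invented for illustration. Placing each buffer close to its slave (memory affinity) is not modeled.

```python
import heapq
import mmap
import multiprocessing as mp
import struct

def write_ints(buf, values):
    """Store a length-prefixed list of signed 64-bit integers at the start of buf."""
    buf.seek(0)
    buf.write(struct.pack(f"<q{len(values)}q", len(values), *values))

def read_ints(buf):
    """Read back a list written by write_ints."""
    buf.seek(0)
    (count,) = struct.unpack("<q", buf.read(8))
    return list(struct.unpack(f"<{count}q", buf.read(8 * count)))

def sm_slave(buf):
    """Slave: copy the chunk into local memory, work on it, return it pre-sorted."""
    terms = read_ints(buf)                    # shared buffer -> local memory
    local = sorted(t * t for t in terms)      # stand-in for generating and sorting
    write_ints(buf, local)                    # local memory -> shared buffer

def sm_master(terms, n_slaves=2, buf_size=1 << 16):
    """Master: split data into chunks, hand them to slaves, merge the results."""
    ctx = mp.get_context("fork")              # assumption: POSIX only
    chunks = [terms[i::n_slaves] for i in range(n_slaves)]
    # One anonymous shared mapping per slave; forked children see the same memory.
    buffers = [mmap.mmap(-1, buf_size) for _ in chunks]
    for buf, chunk in zip(buffers, chunks):
        write_ints(buf, chunk)                # chunk must fit into buf_size bytes
    workers = [ctx.Process(target=sm_slave, args=(buf,)) for buf in buffers]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    if any(w.exitcode != 0 for w in workers):
        raise RuntimeError("a slave process failed")
    result = list(heapq.merge(*(read_ints(buf) for buf in buffers)))
    for buf in buffers:
        buf.close()
    return result

if __name__ == "__main__":
    print(sm_master([3, -1, 2, -4, 0]))
```

The only synchronization here is the join on each slave; a long-running implementation would instead need some signalling between master and slaves, which this sketch leaves out.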
The Master-Slave model makes it possible to optimize the communication
between the slaves and the master. For example, no direct communication
between slaves is allowed (footnote 5), so a lot of low-level
optimization becomes available once the structure is restricted to this
communication topology.
3. First results
We implemented the ideas described in the previous section in ParFORM
(we call the result ParFORM-SM) and present first results below.
5 MPI has a “peer-to-peer” structure, which is much more complicated.
Fig. 6 compares the new results with those from Fig. 2. We immediately
see a performance improvement of about 20% compared with the previous
MPI version. The most important observation, however, is that the
communication overhead is now almost negligible, as can be seen by
comparing the first two data points in Fig. 6 a).
[Figure 6 plots: a) time (s) and b) speedup versus number of processors
(2 to 30), each with one curve for MPI and one for SM.]
Figure 6. Results of running the test program on
SGI Altix 3700 with MPI-based communications
(MPI) and with Shared Memory segments (SM)
normalized to the sequential version.
Here we normalize the results not to the two-processor time but to the
time spent by the corresponding sequential version of the program.
Comparing the times for one processor (sequential program) and two
processors for the MPI variant (solid line), we see a performance
reduction of about 20%. The reason is the communication overhead.
Indeed, in two-processor mode the single slave does almost all of the
work (except the final sorting), and the program spends some extra time
on communication between the master and the slave.
For the shared memory based program, the difference between the one-
and two-processor regimes is not observable. This indicates that the
communication overhead has no real significance in the SM model.
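
The two normalizations used in this paper can be written side by side. The symbols below are notation introduced here only for illustration: T(n) is the running time on n processors, with T(1) the time of the sequential program, S_2(n) is the speedup normalized to one master and one slave, and S_1(n) is the speedup normalized to the sequential version.

```latex
\begin{align}
  S_2(n) &= \frac{T(2)}{T(n)}, \\
  S_1(n) &= \frac{T(1)}{T(n)}, \\
  S_1(n) &= \frac{T(1)}{T(2)}\, S_2(n).
\end{align}
```

With these definitions, a communication overhead that makes T(2) larger than T(1) shows up as S_1(2) < 1, while T(2) close to T(1) gives S_1(2) close to 1.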
As the number of processors increases, another bottleneck arises: the
time for the final sorting becomes more and more significant. Since
this sorting is performed only by the master, all the slaves are idle
during this stage. This explains the speedup saturation around 30 nodes
for both the MPI and SM approaches.
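
A rough way to see why a sequential final sorting stage caps the speedup is an Amdahl-type estimate. This is only a sketch under simplifying assumptions that are not part of the measurements: the running time splits into a part T_par that is shared evenly among the n - 1 slaves and a part T_sort spent by the master alone, T_sort does not depend on n, and communication is neglected. The symbols T_par, T_sort, T(n) and S(n) are introduced here for illustration.

```latex
\begin{align}
  T(n) &\approx \frac{T_{\mathrm{par}}}{n-1} + T_{\mathrm{sort}}, \qquad n \ge 2, \\
  S(n) &= \frac{T_{\mathrm{par}} + T_{\mathrm{sort}}}{T(n)}
        \approx \frac{T_{\mathrm{par}} + T_{\mathrm{sort}}}
                     {T_{\mathrm{par}}/(n-1) + T_{\mathrm{sort}}}, \\
  \lim_{n\to\infty} S(n) &= 1 + \frac{T_{\mathrm{par}}}{T_{\mathrm{sort}}}.
\end{align}
```

In this simple model the speedup cannot exceed 1 + T_par/T_sort however many slaves are added, and the bound rises only if the final sorting itself is made faster.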
4. Outlook
We briefly discuss the various aspects of the models and architectures
described above.
On ccNUMA computers we should use, instead of MPI, the multi-process
model (see Sect. 2) with multiple shared memory segments. The
corresponding shared memory approach has been developed and tested, and
it demonstrates a stable performance improvement of around 20%. The
communication overhead is negligible and the main bottleneck is the
final sorting stage, so it seems reasonable to parallelize the final
sorting first.
On clusters, there are no alternatives to MPI
at the moment.
In the present cluster version of ParFORM the communication overhead is
quite large, and thus it can only be used for a relatively small number
of nodes. In particular, it seems that it would be advantageous if the
master also participated in the real calculations.
Colleagues interested in using the ParFORM software should contact
M. Tentyukov.
REFERENCES
1. J.A.M. Vermaseren, Symbolic Manipulation with FORM, CAN (Computer
Algebra Nederland), Kruislaan 413, 1098 SJ Amsterdam (1991);
J.A.M. Vermaseren, math-ph/0010025
2. D. Fliegner, A. Retey and J.A.M. Vermaseren, hep-ph/9906426;
D. Fliegner, A. Retey and J.A.M. Vermaseren, hep-ph/0007221
3. A. Retey and J.A.M. Vermaseren, Nucl. Phys. B604 (2001) 281-311;
P.A. Baikov, K.G. Chetyrkin and J.H. Kuhn, Phys. Rev. Lett. 88 (2002)
012001; P.A. Baikov, K.G. Chetyrkin and J.H. Kuhn, Phys. Lett. B559
(2003) 245-251; P.A. Baikov, K.G. Chetyrkin and J.H. Kuhn, Phys. Rev.
D67 (2003) 074026; P.A. Baikov, K.G. Chetyrkin and J.H. Kuhn, Eur.
Phys. J. C33 (2004) S650-S652; S. Bekavac, hep-ph/0505174; A. Kotikov,
J.H. Kuhn and O. Veretin, “Two-loop formfactor in theories with mass
gap”, in preparation
4. M. Tentyukov et al, “Parallel Version of the Symbolic Manipulation
Program FORM”, in: V.G. Ganzha et al (Eds.), Proceedings of the CASC
2004, Technische Universität München, Garching, Germany; cs.SC/0407066
5. P.A. Baikov, Phys. Lett. B385 (1996) 404-410; P.A. Baikov, Phys.
Lett. B474 (2000) 385-388; hep-ph/0507053
6. U. Schwickerath and A. Heiss, Nucl. Inst. Meth. in Phys. Research A,
534 (2004) 130-134; see also: http://www.fzk.de/infiniband
7. The textbook: A. Tanenbaum, “Modern Operating Systems”, 2nd ed.,
Prentice Hall, 2001; online Wikipedia:
http://en.wikipedia.org/wiki/Thread_(computer_science)
8. http://en.wikipedia.org/wiki/Non-uniform_memory_access;
SGI-specific: J. Laudon and D. Lenoski, “The SGI Origin ccNUMA Highly
Scalable Server,” SGI Published White Paper, March 1997
