Parameterized Partition Valuation for Parallel Logic Simulation by Hering, Klaus et al.
Proc. Int. Conf. on Par. and Distr. Comp. and Netw. (PDCN'97), p.144-150, 1997.
PARAMETERIZED PARTITION VALUATION
FOR PARALLEL LOGIC SIMULATION
KLAUS HERING, REINER HAUPT and UDO PETRI
Institute of Computer Science
University of Leipzig
Augustusplatz 10-11, 04109 Leipzig
Germany
ABSTRACT
Parallelization of logic simulation at the register-transfer
and gate level is a promising way to accelerate the extremely
time-consuming system simulation processes that arise during
the design of whole processor structures. The background of
this paper is the functional simulator parallelTEXSIM, which
realizes simulation based on the clock-cycle algorithm over
loosely-coupled parallel processor systems. In preparation for
parallel cycle simulation, partitioning of hardware models is
necessary, and this partitioning essentially determines the
efficiency of the following simulation.
We introduce a new method of parameterized partition valuation
for use within model partitioning algorithms. It is based on a
formal definition of parallel cycle simulation involving a model
of parallel computation called Communicating Processors.
Parameters within the valuation function permit consideration of
specific properties related to both the simulation target
architecture and the hardware design to be simulated. Our
partition valuation method allows performance estimation for the
corresponding parallel simulation. This has been confirmed by
tests on several models of real processors, for instance the
PowerPC 604, with parallel simulation running on an IBM SP2.
KEYWORDS:
Parallel logic simulation; Model partitioning algorithms;
Performance estimation; IBM SP2
1 INTRODUCTION
Owing to advancing technological capabilities, the attainable
complexity of VLSI designs is growing rapidly. Therefore, the
employment of verification tools in all design phases is
unavoidable. Simulation is a very important VLSI design
verification method. The background of our work is functional
simulation at the register-transfer and gate level (logic
simulation) without consideration of timing aspects. In [1] a
simulation strategy is presented with underlying hardware models
embodying complete processor structures and simulation stimuli
representing microprograms or machine instruction sequences.
During system simulation, time-consuming simulation runs for
final validation of complex designs are considered. Aiming at
significant run-time reductions for such processes we have
parallelized the sequential functional simulator TEXSIM¹, which
operates on the basis of the clock-cycle algorithm. The simulator
parallelTEXSIM is documented in [2]. We chose a parallelization
approach making use of model-inherent parallelism. Within a
corresponding parallel cycle simulation, several simulator
instances co-operate over a loosely-coupled processor system,
each instance simulating a part of a synchronous hardware design.
Therefore, in preparation for parallel simulation, partitioning
of the whole hardware model is necessary, and this partitioning
essentially determines the run-time behaviour of the following
simulation.
Starting from ideas of D. Zike and W. Roesner presented in [3] we
have developed a hierarchical partitioning strategy, outlined in
[4], based on fan-in cones as elementary building blocks for
partitions. Related work on cone-based partitioning can be found
in [5] and [6]. Within the taxonomy of partitioning techniques
given in [7] our strategy embodies a bottom-up clustering
approach. The partitioning algorithms considered at the end of
this paper follow a special two-stage strategy, where
Evolutionary Algorithms are applied after a fast pre-partitioning
phase which reduces the problem complexity.
When partitioning problems are formulated as combinatorial
optimization problems, partitions are related to quantities via
cost functions (partition valuation). With respect to parallel
simulation as our special subject, the assigned values more or
less directly express a connection to parallel simulation
run-time. In this paper, we introduce a parameterized partition
valuation function on the basis of a formal definition of
parallel cycle simulation as realized by parallelTEXSIM. The
corresponding framework of formal concepts is fully presented in
[8]. Within our valuation method aspects of workload and
interprocessor communication are combined. Parameters derived
from pre-simulation and machine benchmarking allow specific
properties of both the hardware design and the simulation target
architecture to flow into partition valuation. The time-driven
nature of the underlying simulation allows direct deduction of
run-time estimations for sequences of cycles from estimations
concerning one cycle.

¹ Developed by IBM.
Our paper is organized as follows. In Section 2 the simulator
parallelTEXSIM is briefly introduced. Section 3 provides the
definition of a Structural Hardware Model (SHM). Furthermore,
model partitions are introduced as families of cone sets and
partition-based Parallel Structural Hardware Models (PSHMs) are
constructed as sets of SHMs. In Section 4 PSHMs are combined with
a model of parallel computation called Communicating Processors
(CP) to define Parallel Cycle Simulation (PCS) as a special
behaviour of CP. Based on the preceding concepts a parameterized
partition valuation function is developed in Section 5.
Experimental results related to model partitioning and
corresponding parallel simulation on an IBM SP2 system for real
processor designs are given in Section 6. Finally, Section 7
summarizes the work presented and outlines future objectives.
2 THE PARALLEL SIMULATOR
The simulator parallelTEXSIM was implemented by D. Döhler under
the AIX Parallel Environment², starting from the sequential
simulator TEXSIM, which was developed by D. Zike (IBM). TEXSIM
performs logic simulation at the register-transfer and gate level
in zero-delay mode using the clock-cycle algorithm. The
simulation of one clock-cycle mainly consists of two parts:
evaluation of (rank-ordered) combinational logic and updating of
storing elements (latches), which represent cycle boundaries.
During the simulation of one cycle with parallelTEXSIM several
simulator instances sTEXSIM (sequential TEXSIM enriched by a
communication shell) co-operate over a loosely-coupled processor
system, each simulating a part (model block) of the whole design
under consideration. At cycle boundaries collective communication
takes place. All interprocessor communication is realized using
the Message Passing Library (MPL) of the AIX Parallel
Environment. The parallel simulator runs both on IBM SP2 systems
with processors coupled via a High-Performance Switch and on IBM
RS/6000 workstation clusters. Besides the sTEXSIM components, one
master component mTEXSIM is responsible for the realization of
the parallelTEXSIM Application Programming Interface (API),
providing the possibility of simulation control by dynamically
linked clients. Furthermore, mTEXSIM co-ordinates the sTEXSIM
components within parallel simulation.

² Product of IBM.
3 HARDWARE MODELS
The graph models introduced in the following provide
a formal basis for the development, investigation and
implementation of model partitioning algorithms in the
context of parallel cycle-based simulation. They are
not restricted to the special simulator TEXSIM.
3.1 THE BASIC MODEL
Essential components of our basic structural hardware model are
given by a family of pairwise disjoint sets which comprises the
set $M_E$ of logical boxes (logical gates, multiplexers, ...),
the set $M_I$ of input boxes, the set $M_O$ of output boxes, the
set $M_L$ of storing boxes (latches) and the set $M_S$ of nets
representing wires. With $M_B = M_E \cup M_I \cup M_O \cup M_L$
and $M_B, M_S \neq \emptyset$, a directed bipartite graph

$$M = (M_B, M_S, M_R)$$

is called Structural Hardware Model (SHM) if $M_I$ is the set
of all sources of $M$, $M_O$ is the set of all sinks of $M$ and
any directed cycle in $M$ includes at least one element of $M_L$.
$M_R \subseteq (M_B \times M_S) \cup (M_S \times M_B)$ describes
the connection of elements (boxes) of $M_B$ with nets of $M_S$.
There are no directed cycles within SHMs covering elements of
$M_E \cup M_S$ exclusively. With respect to the underlying
hardware designs this reflects the absence of asynchronous
feedback in combinational logic. Figure 1 roughly illustrates a
SHM. Thick arrows represent subsets of $M_S$.
FIGURE 1: STRUCTURAL HARDWARE MODEL
For a formal description of (sequential) clock-cycle simulation
over a SHM $M$ we regard the boxes of $M_B$ as elemental carriers
of simulation activities. To identify these activities we
introduce a set of abstract actions

$$A = A_I \cup A_E \cup A_O \cup A_L$$

with a bijective assignment function $a : M_B \to A$ assuming
$a(M_\omega) = A_\omega$ for $\omega \in \{E, I, O, L\}$. With
$\parallel$ denoting sequence concatenation and $A^+$
representing the set of all finite non-empty sequences over $A$,
special sequences

$$s_{seq} \in A_I^+ \parallel A_E^+ \parallel A_O^+ \parallel A_L^+$$

are introduced as Sequential Cycle Simulation (SCS) with
respect to $M$. In particular, each action of $A$ appears exactly
once within a SCS and the order of actions belonging to $A_E$ is
restricted by a levelizing of $M_E$ which is determined by $M_R$.
Note that our consideration is restricted to the simulation of
one clock-cycle. In the context of this paper actions are not
supplied with semantic details such as, for instance, state
transferring functions. For the partition valuation presented in
Section 5 they are considered as sources of simulation expense.
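
The ordering constraint on $A_E$ can be illustrated by a minimal Python sketch. It rests on simplifying assumptions that are not part of the formal model: nets are collapsed into direct box-to-box edges, every action is identified with its box (using the bijection $a$), and the names `levelize` and `sequential_cycle_simulation` are hypothetical and do not stem from TEXSIM.

```python
from collections import defaultdict


def levelize(kind, drives):
    """Assign a level to every logical box (kind 'E').

    kind:   dict box -> 'I' | 'E' | 'O' | 'L'
    drives: dict box -> list of boxes fed by it (nets collapsed, an assumption)
    A logical box gets level 1 + the maximum level of its logical
    predecessors; input boxes and latches do not contribute a level.
    """
    preds = defaultdict(list)
    for src, dsts in drives.items():
        for dst in dsts:
            preds[dst].append(src)

    level = {}

    def visit(box, active):
        if box in level:
            return level[box]
        if box in active:
            # A cycle through logical boxes only violates the SHM definition.
            raise ValueError("combinational cycle at %r" % (box,))
        active.add(box)
        lvl = 1 + max(
            (visit(p, active) for p in preds[box] if kind[p] == 'E'),
            default=0,
        )
        active.discard(box)
        level[box] = lvl
        return lvl

    for box, k in kind.items():
        if k == 'E':
            visit(box, set())
    return level


def sequential_cycle_simulation(kind, drives):
    """Return one SCS: inputs, levelized logic, outputs, latches."""
    level = levelize(kind, drives)

    def of_kind(k):
        return sorted(b for b in kind if kind[b] == k)

    logic = sorted(level, key=lambda b: (level[b], b))
    return of_kind('I') + logic + of_kind('O') + of_kind('L')


# Tiny illustrative model: one input, one latch, two gates, one output.
kind = {'in1': 'I', 'l1': 'L', 'g1': 'E', 'g2': 'E', 'out1': 'O'}
drives = {'in1': ['g1'], 'l1': ['g1'], 'g1': ['g2'], 'g2': ['out1', 'l1']}
print(sequential_cycle_simulation(kind, drives))
```

Any order that respects the levels is a valid SCS; sorting by name within a level is only a convenient way to make the sketch deterministic.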
3.2 CONE-BASED PARTITIONS
Within our model partitioning approach for parallel cycle
simulation we take fan-in cones as basic building blocks for
partitions. This choice allows building of sub-models for
parallel cycle simulation over which sTEXSIM instances in their
simulation kernel can work in the same way as TEXSIM instances.
Let $M$ be a SHM and $x \in M_E \cup M_L \cup M_O$. The fan-in
cone $co(x)$ (with respect to $M$) is defined as the subset of
$M_B$ containing $x$ itself and all logical boxes of $M_E$ for
which a directed path to $x$ exists which does not cover elements
of $M_I \cup M_L \cup M_O$. For partitioning of $M$, special
fan-in cones (shortly called cones) given by

$$Co(M) = \{ co(x) \mid x \in M_L \cup M_O \}$$

are taken into consideration. Then, a partition $\Pi$ of $Co(M)$
in the mathematical sense is called a partition of $M$.
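
The cone definition can be sketched as a backward traversal that only passes through logical boxes. The following Python fragment is an illustration under assumptions: the model is given by a hypothetical `preds` dictionary (box to driving boxes, nets collapsed), and the function names are invented for this example.

```python
def fan_in_cone(x, kind, preds):
    """co(x): x itself plus every logical box ('E') from which a path
    to x exists that does not pass through input, latch or output boxes."""
    cone = {x}
    stack = [x]
    while stack:
        box = stack.pop()
        for p in preds.get(box, ()):
            # Only logical boxes are collected and traversed further,
            # so paths never cross an 'I', 'L' or 'O' box.
            if kind[p] == 'E' and p not in cone:
                cone.add(p)
                stack.append(p)
    return frozenset(cone)


def cones(kind, preds):
    """Co(M): the fan-in cones of all latches and output boxes."""
    return {x: fan_in_cone(x, kind, preds)
            for x in kind if kind[x] in ('L', 'O')}


# Illustrative model: gate g1 feeds both the latch cone and the output cone.
kind = {'in1': 'I', 'g1': 'E', 'g2': 'E', 'g3': 'E', 'l1': 'L', 'out1': 'O'}
preds = {'g1': ['in1', 'l1'], 'g2': ['g1'], 'g3': ['g1'],
         'l1': ['g2'], 'out1': ['g3']}
co = cones(kind, preds)
print(sorted(co['l1']), sorted(co['out1']))
```

In this toy model the box `g1` belongs to both cones, which is exactly the cone overlapping situation discussed next.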
Different elements (cones) of $Co(M)$ may have common boxes (cone
overlapping). If we assume that a partition component determines
the model part to be handled by a simulator instance on a single
processor during parallel cycle simulation, then overlapping
cones as elements of different partition components stand for
replication of simulation work. Besides this drawback, the
cone-based partitioning approach has the advantage that
interprocessor communication during parallel cycle simulation is
concentrated at cycle boundaries.
3.3 THE PARALLEL STRUCTURAL
HARDWARE MODEL
Partitions of a SHM M are now translated into sets of
"sub-models" of M . Thereby, for each partition com-
ponent (representing a cone set) a corresponding SHM
is built.
Let us consider a SHM $M$, a partition $\Pi$ of $M$ and a cone
set $C \in \Pi$. The construction of sub-models of $M$ with
respect to $\Pi$ as triplets

$$M^C_\Pi = \left( M^C_B, M^C_S, M^C_R \right)$$

with $M^C_B = M^C_E \cup M^C_I \cup M^C_O \cup M^C_L$ is
described in detail in [8]. For a short consideration of the
component determination we assume $B^C = \bigcup_{c \in C} c$.
Then $M^C_E$ and $M^C_L$ are defined as follows:

$$M^C_E = B^C \cap M_E, \qquad M^C_L = B^C \cap M_L. \tag{3.1}$$
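
Equation (3.1) and the replication effect of overlapping cones can be made concrete with a short Python sketch. The data layout (a hypothetical `cone_of` dictionary keyed by the latch or output box heading each cone, and a partition given as a list of lists of such heads) is an assumption made for illustration only.

```python
def component_boxes(component, cone_of):
    """B^C: union of all cones belonging to the partition component C."""
    boxes = set()
    for head in component:
        boxes |= cone_of[head]
    return boxes


def logic_and_latches(component, cone_of, kind):
    """Return (M_E^C, M_L^C) according to (3.1)."""
    b = component_boxes(component, cone_of)
    m_e = {x for x in b if kind[x] == 'E'}
    m_l = {x for x in b if kind[x] == 'L'}
    return m_e, m_l


# Two cones sharing the logical box g1, placed in different components.
kind = {'g1': 'E', 'g2': 'E', 'g3': 'E', 'l1': 'L', 'out1': 'O'}
cone_of = {'l1': {'l1', 'g2', 'g1'}, 'out1': {'out1', 'g3', 'g1'}}
partition = [['l1'], ['out1']]

simulated = sum(len(logic_and_latches(c, cone_of, kind)[0]) for c in partition)
distinct = sum(1 for k in kind.values() if k == 'E')
print(simulated, distinct)  # more simulated than distinct boxes means replication
```

Here four logical box evaluations are performed per cycle although the model contains only three logical boxes, because `g1` is evaluated in both components.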
$M^C_{O,O} = B^C \cap M_O$ yields the set of global output boxes
of $M^C_\Pi$ ($M^C_{O,O} \subseteq M^C_O$). All elements of $M_I$
"feeding" $C$ (elements from which a path of length 2 exists
leading to an element of $B^C$) form the set $M^C_{I,I}$ of
global input boxes of $M^C_\Pi$ ($M^C_{I,I} \subseteq M^C_I$). In
the context of communication processes at cycle boundaries
between $M^C_\Pi$ and other sub-models related to $\Pi$,
"artificial" boxes (not present within $M_B$) are introduced.
Thereby, input and output boxes appear as ordered pairs
$(C^*, s)$ and $(s, C^{**})$, respectively ($s \in M_S$;
$C^*, C^{**} \in \Pi$; $C^*, C^{**} \neq C$). $M^{C^*}_\Pi$ is to
be interpreted as communication source and $M^{C^{**}}_\Pi$
represents a communication target (with respect to $M^C_\Pi$).
$s$ embodies a physical connection in the context of $M$
"leaving" $C^*$ and "leading" to $C$, or "leaving" $C$ and
"leading" to $C^{**}$. With $M^C_{I,L}$ built from all input
boxes $(C^*, s)$ and $M^C_{O,L}$ built from all output boxes
$(s, C^{**})$, we set

$$M^C_I = M^C_{I,I} \cup M^C_{I,L}, \qquad M^C_O = M^C_{O,O} \cup M^C_{O,L}. \tag{3.2}$$
$M^C_S$ and $M^C_R$ are constructed in a straightforward manner
(see [8]). The resulting sub-model $M^C_\Pi$ is proved to be a
SHM. The set

$$M_\Pi = \left\{ M^C_\Pi \mid C \in \Pi \right\} \tag{3.3}$$

is called Parallel Structural Hardware Model (PSHM) with
respect to $\Pi$. In Figure 2 a sub-model $M^C_\Pi$ in the
context of a PSHM is represented schematically.
FIGURE 2: PARALLEL STRUCTURAL HARDWARE MODEL
As for SHMs not standing in the context of a PSHM, the behaviour
of PSHM components is again defined as an action sequence. For a
sub-model $M^C_\Pi$ of $M$ with respect to $\Pi$ we consider an
action set

$$A^C = A^C_E \cup A^C_{I,I} \cup A^C_{I,L} \cup A^C_{O,O} \cup A^C_{O,L} \cup A^C_L \cup \{c\} \tag{3.4}$$

with a bijective assignment function
$a : M^C_B \to A^C \setminus \{c\}$ assuming
$a\left(M^C_\omega\right) = A^C_\omega$ ($\omega$ representing an
arbitrary variant of the lower indices appearing in (3.4)).
Different from SCS, a special action $c$ not bound to a special
box is involved, representing component communication at cycle
boundaries. $A^C$ reflects the splitting of the sets of input and
output boxes within $M^C_\Pi$. A sequence
$s^C_{seq} \in \left(A^C\right)^+$ is called Extended
Sequential Cycle Simulation (ESCS) with respect to $M^C_\Pi$
if each action of $A^C$ appears exactly once within $s^C_{seq}$,
actions of $A^C_E$ are ordered with respect to a levelizing of
$M^C_E$ (determined by $M^C_R$) and $s^C_{seq}$ has the following
structure:

$$s^C_{seq} = s^C_{cycle} \parallel s^C_{comm}, \tag{3.5}$$

$$s^C_{cycle} \in {A^C_{I,I}}^+ \parallel {A^C_E}^+ \parallel {A^C_{O,O}}^+ \parallel {A^C_L}^+, \tag{3.6}$$

$$s^C_{comm} = s^C_{pre} \parallel (c) \parallel s^C_{post}, \tag{3.7}$$

$$s^C_{pre} \in {A^C_{O,L}}^+, \qquad s^C_{post} \in {A^C_{I,L}}^+.$$
(3.5) reflects the restriction of communication between
components involved in parallel cycle simulation to cycle
boundaries. $s^C_{cycle}$ (3.6) appears as SCS with respect to
$M^C_\Pi$ modified by omitting the actions from
$A^C_{I,L} \cup A^C_{O,L}$. $s^C_{comm}$ (3.7) represents three
phases of communication-related work. $s^C_{pre}$, the first one,
is devoted to the preparation of interprocessor communication
under the sending aspect with respect to $M^C_\Pi$ (extraction of
sub-model data with subsequent placement in communication-related
structures). Then, $c$ stands for a collective communication
action at cycle boundaries. Finally, $s^C_{post}$ represents
post-processing of interprocessor communication under the
receiving aspect with respect to $M^C_\Pi$ (extraction of data
from communication-related structures with subsequent placement
in sub-model structures).
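
The structure (3.5) to (3.7) can be written down directly as a sequence assembly. The Python sketch below assumes that the actions of one component are already grouped by their lower index and that the logical actions are supplied in levelized order; the dictionary keys and action names are purely illustrative.

```python
def extended_scs(actions):
    """Assemble one ESCS from a component's grouped actions.

    actions: dict with the keys 'I,I', 'E', 'O,O', 'L', 'O,L', 'I,L',
             each mapping to a list of actions; the 'E' list must
             already respect a levelizing of the logical boxes.
    """
    s_cycle = actions['I,I'] + actions['E'] + actions['O,O'] + actions['L']  # (3.6)
    s_pre = actions['O,L']    # preparation of sending
    s_post = actions['I,L']   # post-processing of receiving
    s_comm = s_pre + ['c'] + s_post                                          # (3.7)
    return s_cycle + s_comm                                                  # (3.5)


actions = {
    'I,I': ['in1'], 'E': ['g1', 'g2'], 'O,O': ['out1'], 'L': ['l1'],
    'O,L': ['send_s7'], 'I,L': ['recv_s3'],
}
print(extended_scs(actions))
```

The single action `c` separates the sending-related actions from the receiving-related ones, which is what later allows the collective communication to act as a synchronization point.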
4 THE MODEL OF PARALLEL
COMPUTATION
For combining the behaviour of PSHM components to
a behaviour of the whole PSHM we make use of a
model of parallel computation introduced in [8].
4.1 COMMUNICATING PROCESSORS
Strongly inspired by the LogP model described in [9],
we have developed Communicating Processors (CP)
as model of parallel computation related to message
passing architectures. CP allows the consideration of
architecture dependent properties via parameters. Dif-
ferent communication mechanisms can be integrated
into the model corresponding to topical needs. The be-
havioural capabilities of single processes are described
in terms of sequences of abstract actions to have the
possibility of relating them to several interpretations
(for instance, to simulation time amount as basis for
partition valuation).
The Communicating Processors (CP) model is defined as triplet
$P = (P_P, P_A, P_C)$ where
• $P_P = \{P_1, \ldots, P_n\}$ is a set of (abstract) processors
  working asynchronously,
• $P_A = \{A_1, \ldots, A_n\}$ is a family of finite
  processor-bound action sets and
• $P_C = \{\mathcal{M}_1, \ldots, \mathcal{M}_l\}$ is a finite
  set of communication mechanisms. A communication mechanism is
  given as an ordered pair with a qualitative characteristic as
  first component and a (possibly empty) set of quantitative
  characteristics as second component. A qualitative
  characteristic comprises
  - the determination of actions related to the corresponding
    mechanism,
  - the determination of source/target relations within a set of
    involved processors,
  - the determination of synchronization conditions.
  A quantitative characteristic appears as a real function or
  constant, valuating a communication-related aspect.
In the context of CP, actions represent the exe-
cution of operations on the processors under consider-
ation. There is nothing said about their complexity.
The execution of an extensive high-level procedure can
be regarded as well as handling a microcode instruc-
tion.
Quantitative characteristics of communication mechanisms are
introduced within $P_C$ to allow communication properties of real
parallel architectures to flow into the CP model. For instance,
time bounds of special communication-related events (see latency,
gap, overhead within LogP) could be such characteristics. Another
example is given by functions yielding run-time estimations for
communication processes depending on the number of involved
processors, message lengths, network load situation and similar
arguments.
Within $P_C$, elementary point-to-point communication mechanisms
built on send and receive actions can be considered as well as
collective communication mechanisms with (usually) more than two
actions (on different processors) involved. CP behaviour is
determined by given (sequential) component behaviour and the
communication mechanisms integrated into the model. Concrete
behaviour appears as a sequence of action sets which are
interpreted as maximum sets of simultaneously active actions on
different processors. In [8], Unrestricted Parallel Behaviour
(UPB) is defined as a general framework which is restricted to
concrete CP behaviour by inclusion of synchronization conditions
according to the communication mechanisms considered.
4.2 PARALLEL CYCLE SIMULATION
In the following we outline the definition of Parallel Cycle
Simulation with respect to a PSHM
$M_\Pi = \{M_1, \ldots, M_n\}$ as introduced in (3.3). Connected
to this, a CP model

$$P = (P_P, P_A, P_C) \tag{4.1}$$

is constructed with $P_P = \{P_1, \ldots, P_n\}$, each $M_i$
corresponding to $P_i$. As processor-bound action sets $A_i$
those action sets are chosen which belong to $M_i$ according to
(3.4). Furthermore, we integrate exactly one communication
mechanism $\mathcal{M}$ into $P$ ($P_C = \{\mathcal{M}\}$).
$\mathcal{M}$ does not depend on the concrete PSHM under
consideration. It is related to the mpc_index command belonging
to the Message Passing Library of the AIX Parallel Environment.
This command is used for the implementation of interprocessor
communication at cycle boundaries during simulation with
parallelTEXSIM. The qualitative characteristic of $\mathcal{M}$
in the framework of $P$ is as follows:
• The only action engaged in $\mathcal{M}$ is $c$, which is an
  element of every action set $A_i$.
• The whole processor set $P_P$ is involved in $\mathcal{M}$.
  Each processor sends to each of the remaining processors
  individual messages (all-to-all personalized communication).
• $\mathcal{M}$ is a collective communication for which $n$
  actions $c$ (one at each processor) have to synchronize.
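
The data movement of an all-to-all personalized communication can be mimicked in plain Python. This is only a sequential stand-in for illustration: it does not call the Message Passing Library and makes no claim about the actual interface of its routines; the function name and the list-of-lists layout are assumptions.

```python
def all_to_all_personalized(outboxes):
    """Simulate the data exchange of an all-to-all personalized communication.

    outboxes[i][j] is the individual message processor i sends to
    processor j. The result satisfies inboxes[j][i] == outboxes[i][j],
    i.e. the exchange is a transposition of the message matrix.
    """
    n = len(outboxes)
    if any(len(row) != n for row in outboxes):
        raise ValueError("each processor needs exactly one slot per processor")
    return [[outboxes[i][j] for i in range(n)] for j in range(n)]


# Three simulator instances exchanging net values at a cycle boundary.
outboxes = [
    ['a0', 'a1', 'a2'],
    ['b0', 'b1', 'b2'],
    ['c0', 'c1', 'c2'],
]
inboxes = all_to_all_personalized(outboxes)
print(inboxes[1])
```

Since every processor needs the messages of all others before it can continue, the exchange naturally acts as the synchronization point described in the last item above.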
In addition, there is one quantitative characteristic of
$\mathcal{M}$. It is given by a function estimating the time
needed for a corresponding collective communication depending on
the number of processors involved and the message length, under
the assumption of an SP2 configuration with nodes connected via a
High-Performance Switch.
With $\mathcal{B}_i$ denoting the set of all Extended Sequential
Cycle Simulations belonging to $M_i$
($\mathcal{B}_i \subseteq A_i^+$) we set

$$\mathcal{B} = \{\mathcal{B}_1, \ldots, \mathcal{B}_n\}. \tag{4.2}$$

Then, every unrestricted parallel behaviour sequence $s$ of $P$
with respect to $\mathcal{B}$ which contains an action set
consisting of $n$ actions corresponding to the communication
action $c$ is called Parallel Cycle Simulation (PCS). In
Figure 3, a PCS related to a three-component PSHM, with
$s^i_\omega$ denoting sequences of ${A^i_\omega}^+$ and $i$
identifying PSHM components, is depicted schematically. A
sub-sequence of SCS structure is shaded grey. Areas represented
in black are related to pre- or post-communication sub-sequences.
FIGURE 3: PARALLEL CYCLE SIMULATION
5 PARTITION VALUATION
In the following, partitions are related to run-time esti-
mations for PCS realized by parallelTEXSIM. Actions
are regarded as basic elements consuming simulation
time.
Let us consider an arbitrarily chosen partition $\Pi$ of a SHM
$M$. This implies a PSHM $M_\Pi$ (3.3) together with a family
$\mathcal{B}$ (4.2) of sets of component behaviour sequences
(ESCSs), assuming a fixed rule of assigning actions to boxes.
Furthermore, a CP model $P$ (4.1) can be constructed which,
depending on $\mathcal{B}$, delivers a set of PCSs with respect
to $M_\Pi$ as its behaviour. From each PCS $s$, for an arbitrary
$C \in \Pi$, a "local" behaviour sequence $s^C_{seq}$ can be
deduced with

$$s^C_{seq} = s^C_{cycle} \parallel s^C_{pre} \parallel (c) \parallel s^C_{post} \tag{5.1}$$

according to (3.5) and (3.7). We estimate the component
simulation time $t^C_{seq}$ on the basis of assigning execution
time values to the corresponding sub-sequences, omitting possible
idle intervals between their execution:

$$t^C_{seq} = t^C_{cycle} + t^C_{pre} + t_c + t^C_{post}.$$

Due to the synchronization effect of $c$ we obtain an estimation
of the PCS execution time $t^s_{par}$ as follows:

$$t^s_{par} = \max_{C \in \Pi} \left( t^C_{cycle} + t^C_{pre} + t^C_{post} \right) + t_c. \tag{5.2}$$
$t^C_{cycle}$ is supposed to be determined by the evaluation of
logical boxes and latch updates; global input and output actions
are neglected. We set

$$t^C_{cycle} = t^M_B \, \left| M^C_E \cup M^C_L \right|, \tag{5.3}$$

with $M^C_E$ and $M^C_L$ defined as in (3.1) and $t^M_B$
representing an average execution time for actions of
$A^C_E \cup A^C_L$. The parameter $t^M_B$ is obtained from
(sequential) pre-simulation of the model $M$.
Pre- and post-processing of interprocessor communication is
related to actions of $A^C_{O,L}$ and $A^C_{I,L}$. The amount of
time for each such action is supposed to be the same, expressed
by the parameter $t^M_{comm}$, which is determined by (parallel)
pre-simulation according to a reference partition of $M$. With
$M^C_{O,L}$ and $M^C_{I,L}$ as described in the context of (3.2),
we get

$$t^C_{pre} + t^C_{post} = t^M_{comm} \, \left| M^C_{O,L} \cup M^C_{I,L} \right|. \tag{5.4}$$
Finally, we have to consider $t_c$, which is estimated by a
function given as quantitative characteristic of the collective
communication mechanism $\mathcal{M}$ integrated into $P$. We
suppose

$$t_c = m \, (t_a + t_b \, n) \tag{5.5}$$

with $t_a$, $t_b$ being architecture-dependent parameters, $m$
denoting the maximum number of values which have to be
transferred between any pair of processors and $n = |\Pi|$.
Taking (5.3), (5.4) and (5.5) together, we have completely
determined our estimation of $t^s_{par}$ under (5.2). The result
is the same for all possible PCS sequences related to $M_\Pi$,
because corresponding component behaviour sequences only differ
in the order of actions within the sub-sequences occurring in
(5.1). This order is irrelevant for the particular estimations
given above. Due to the time-driven nature of the simulation
considered we can immediately derive run-time estimations for
complex simulation processes covering a sequence of cycles from
$t^s_{par}$. With such run-time predictions we are able to avoid
expensive sub-model building and subsequent simulation runs for
partitions resulting in bad simulation performance. The problem
of early performance prediction in general is considered, for
instance, in [10].
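
The valuation defined by (5.2) to (5.5) can be summarized in a few lines of Python. The sketch below is one possible transcription, not the implementation used with parallelTEXSIM: the class and function names are hypothetical, each component is reduced to the two set sizes the formulas need, and the numeric parameters in the example are invented integer time units rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    n_logic_and_latch: int  # |M_E^C ∪ M_L^C|
    n_comm_boxes: int       # |M_{O,L}^C ∪ M_{I,L}^C|


def t_cycle(comp, t_box):
    """(5.3): workload of logic evaluation and latch updates."""
    return t_box * comp.n_logic_and_latch


def t_pre_post(comp, t_comm):
    """(5.4): pre- and post-processing of communication."""
    return t_comm * comp.n_comm_boxes


def t_collective(m, n, t_a, t_b):
    """(5.5): collective communication among n processors."""
    return m * (t_a + t_b * n)


def t_par(components, m, t_box, t_comm, t_a, t_b):
    """(5.2): estimated time of one parallel simulation cycle."""
    n = len(components)
    slowest = max(t_cycle(c, t_box) + t_pre_post(c, t_comm)
                  for c in components)
    return slowest + t_collective(m, n, t_a, t_b)


# Hypothetical two-component partition and parameter values.
parts = [Component(1000, 40), Component(1200, 10)]
estimate = t_par(parts, m=25, t_box=2, t_comm=5, t_a=3, t_b=1)
print(estimate)         # one cycle
print(1000 * estimate)  # 1000 cycles, assuming every cycle costs the same
```

In the example the second component dominates the maximum (2450 against 2200 time units), and the collective communication adds 125 units, so balancing the workload term matters far more here than reducing the number of communication boxes.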
6 EXPERIMENTAL RESULTS
We have integrated the above partition valuation method into our
two-level partitioning scheme outlined in [4]. At the first
level, fast pre-partitioning algorithms are applied to reduce
problem complexity by concentrating cones into super-cones. Here
we consider the algorithm of Mueller-Thuns et al. (MT) described
in [6] and our STEP algorithm. Both algorithms yield partitions
which are balanced with respect to the number of cones within the
partition components. There is no explicit partition valuation at
the first level of our scheme. At the second level, Evolutionary
Algorithms (EAs) are applied, starting with an initial population
of individuals representing encoded partitions. These initial
partitions are constructed by applying the MOCC algorithm
sketched in [4]. Within the EAs, partition valuation appears as
realization of the fitness function.
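
How a valuation function can act as fitness may be illustrated by a deliberately small example. The Python sketch below is not the EA of [4]: it substitutes a simple (1+1) evolutionary scheme without a population, uses only the workload term of the valuation (communication is ignored), and all names and the toy data are assumptions.

```python
import random


def cost(assign, supercones, n_parts, t_box=1):
    """Simplified valuation: maximum component workload only.

    assign[k] is the component index of super-cone k; boxes shared by
    super-cones of the same component are counted once.
    """
    loads = []
    for part in range(n_parts):
        boxes = set()
        for idx, p in enumerate(assign):
            if p == part:
                boxes |= supercones[idx]
        loads.append(len(boxes))
    return t_box * max(loads)


def one_plus_one_ea(supercones, n_parts, steps=200, seed=0):
    """Move one super-cone per step; keep the child if it is not worse."""
    rng = random.Random(seed)
    best = [i % n_parts for i in range(len(supercones))]  # round-robin start
    best_cost = cost(best, supercones, n_parts)
    for _ in range(steps):
        child = list(best)
        child[rng.randrange(len(child))] = rng.randrange(n_parts)
        child_cost = cost(child, supercones, n_parts)
        if child_cost <= best_cost:
            best, best_cost = child, child_cost
    return best, best_cost


# Toy super-cones given as sets of box identifiers (overlaps at 3 and 8).
supercones = [{1, 2, 3}, {3, 4}, {5, 6, 7, 8}, {8, 9}]
start_cost = cost([0, 1, 0, 1], supercones, 2)
assignment, final_cost = one_plus_one_ea(supercones, 2)
print(start_cost, final_cost)
```

Because a child is accepted only if its valuation is not worse, the final cost can never exceed the cost of the starting partition; keeping overlapping super-cones together is rewarded automatically, since shared boxes are then counted once.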
In the following we present results of simulat-
ing models of real processor structures with parallel-
TEXSIM on an IBM SP2 system. Measured run-time
is compared with estimated run-time (5.2) resulting
from partition valuation (related to one clock-cycle in
both cases). We consider essential parts of two proces-
sor designs: IBM S/390 G1 and PowerPC 604.
The S/390 G1 model contains 181 418 boxes yielding 22 034 cones.
Our prediction for the sequential simulation run-time of one
clock-cycle amounts to 14.513 ms in comparison to a measured
value of 15.393 ms. The following two tables give the predicted
and measured (by simulation) parallel run-times and the
corresponding speedups for two different pre-partitioning
procedures and three different block numbers:
Pre-partitioning   MT to 500 super-cones    STEP to 500 super-cones
algorithm          predicted   measured     predicted   measured
                   run-time    run-time     run-time    run-time
                   [ms]        [ms]         [ms]        [ms]
2 processors       10.02       10.60         9.01        9.99
3 processors        9.81       10.48         7.42        8.12
4 processors        8.26        8.89         6.22        6.92

Pre-partitioning   MT to 500 super-cones    STEP to 500 super-cones
algorithm          predicted   measured     predicted   measured
                   speedup     speedup      speedup     speedup
2 processors       1.45        1.44         1.61        1.54
3 processors       1.48        1.47         1.95        1.90
4 processors       1.76        1.73         2.33        2.22
The PowerPC 604 model contains 319 543 boxes yielding 42 176
cones. The predicted sequential run-time related to one
clock-cycle amounts to 38.345 ms in comparison to a measured
value of 36.382 ms. The following two tables contain experimental
results organized as in the case above.
Pre-partitioning   MT to 500 super-cones    STEP to 500 super-cones
algorithm          predicted   measured     predicted   measured
                   run-time    run-time     run-time    run-time
                   [ms]        [ms]         [ms]        [ms]
2 processors       25.24       24.47        22.25       21.20
3 processors       21.21       21.12        17.75       17.41
4 processors       19.37       19.06        14.38       13.82

Pre-partitioning   MT to 500 super-cones    STEP to 500 super-cones
algorithm          predicted   measured     predicted   measured
                   speedup     speedup      speedup     speedup
2 processors       1.52        1.49         1.72        1.72
3 processors       1.81        1.72         2.16        2.09
4 processors       1.98        1.91         2.67        2.63
The close agreement between predicted and measured run-times for
both processor models is encouraging for applying this method of
run-time estimation (5.2) to the valuation of partitions of
current processor models with substantially more boxes.
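
The relation between the run-time and speedup tables can be checked with a few lines of arithmetic. The Python sketch below uses only figures reported above for the S/390 G1 model with STEP pre-partitioning; the helper names are hypothetical, and the speedup is assumed to be the sequential run-time divided by the parallel run-time of the same kind (measured by measured).

```python
def speedup(sequential_ms, parallel_ms):
    """Speedup as ratio of sequential to parallel run-time per cycle."""
    return sequential_ms / parallel_ms


def relative_error(predicted, measured):
    """Signed relative deviation of the prediction from the measurement."""
    return (predicted - measured) / measured


SEQ_MEASURED_MS = 15.393  # S/390 G1, one clock-cycle

# processors -> (predicted ms, measured ms), STEP to 500 super-cones
s390_step = {2: (9.01, 9.99), 3: (7.42, 8.12), 4: (6.22, 6.92)}

for n, (pred, meas) in s390_step.items():
    print(n,
          round(speedup(SEQ_MEASURED_MS, meas), 2),
          round(100 * relative_error(pred, meas), 1))
```

Under this assumption the computed measured speedups reproduce the tabulated values 1.54, 1.90 and 2.22, and the predictions for this configuration lie roughly 9 to 10 percent below the measured run-times.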
7 CONCLUDING REMARKS
We have presented a new method of parameterized
partition valuation for application in model partition-
ing in the context of parallel logic simulation based on
the clock-cycle algorithm over loosely-coupled proces-
sor systems. It represents a combination of workload
and interprocessor communication aspects and allows
performance estimation of corresponding simulation.
We have developed our method in the context of par-
allelTEXSIM. First experiments with models of real
processor structures using this simulator are encourag-
ing for further development of our valuation method
and its integration into new partitioning algorithms.
Our technique is not bound to a special simulator.
Currently, work is in progress to optimize algorithms for the
realization of partition valuation. In future work we want to
distinguish classes of actions according to their contribution to
simulation run-time. Furthermore, heterogeneous target
architectures for parallel simulation will be a subject of our
investigations.
ACKNOWLEDGEMENTS
This work was partly supported by Deutsche
Forschungsgemeinschaft (DFG) under grant Sp 487/1-2. We would
like to thank K. Lamb, H.-W. Anderson (both at IBM Böblingen),
D. Zike and W. Roesner (both at IBM Austin, TX) for providing SP2
access, software support and many valuable discussions. Moreover,
we are grateful to our students D. Döhler, R. Reilein,
H. Hennings and Th. Siedschlag for their efforts in algorithm
implementation and testing.
REFERENCES
[1] W. G. Spruth, The Design of a Microprocessor (New York,
Berlin, Heidelberg: Springer, 1989).
[2] D. Döhler, Entwurf und Implementierung eines parallelen
Logiksimulators auf Basis von TEXSIM (Leipzig: University of
Leipzig, Department of Mathematics and Computer Science, Diploma
Thesis, 1996).
[3] W. Roesner, TEXSIM for loosely coupled multi-processors -
performance estimates, sizing (Böblingen: IBM, 1993).
[4] K. Hering, R. Haupt, and T. Villmann, Hierarchical strategy
of model partitioning for VLSI-design using an improved mixture
of experts approach, Proc. of 10th Workshop on Parallel and
Distributed Simulation, 1996, 106-113.
[5] N. Manjikian, High performance parallel logic simulation on a
network of workstations (Waterloo: University of Waterloo,
Department of Electrical and Computer Engineering and Computer
Communications Network Group, Technical Report CCNG T-220, 1992).
[6] R. B. Mueller-Thuns, D. G. Saab, R. F. Damiano, and
J. A. Abraham, VLSI logic and fault simulation on general purpose
parallel computers, IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 12(3), 1993, 446-460.
[7] C. J. Alpert and A. B. Kahng, Recent directions in netlist
partitioning: a survey, INTEGRATION, the VLSI Journal, 19, 1995,
1-81.
[8] K. Hering, Parallel cycle simulation (Leipzig: University of
Leipzig, Institute of Computer Science, Technical Report 13,
1996).
[9] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser,
E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a
realistic model of parallel computation, 4th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming,
1993, 1-12.
[10] Z. Xu and K. Hwang, Early prediction of MPP performance: The
SP2, T3D and Paragon experiences, Parallel Computing, 22(7),
1996, 917-942.
