Cycle-based simulation on loosely-coupled systems by Döhler, Denis et al.
CYCLE-BASED SIMULATION ON
LOOSELY-COUPLED SYSTEMS
Denis Döhler, Associate Member, IEEE
IBM Germany, S/390 Development
P.O. Box 1380, 71003 Böblingen, Germany
doehler@de.ibm.com
Klaus Hering, Member, IEEE
Wilhelm G. Spruth, Senior Member, IEEE
Department of Computer Science, University of Leipzig
Augustusplatz 10-11, 04109 Leipzig, Germany
{hering, spruth}@informatik.uni-leipzig.de
Abstract: Logic simulation is a crucial
verification task in processor design. Aiming at
a significant acceleration of system simulation, we
have parallelized IBM's cycle-based simulator
TEXSIM. The resulting parallelTEXSIM has already
been employed successfully in simulating
S/390 architectures on IBM SP systems. Here
we present parallelTEXSIM together with its
model partitioning environment.
I. Introduction
Verification of processor designs is a real challenge
due to rapidly growing design complexity. Ensuring
correctness of logic designs containing millions
of gates requires a large amount of simulation. For
system simulation at register transfer level (RTL)
and gate level (GL) it has proven to be good
practice to separate timing analysis from functional
verification (logic simulation). Applying
static timing verifiers and cycle-based simulators
as tools targeted at the corresponding tasks
is advantageous [9]. For functional verification,
cycle-based simulators are significantly faster
than event-driven ones, which offer rich functionality
at the cost of performance and memory utilization [8].
Hardware accelerators as alternatives
to software simulators show high performance, but
they are very expensive and difficult to adapt to
changing simulation techniques.
Cycle-based simulation (CBS) can be ab-
stractly represented by the ALTER-CLOCK-
RETRIEVE pattern (Fig. 1), where CLOCK
means simulating a number of cycles, ALTER
stands for setting values of model components
from outside and RETRIEVE is related to the
check of such values. Formal description ap-
proaches for CBS can be found in [2] and [3].
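The ALTER-CLOCK-RETRIEVE pattern can be illustrated with a minimal sketch. The class CycleSimulator, its method names and the toy counter model are hypothetical and are not part of TEXSIM; they only show how a control program alternates the three steps.

```python
from typing import Callable, Dict

State = Dict[str, int]


class CycleSimulator:
    """Toy cycle-based simulator: the whole state is updated once per cycle."""

    def __init__(self, next_state: Callable[[State], State], initial: State):
        self._next_state = next_state
        self._state = dict(initial)

    def alter(self, name: str, value: int) -> None:
        """ALTER: set the value of a model component from outside."""
        if name not in self._state:
            raise KeyError(name)
        self._state[name] = value

    def clock(self, cycles: int = 1) -> None:
        """CLOCK: simulate a number of cycles."""
        for _ in range(cycles):
            self._state = self._next_state(self._state)

    def retrieve(self, name: str) -> int:
        """RETRIEVE: read a value so that it can be checked."""
        return self._state[name]


def counter_with_enable(s: State) -> State:
    # Hypothetical model: a 4-bit counter that counts while 'enable' is 1.
    return {"enable": s["enable"], "count": (s["count"] + s["enable"]) % 16}


sim = CycleSimulator(counter_with_enable, {"enable": 0, "count": 0})
sim.alter("enable", 1)          # ALTER
sim.clock(5)                    # CLOCK
result = sim.retrieve("count")  # RETRIEVE
```

In this sketch almost all time is spent in clock(), which mirrors the situation where the CLOCK step dominates the run time of a simulation.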
Fig. 1: ALTER-CLOCK-RETRIEVE pattern
With the availability of commercial programs
such as Cyclone (Synopsys) or SpeedSim/3 (SpeedSim)
in recent years, CBS has become more and
more popular in industry. IBM has been doing
CBS for processor verification at least since the
early 1980s. Work presented in this paper is related
to the IBM internal cycle-based Texas Simulator
(TEXSIM) developed by D. S. Zike. Since the
production release of its first version in 1990, TEXSIM
has become the most important logic simulator
in the various IBM processor developments. It
defines a client-server architecture offering a variety
of interfaces for simulation control by different
kinds of user programs. Simulation models are created
by a simulator-specific model compiler out of
a structural circuit description which is generated
by a HDL compiler in the context of the IBM De-
sign Automation Database (DA DB). Mixed level
models simultaneously containing RTL and GL
constructs are allowed. Originally, TEXSIM has
been used for regression runs following a random
simulation strategy based on a large number of
mainly automatically generated small test cases
(machine code or microcode sequences [1]) which
are (automatically again) distributed over a net-
work of workstations. For regression runs related
to complete processor models at system level [7]
the evaluation of complex test cases (representing
parts of boot processes, for instance) is very de-
sirable. To cope with the tremendous amount of
time required by such simulations we have par-
allelized TEXSIM based on the assumption that
corresponding simulation processes should spend
very little time in the ALTER and RETRIEVE
steps relative to the CLOCK step.
Our parallelization approach makes use of
model inherent parallelism (replicated worker
principle). A parallelTEXSIM cycle simulation appears
as a co-operation of several simulator instances,
each evaluating a statically defined part of the
original model, on a loosely-coupled
processor system. Choosing cone-based model
partitioning gives the possibility of leaving the ker-
nel of the optimized TEXSIM simulation engine
unchanged. The performance of the resulting
parallel simulations strongly depends on the preceding
model partitioning. Because a single partitioning
is reused for several extremely time-consuming
simulations of one and the same model,
we can afford complex partitioning algorithms that
consider both component communication and workload
aspects.
In Section II, our static model partitioning
strategy for parallel cycle-based simulation is out-
lined. Then, parallelTEXSIM is introduced in Sec-
tion III focussing on its general structure and per-
formance relevant issues. Experimental results re-
lated to parallel simulations of a real processor
model belonging to the IBM S/390 architecture
are presented in Section IV.
II. Model Partitioning
A. Overview
For parallelTEXSIM simulations a set of TEXSIM
models representing parts of an original design
has to be provided together with a cross-reference
list and one signal-cut list per model. The cross-
reference list contains all elemental design com-
ponents which are accessible from outside during
simulation and information about their distribu-
tion to models. Signal-cut lists comprise model-
related input and output points indicating com-
munication relations to other models. In Fig. 2
the process of model-related input generation for
parallelTEXSIM is depicted with protos embodying
DA DB objects for the structural representation of
RTL / GL designs.
Fig. 2: Model-related parallelTEXSIM input
Partitioning starts from a proto generated by a
HDL compilation. The number of resulting protos
can be either given in advance or fixed by the
partitioning process itself. For every proto delivered
a model building process realized by texbld (the
same as for sequential TEXSIM simulation) fol-
lows. The parallel simulator automatically adapts
to the number of models generated.
B. Partitioning Strategy
As a basis for the development and analysis of proto
partitioning algorithms we have introduced a formal
Structural Hardware Model (SHM) [5] embodying
a directed graph whose vertices are
interpreted as wires, latches, elements of
combinational logic or input/output elements,
respectively. For each proto describing a synchronous
design a corresponding SHM can be constructed
easily. The problem of proto partitioning
can then be considered as a problem of SHM
partitioning. For a given SHM M, the fan-in cones
with heads representing latches or outputs form
the set Co(M) of basic building blocks for partitions
of M. The latter appear as partitions of
Co(M) in a mathematical sense. In corresponding
parallel simulations, inter-processor communication
between simulator instances is restricted to
cycle boundaries and can be realized by collective
communication operations. Obviously, cones may
overlap. Assigning overlapping cones to different
partition components results in replication of
simulation work.
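To make the notions of fan-in cone and replicated work concrete, the following sketch computes cones on a small directed graph. It is a simplified reading of the SHM, not its formal definition from [5]: the graph encoding, the function names and the toy design are assumptions, and a cone is taken to be its head plus all logic reached backwards before hitting a latch or primary input.

```python
from typing import Dict, Hashable, List, Set

Vertex = Hashable


def fan_in_cone(head: Vertex,
                preds: Dict[Vertex, List[Vertex]],
                boundary: Set[Vertex]) -> Set[Vertex]:
    """Head plus all vertices reached backwards, stopping at boundary vertices."""
    cone = {head}
    stack = list(preds.get(head, []))
    while stack:
        v = stack.pop()
        if v in cone or v in boundary:
            continue  # already visited, or a latch/input that ends the cone
        cone.add(v)
        stack.extend(preds.get(v, []))
    return cone


def component_work(heads: List[Vertex],
                   cones: Dict[Vertex, Set[Vertex]]) -> Set[Vertex]:
    """Vertices one partition component has to evaluate."""
    work: Set[Vertex] = set()
    for h in heads:
        work |= cones[h]
    return work


def replicated_work(partition: List[List[Vertex]],
                    cones: Dict[Vertex, Set[Vertex]]) -> int:
    """Number of vertex evaluations performed more than once per cycle."""
    total = sum(len(component_work(c, cones)) for c in partition)
    distinct = len(set().union(*cones.values()))
    return total - distinct


# Hypothetical design: inputs a, b; gates g1, g2, g3; latches L1, L2.
preds = {"g1": ["a", "b"], "g2": ["g1", "L1"], "g3": ["g1"],
         "L1": ["g2"], "L2": ["g3"]}
boundary = {"a", "b", "L1", "L2"}
cones = {h: fan_in_cone(h, preds, boundary) for h in ("L1", "L2")}

split = [["L1"], ["L2"]]   # overlapping cones on different components
merged = [["L1", "L2"]]    # both cones on one component
```

Here the two cones share gate g1, so the split partition evaluates g1 twice per cycle, while the merged one does not.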
The SHM partitioning problem can be stated
as an NP-hard combinatorial optimization problem.
Using a formal model of Parallel Cycle Simulation
(PCS) [3] we have developed a parameterized cost
function [4] estimating upper run-time bounds for
simulations corresponding to a given SHM parti-
tion. This partition valuation function takes into
consideration both workload aspects and inter-
processor communication overhead.
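The general shape of such a valuation can be sketched as follows. This is not the cost function defined in [4]; the parameter names, the linear communication term and the numbers are assumptions chosen only to show how a workload term and a communication term might be combined into a per-cycle bound.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CostParams:
    t_eval: float     # assumed time to evaluate one model element per cycle
    t_startup: float  # assumed fixed cost of one collective communication
    t_signal: float   # assumed cost per transferred cut signal


def estimated_cycle_time(workloads: Sequence[int],
                         cut_signals: Sequence[int],
                         p: CostParams) -> float:
    """Illustrative upper bound for the time of one parallel cycle.

    workloads[i]   : number of elements evaluated by component i
    cut_signals[i] : number of cut signals sent by component i
    """
    # The slowest component determines the compute time of a cycle.
    compute = p.t_eval * max(workloads)
    # A single component needs no inter-processor communication.
    if len(workloads) > 1:
        communicate = p.t_startup + p.t_signal * max(cut_signals)
    else:
        communicate = 0.0
    return compute + communicate


params = CostParams(t_eval=1.0, t_startup=2.0, t_signal=0.5)
two_way = estimated_cycle_time([3, 3], [1, 1], params)
one_way = estimated_cycle_time([5], [0], params)
```

With these made-up parameters the two-way partition of a tiny model is slower than the sequential run, which illustrates why both terms have to be weighed against each other.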
Due to the complexity of models representing
complete processor structures we have introduced
a hierarchical partitioning strategy [5] allowing
combination and competition of partitioning al-
gorithms and a special merging method of their
results called superposition. Within a two-level
partitioning scheme, fast pre-partitioning (reducing
problem complexity) is followed by more expensive
algorithms working on the basis of hypergraph
structures. Specifically, the employment of
Evolutionary Algorithms at the second level has
proven successful.
C. Partitioning Tools
A realization of the two-level partitioning scheme
mentioned above is given by PAMB (Partitioning
And Model Build), which has been developed
by R. Haupt. It comprises a C library
(containing partitioning algorithms, functions for
proto handling, model build and the creation of
hypergraph structures) together with a script-
based application framework in the context of the
DA DB. PAMB will be integrated in our par-
titioning environment parallelMAP (Model Analy-
sis and Partitioning). The latter one embodies a
client-server architecture which allows the implementation
of parallel partitioning algorithms on
a message-passing basis.
III. The Parallel Simulator
A. General Structure
We have implemented parallelTEXSIM based on
the partitioning framework outlined above. A
production release of the simulator is employed
for regression runs in the IBM S/390 processor
development, allowing significantly larger test
cases. From the user's point of view, the parallel
program offers the same options and interfaces
as the sequential one. Designed to run on
loosely-coupled systems, it represents a master-
slave structure (Fig. 3) with component commu-
nication via message-passing. A master simu-
lator instance (mTEXSIM) derived from sequen-
tial TEXSIM provides Application Programming
Interfaces (APIs) to the environment and con-
trols a set of slave simulator instances (sTEXSIM).
These instances contain the original TEXSIM sim-
ulation engine encapsulated within a communica-
tion shell. parallelTEXSIM permits parallelization
of user programs for simulation control by assign-
ing corresponding program instances to simulator
instances. The software platform currently sup-
ported is IBM AIX/6000 with the IBM Parallel
Environment [6]. This allows networks of IBM
RS/6000 workstations and IBM RS/6000 SP ma-
chines as target hardware.
Fig. 3: Parallel simulator structure (diagram labels: Client, API,
mTEXSIM, sTEXSIM, Communication Module, model, block,
Dynamic Linking, Interconnection Network)
B. Facility Management
For referencing model elements (signals or arrays)
from outside during simulation, TEXSIM provides
the concept of facilities which are represented
by bit matrices. Model partitioning for parallel
simulation distributes the accessible elements
of the original model to different models
(as described by the cross-reference list mentioned in
Section II), possibly with replication and
splitting. Therefore the facility notion has been
extended to a facility hierarchy to ensure efficient
element referencing within parallelTEXSIM.
We have introduced global, parallel and local
facilities. A local facility is a usual facility handled
by a slave simulator instance. The global facilities
serve to support the usual facilities on a master
simulator instance as if they were referenced on a
non-parallel simulator. In fact they are vectors of
parallel facilities and those are vectors containing
references to local facilities.
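The hierarchy can be pictured with a small data-structure sketch. All class and field names are hypothetical. In particular, it is an assumption of this sketch that a parallel facility groups the replicas of one piece of an element and that a global facility lists the pieces produced by splitting; facilities are also reduced to flat bit vectors instead of bit matrices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalFacility:
    """Facility held by one slave simulator instance."""
    slave_id: int
    name: str
    bits: List[int]


@dataclass
class ParallelFacility:
    """Vector of references to local facilities (assumed: replicas)."""
    replicas: List[LocalFacility]

    def width(self) -> int:
        return len(self.replicas[0].bits)

    def read(self) -> List[int]:
        return list(self.replicas[0].bits)

    def write(self, bits: List[int]) -> None:
        for r in self.replicas:  # keep all replicas consistent
            r.bits[:] = bits


@dataclass
class GlobalFacility:
    """Vector of parallel facilities (assumed: pieces after splitting)."""
    name: str
    parts: List[ParallelFacility]

    def read(self) -> List[int]:
        out: List[int] = []
        for p in self.parts:
            out.extend(p.read())
        return out

    def write(self, bits: List[int]) -> None:
        if len(bits) != sum(p.width() for p in self.parts):
            raise ValueError("width mismatch")
        pos = 0
        for p in self.parts:
            w = p.width()
            p.write(bits[pos:pos + w])
            pos += w


# Hypothetical 4-bit bus: low half replicated on slaves 0 and 1, high half on slave 2.
lo0 = LocalFacility(0, "bus[0:1]", [0, 0])
lo1 = LocalFacility(1, "bus[0:1]", [0, 0])
hi2 = LocalFacility(2, "bus[2:3]", [0, 0])
bus = GlobalFacility("bus", [ParallelFacility([lo0, lo1]), ParallelFacility([hi2])])
bus.write([1, 0, 1, 1])
```

A client that alters or retrieves bus sees a single facility, while the write is fanned out to three local facilities on three slave instances.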
Moreover, we have introduced communication
facilities to achieve fast access to data structures
which are involved in collective communication op-
erations during parallel cycle simulation. These
data structures are related to cut signals which
arise from model partitioning. Cut signals are
made known to parallelTEXSIM via model-related
signal-cut lists (cf. Section II).
C. Component Co-operation
Within parallelTEXSIM three types of collective
message-passing are distinguished based on the
components involved: "master and one slave",
"master and all slaves" and "all slaves". Pro-
cesses using communication operations of the cor-
responding type are for instance facility referenc-
ing, creating an initial status protocol and simu-
lating cycles in parallel.
Fig. 4: Parallel clock-cycle implementation (steps: CLOCK, GET, TRANSFER, PUT)
The main function of parallelTEXSIM is the
realization of the parallel clock-cycle as shown
in Fig. 4. During CLOCK one cycle is sim-
ulated on all slave instances over the corre-
sponding models without interaction. In the
GET step, these instances independently provide
model-related global output values within commu-
nication vectors. TRANSFER embodies a person-
alized all-to-all communication between the slave
instances. Thereby, communication vector com-
ponents are transferred between instances accord-
ing to their vector position. This results in new
model-related communication vectors. In the final
PUT step, all slave instances (again independently
of one another) update facilities based on the
corresponding communication vectors.
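The four steps can be sketched in a single process, with the personalized all-to-all exchange modelled as a transposition of the send vectors. The Slave class, the two-latch toy model and all names are hypothetical; an actual implementation would run the instances on separate processors and use a collective operation of the message-passing layer for TRANSFER.

```python
from typing import Callable, List, Optional

Logic = Callable[[int, List[int]], int]


class Slave:
    """Toy slave instance holding one latch and the last known cut inputs."""

    def __init__(self, rank: int, n: int, logic: Logic):
        self.rank, self.n, self.logic = rank, n, logic
        self.q = 0
        self.cut_inputs = [0] * n  # latch value last received from each instance

    def clock(self) -> None:
        """CLOCK: simulate one cycle without any interaction."""
        self.q = self.logic(self.q, self.cut_inputs)

    def get(self) -> List[Optional[int]]:
        """GET: build the communication vector, one component per destination."""
        return [self.q if dst != self.rank else None for dst in range(self.n)]

    def put(self, received: List[Optional[int]]) -> None:
        """PUT: update local cut inputs from the received vector."""
        for src, value in enumerate(received):
            if value is not None:
                self.cut_inputs[src] = value


def transfer(send: List[List[Optional[int]]]) -> List[List[Optional[int]]]:
    """TRANSFER: component j of sender i becomes component i of receiver j."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def parallel_clock(slaves: List[Slave]) -> None:
    for s in slaves:
        s.clock()
    received = transfer([s.get() for s in slaves])
    for s, vec in zip(slaves, received):
        s.put(vec)


# Two-latch ring: q0 takes the inverse of q1, q1 takes q0.
slaves = [Slave(0, 2, lambda q, cut: 1 - cut[1]),
          Slave(1, 2, lambda q, cut: cut[0])]
trace = [(slaves[0].q, slaves[1].q)]
for _ in range(4):
    parallel_clock(slaves)
    trace.append((slaves[0].q, slaves[1].q))
```

Because values cross instance boundaries only between cycles, each slave can evaluate its model part with an unchanged sequential engine.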
Tab. 1. System simulation for an IBM S/390 processor
model with TEXSIM / parallelTEXSIM
(simulation performance in cps)
Test case   1-way   4-way   8-way   12-way
pa013       7.60    22.46   30.43   34.67
PG060       7.75    22.42   31.76   35.78
vb010       7.70    23.71   33.47   38.93
IV. Experimental Results
In the following, we present performance values
for parallelTEXSIM running on an IBM SP2 sys-
tem (160 MHz Thin Nodes, 1 GByte RAM per
node, 97 MByte/s High Performance Switch) and
simulating a processor model of the IBM S/390
architecture containing 1 Processing Unit, 4 L2-
Caches, 8 Bus Switching Networks and 8 Stor-
age Controllers. The model consists of about 2.7
million elements at RTL / GL. We have chosen
a representative set of test cases which are
microcode sequences of different size (pa013: 294
cycles, PG060: 7464 cycles, vb010: 31466 cycles).
Parallel simulations based on model partitions
with 4, 8 and 12 components are considered
in comparison with corresponding sequential (1-
way) TEXSIM simulations (Tab. 1). Performance
is measured in cycles per second (cps).
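Speedup and parallel efficiency follow directly from the values in Tab. 1. The short script below derives them. The definitions used (speedup as the ratio of parallel to sequential cps, efficiency as speedup divided by the number of components) are the standard ones and are an addition of this sketch, not figures reported above.

```python
from typing import Dict, Tuple

# Simulation performance in cycles per second, taken from Tab. 1.
cps: Dict[str, Dict[int, float]] = {
    "pa013": {1: 7.60, 4: 22.46, 8: 30.43, 12: 34.67},
    "PG060": {1: 7.75, 4: 22.42, 8: 31.76, 12: 35.78},
    "vb010": {1: 7.70, 4: 23.71, 8: 33.47, 12: 38.93},
}


def speedup_and_efficiency(row: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Map each parallel configuration to (speedup, efficiency)."""
    base = row[1]  # sequential (1-way) TEXSIM performance
    return {p: (v / base, v / base / p) for p, v in row.items() if p != 1}


for name, row in cps.items():
    for p, (s, e) in speedup_and_efficiency(row).items():
        print(f"{name} {p:2d}-way: speedup {s:.2f}, efficiency {e:.2f}")
```

For these test cases the speedups lie between about 2.9 (4-way) and 5.1 (12-way), and the efficiency decreases as the number of partition components grows.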
V. Concluding Remarks
We have presented parallelTEXSIM, a parallel logic
simulator running on loosely-coupled systems. It
substantially accelerates time-consuming system
simulation processes in the IBM S/390 processor
development. Simulation performance strongly
depends on the preceding model partitioning. Our
partitioning environment allows consideration of
workload and inter-processor communication aspects
via parameterized partition valuation functions.
In future work we will use our experiences with
parallelTEXSIM for the parallelization of the IBM
Multivalue Simulator MVLSIM. We will concen-
trate on problems related to multiply referenced
sub-designs, dynamic load balancing in multiuser
environments and parallelization of user programs
controlling simulation.
Acknowledgment
Work presented was supported by Deutsche
Forschungsgemeinschaft (DFG) under grant
Sp 487/1-1. We are grateful to K. Lamb et al.
(IBM Laboratories Böblingen) and W. Roesner
et al. (IBM Laboratories Austin (TX)) for valuable
assistance. Special thanks to the research
workers and students involved in the project at
the University of Leipzig.
References
[1] A. Aharon, A. Bar-David, B. Dorfman, E. Gofman,
M. Leibowitz, and V. Schwartzburd.
Verification of the IBM RISC System/6000 by
a dynamic biased pseudo-random test program
generator. IBM Systems Journal, 30(4), pages
1-81, 1991.
[2] D. Döhler. Entwurf und Implementierung
eines parallelen Logiksimulators auf Basis von
TEXSIM. Diploma Thesis, Univ. of Leipzig,
Dept. of Mathematics, 1996.
[3] K. Hering. Parallel Cycle Simulation. Techni-
cal Report 13(96), Dept. of Computer Science,
Univ. of Leipzig, 1996.
[4] K. Hering, R. Haupt, and U. Petri. Parameterized
Partition Valuation for Parallel Logic
Simulation. In Proc. of the Conference on
Parallel and Distributed Computing and Networks
(PDCN'97), pages 144-150, 1997.
[5] K. Hering, R. Haupt, and T. Villmann. Hierarchical
Strategy of Model Partitioning for
VLSI-Design Using an Improved Mixture of
Experts Approach. In Proc. of the Conference
on Parallel and Distributed Simulation
(PADS'96), pages 106-113, 1996.
[6] M. Snir, P. Hochschild, and D. D. Frye. The
Communication Software and Parallel Environment
of the IBM SP2 (PE). IBM Systems
Journal, 34(2), pages 185-204, 1995.
[7] W. G. Spruth. The Design of a Microproces-
sor. Springer Verlag, Berlin, 1989.
[8] C. C. Tung and C. Ussery. Face off: Cycle-Based
vs. Event Driven Simulation. Computer
Design's ASIC DESIGN, pages A14-A17,
1994.
[9] K. Westgate and D. McInnis. Reducing
Logic Verification Time with Cycle Simulation.
SpeedSim, Published in the Internet,
http://www.speedsim.com/tech/cyc sim.htm,
1996.
