Real-time dynamic simulation of the Cassini spacecraft using DARTS. Part 2: Parallel/vectorized real-time implementation by Man, G. K. et al.
N94-14639
Real-Time Dynamic Simulation
of the Cassini Spacecraft Using DARTS
Part II: Parallel/Vectorized Real-Time Implementation
A. Fijany, J.A. Roberts, A. Jain, and G.K. Man
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Drive, Pasadena, CA 91109
1. Introduction
Part I of this paper presented the requirements for the real-time simulation of Cassini
spacecraft, along with some discussion of DARTS algorithm. Here, in Part II we discuss the
development and implementation of parallel/vectorized DARTS algorithm and architecture
for real-time simulation. Development of the fast algorithms and architecture for real-time
hardware-in-the-loop simulation of spacecraft dynamics is motivated by the fact that it
represents a hard real-time problem, in the sense that the correctness of the simulation
depends on both the numerical accuracy and the exact timing of the computation. For
a given model fidelity, the computation should be completed within a predefined time
period. Further reduction in computation time allows increasing the fidelity of the model
(i.e., inclusion of more flexible modes) and the integration routine.
An analysis based on the computational structure of DARTS and the specific dynamic
model of the spacecraft is made to determine efficient algorithmic/architectural techniques
for achieving real-time simulation capability. This analysis indicates that a combined
parallel/vector algorithmic technique along with a multiple vector processors architecture
represents the most efficient and cost-effective approach.
The most important (and the new) issue in this paper is the development of the vec-
torized algorithms for spacecraft dynamic simulation. Until recently, only the users of
vector supercomputers for non-real-time applications were concerned about the vectoriza-
tion issue. Usually, the vectorization was limited to the use of the automatic vectorizers,
provided by the vector supercomputers vendors, using an already developed software code.
This represents a suboptimal use of the vector supercomputers computing power since the
automatic vectorizers have a very limited capability and are efficient only for low level
vectorization. For most problems, a significant speedup can be achieved by developing a
new algorithm or restructuring the old algorithm by global vectorization of the compu-
tation. However, due to the non-real-time nature of the applications, the fact that the
vector supercomputers even in scalar mode (for serial computation) were faster than any
other serial processor, the time and effort required for the development of new vectorized
algorithms and software codes, the users were, most often, satisfied by the suboptimal per-
formance. As a result, the development of the vectorized algorithms has been studied for
3O7
https://ntrs.nasa.gov/search.jsp?R=19940010166 2020-06-16T19:42:14+00:00Z
very few and mostly regular problems, e.g., matrix-vector operations, direct and indirect
(iterative) linear system solution, etc. To our knowledge, the vectorization of multibody
dynamics has been only recently studied for a rather simple caseof a serial chain of rigid
multibody (a robot manipulator) on Cray supercomputers [5,6].
However, this situation is rapidly changing and more effort is being made on the
analysisand designof vectorized algorithms and software. This is motivated by the advent
of a new generation of low-cost single-board vector processorswith computational powers
previously offered by the vector supercomputers. These new vector processorsprovide,
for the first time, the opportunity for the designand implementation of high performance
embeddedcomputer architecture for real-time applications. However,in order to meet the
real-time constraint, efficient vectorized algorithms need to be developedfor exploitation
of computing power of thesenew vector processors.
The hardware and software considered in Part II of this paper represent the first
generation of such low-cost vector supercomputers. As such, they lacked some important
features for efficient vectorization and parallelization. However, since the development
of this work, significant improvement has beenmade on hardware and software of new
generations. The approach developed in Part II of this paper and the lessonslearned
through practical hardware and software implementation, along with theseadvancednew
generationsof multiple vector processors,indicate that the real-time simulation capability
for more complex systemssuchas the SpaceStation is now achievable.
Part II of this paper is organizedasfollows: Section2 reviewsthe different algorithmic
and architectural techniquesfor fast implementation of DARTS, and discussesthe features
of selectedtarget architecture; Section 3 discussestechniquesfor global vectorization and
efficiencyof vector algorithms; Section 4 discussessomeimplementation issuesand several
aspectsof the implemented algorithms through examples;and finally, Section 5 contains
somediscussionand concluding remarks.
2. Algorithm and Architecture Selection For Real-
Time Simulation
A. AN ANALYSIS OF ALGORITHMIC / ARCHITECTURAL TECH-
NIQUES FOR FAST IMPLEMENTATION OF DARTS
Generally, there are three algorithmic/architectural techniques that can be used to
speed up the computation of a given problem: symbolic manipulation, parallelization, and
vectorization. The choice of one or a combination of these techniques depends on: (1) the
structure of computational problem, and (2) the availability and cost effectiveness of the
required computer architecture.
Symbolic manipulation is a rather straightforward technique that is widely used in
multibody dynamics community (see, for example, [12]). Using this technique, a greater
computational efficiency can be achieved by eliminating the redundant operations and r
308
ReaclJon Wheels
LPPP
10
! DOF
HPSP
DOF BUS
2 DOF 2 DOF
Fuel Tanks Engines
Figure 1. Cassini Spacecraft Star-Topology Dynamics Model and Computational
Steps of DARTS
SUN Workstatlon
External Host
L
I 68o30 SBCInternal Host
ETHERNET
i-tlL- iVector Processor 1 Vector Processor 2
1 I
VME Bus
I Real-Time Control I
Computer and
Flight Hardware
I,i
i Memory Reflector L
I
I
Figure 2. Dedicated Parallel/Vector Computer Architecture for Real-Time
Dynamic Simulation of Cassini Spacecraft
3O9
generating the symbolic expressions for the equations. However, two issues regarding the
application of the symbolic manipulation technique need to be considered. First, the
speedup due to the symbolic manipulation should be analyzed in a relative context and
not as an absolute one. That is, if the original algorithm has a compact, efficient, and
recursive structure-whlch is the case for DARTS-then the use of symbolic manipulation
will not result in a noticeable speed-up. Second, the evaluation of the symbolic expressions
is a strictly serial computation. Hence, if symbolic manipulation is used, then it would be
difficult to further reduce the computation time by parallelization and/or vectorization. In
this case, the only way to reduce the computation time is to use a faster serial processor.
However, both the structure of DARTS and the specific model (star topology and
flexible bus) of the Cassini spacecraft make the computation highly suitable for paral-
lelization and/or vectorization. For parallel computation, at first glance it may seem that
the computation can be fully parallelized by assigning one processor per body. However,
as discussed below, this will lead to a limited speed-up. For vector computation, a large
part of the computation can be described in terms of two basic operations: scatter and
gather operations, which are highly suitable for vectorization since they involve operations
on large matrices and vectors. Furthermore, the size of matrices and vectors increases with
both the number of flexible modes and the number of appendages. In order to better assess
the suitability of the computation for parallel and/or vector computation and analyze the
resulting algorithmic/architectural trade-offs, a more careful study of the structure of the
computation is needed. Note that in this study we are only interested in the coarse grain
parallelism since it can be exploited by low-cost, commercially available, multiprocessor
architecture.
The basic computational steps of the DARTS for the Cassini spacecraft (Figure 1)
can be summarized as follows (see [1,3] for a more detailed discussion):
Step I:
1. Propagate the linear and angular velocity from the bus to appendages.
This step is suitable for both parallelization and vectorization. It can be done in
parallel for all appendages. It also represents a scatter operation and can be done by
performing a single, large matrix-matrix multiplication (see Section 4).
2. Compute the gyroscopic accelerations and forces of the appendages.
This computation is more suitable for parallelization since it can be done in parallel
for all appendages. It involves matrix-vector operations with rather small vectors and
matrices which makes it less efficient for vectorization (see also Section 4).
310
Step II:
1. Propagate Articulated-Body Inertia from appendages to the bus and com-
pute the Articulated-Body Inertia of the bus.
The propagation of the Articulated-Body Inertia from appendages to the bus can be
performed in parallel for all appendages. But the computation of the Articulated-Body
Inertia of the bus remains serial and also involves many-to-one type of interproces-
sor communication. However, both the propagation of the Articulated-Body Inertia
from appendages and the computation of Articulated-Body Inertia of the bus can be
described in terms of gather operations, which involve matrix-matrix multiplications
with very large matrices and, hence, they are highly efficient for vector computation.
2. Propagate the residual forces from appendages to the bus and compute
the effective residual forces of the bus.
Again, the propagation of the residual forces from appendages to the bus can be per-
formed in parallel for all appendages. But the computation of residual force of the bus
remains serial and also involves many-to-one type of inter-processor communication.
However, both the propagation of the residual forces and the computation of residual
force of the bus can be described in terms of gather operations, which involve matrix-
vector multiplications of large matrices and vectors and, hence, are highly efficient for
vector computation.
Step III:
1. Compute the acceleration of bus.
The computation of acceleration of bus involves the solution of a symmetric, positive
definite, linear system which is more suitable for vectorization than for coarse grain
parallelization.
2. Propagate acceleration of bus to appendage.
As in Step 1.1, this propagation can be performed in parallel but it also involves one
type to many types of interprocessor communication. It also represents a scatter
operation and can be done by performing a single, large matrix-vector operation.
3. Compute hinges acceleration.
Similar to Step 1.2, this computation is more suitable for parallelization since it can be
done in parallel for all hinges. It involves matrix-vector operations with rather small
vectors and matrices which makes it less efficient for vectorization.
The above analysis clearly suggests that the computation of DARTS for the Cassini
311
spacecraft can be speeded up by both parallelization and vectorization. Furthermore, a
combined parallelization/vectorization algorithmic approach can lead to a speed-up greater
than that achievable by either parallelization or vectorization alone. This combined algo-
rithmic approach is further motivated by the emergence of low-cost multiprocessor archi-
tectures that employ vector processors, such as Intel i860, as the node processor.
There are, however, two issues that need to be considered in applying this combined
algorithmic approach, which also can affect the choice of an optimal target architecture
for its implementation. The first issue is that a limited speed-up can be achieved by
assigning one processor per body since the ratio of fully parallelizable computations over
strictly serial computations isn't very large. This is mainly due to the specific model of
Cassini spacecraft; that is, rigidity of the appendages and high flexibility of the bus. In
fact, most of the fully parallelizable parts of the algorithm involve the computations for
the rigid appendages; e.g., Steps 1.2 and III.3, which are less intensive than the strictly
serial parts that involve the computations for flexible bus, and also the computation of
Articulated-Body Inertia (Step II.1) and acceleration (in Step II.1) of the bus.
Another important factor for efficient parallelization of the computation is the proces-
sors interconnection. As stated before, the parallelization of DARTS for Cassini spacecraft
involves many-to-one and one-to-many types of processors communication. Therefore,
without an interconnection structure that can handle fast processors communication of
these types, the communication overhead can degrade the achievable speed-up.
The second issue is that there is a trade-off between the degree of parallelization (i.e.,
number of processors), and the degree of vectorization. To see this, let us consider those
steps that are suitable for both parallelization and vectorization (Steps 1.1, II.2, III.2, etc.).
For example, in Step I.l., with one processor per appendage, the propagation can still be
done by performing matrix-vector operations with small matrices. However, if the number
of processors is reduced-which also reduces the speedup due to the parallelization-then
the propagation for more than one appendage is done by each processor which implies that
the size of matrices and hence the speedup due to vectorization increases.
B. THE CHOICE OF TARGET ARCHITECTURE
Based on the above analysis and given the possible options on the commercially avail-
able low-cost multiprocessor architectures (at the time of this project) and other constraints
on cost, hardware and software development time and effort, we chose a two-vector pro-
cessors architecture [2]. This choice was based on our conclusion that, in order to speed up
the computation, it was more efficient (both from an algorithmic and architectural point
of view) to exploit a limited parallelism but attempt to exploit maximum vectorization.
Figure 2 shows the dynamic simulation system [2]. It consists of a SUN workstation
and a VME subsystem. The SUN workstation is the host of system, which is used for
software development, and is interconnected through ETHERNET to the VME subsystem.
The VME subsystem includes a general-purpose single-board computer based on 68030
312
processor, which is the local host of VME subsystem, two single board vector processors,
and a memory reflector board for high speed interface with another VME system, the
real-time control computer.
Each vector processor is a SKYbolt VME bus compatible board with an i860 as the
vector processor and an i960 as the communication processor. The choice of the SKYbolt
over other commercially available i860-based boards was mainly due to a faster main
memory [2]. The SKYbolt was the only one that provided a SRAM main memory while
the others had a DRAM main memory with a memory bandwidth of half of that of SRAM
main memory. As will be discussed in Section 5, our practical implementation showed that
the choice of the SRAM main memory resulted in a very decisive factor in meeting the
real-time constraint.
The VME compatibility was basically required for the purpose of integration with and
the interface to the rest of the spacecraft hardware-in-the-loop simulation hardware. Based
on the vendor's specification, the SKYbolt board provided three communication channels
through the VME port, VSB port, and AUX (a fast and private I/O) port [13]. Therefore,
it was originally assumed that the communication between the two SKYbolt boards would
be performed by using the fast AUX port. However, neither AUX port nor VSB port
were functional at the time of our implementation. This forced us to use the VME bus as
the communication bus between the two SKYbolt boards. However, the VME interface
chips on the SKYbolts were not fully functional. This resulted in a significant loss in the
communication speed between the two SKYbolts compared to the nominal speed of the
VME bus. As a result, our system was highly imbalanced for parallel computation since
the processors' computation speed (particularly in full vector mode) was much greater than
the bus' communication speed. This implied that only very coarse grain parallelism with
minimum communication requirement could be efficiently exploited by the system. Note
that, even by using a fully functional AUX channel the system would have still remained
imbalanced. This clearly indicates that, without using an extremely fast communication
structure, efficient parallel computation with multiple vector processors such as i860 would
not be possible (see Section 5).
3. Vectorization Strategy
The SKYbolt can be used as an accelerator, i.e., simply as a fast serial processor,
to speed up the serial computation. According to the vendor's claim, in accelerator
mode the SKYbolt can provide a speed-up of about 2 over the SUN Spare II. Our double
precision implementation of serial DARTS algorithm on both the SUN Spare II and the
SKYbolt showed a similar speedup (Table I). This result indicated that, in order to meet
the real-time constraint, a greater speedup through vectorization of the algorithm was
needed.
The i860 has peak computational power of 80 and 64 MFLOPS for single- and double-
precision computation. It has most of the functional units of vector supercomputers.
However, the vector supercomputers, such as the Cray series, in addition to a fast vector
313
processingunit, also have a fast (usually the fastest available) scalar processor for serial
computation [8,10]. For i860, both vector and serial parts of the computation are per-
formed by the same units. As a result, the i860 has a poor ratio of speed of serial over
vector computation because the speed of serial computation (as can be seen by the above
comparison with SUN Spare li) is much less than that of vector computation. Thus, a
higher degree of vectorization, even more than that for vector supercomputers, is required
for i860 to achieve a satisfactory sustained computation power. -.
The SKYbolt also provides a software environment almost similar to that offered
by the vector supercomputers. However, the i860 and the SKYbolt represent the first
generation of such low-cost vector processors and, therefore, lack many important features
needed for efficient vector computation (see Section 5). Nevertheless, the strategy for
vectorization on the SKYbolt is basically similar to that for other vector processors. In
the following, we briefly discuss some of the key issues that have been considered in the
design, analysis, and implementation of the vectorized version of a DARTS algorithm.
A. LOCAL AND GLOBAL VECTORIZATION TECHNIQUES
There are two techniques for vectorization of a given computation [9,10]: local and
global vectorization. The SKYbolt, like vector supercomputers, provides two tools for local
vectorization. The first tool is a library of highly optimized routines for matrix-vector and
other operations that can be used by subroutine calls. The second tool is an automatic
vectorizer that analyzes the data dependency and then vectorizes the computation of in-
nermost Do-loops (i.e., scalar loops) of the overall computation [7,8-10]. Another widely
used technique not in the above computation is chaining of the operations [8-10].
However, if the matrices and vectors are small, then the use of optimized routines
does not significantly increase the performance. Also, if a code is already developed and
optimized for serial computation, then it may have strong data dependency in which even
the most advanced automatic vectorizers cannot vectorize. For most problems a greater
speedup can be achieved by recasting the algorithm in a form suitable for vector computa-
tion, i.e., by a global vectorization. This is more difficult than the local vectorization and
can be only done by the algorithm designer [9,10] as it may require major restructuring
of the data and computation of the algorithm. In Section 4, we discuss some examples of
such global vectorization. It should be also mentioned that our practical implementation
indicated that even efficient use of library routines may require restructuring of the com-
putation. There are not well-defined techniques for global vectorization and it is indeed
highly problem dependent [9]. Nevertheless, there are several key issues regarding efficient
vectorization that need to be taken into account in the design and analysis of vectorized
algorithms. These issues are briefly reviewed here. The reader is referred to [7, 9-11] for a
more detailed discussion.
B. THE EFFICIENCY OF VECTOR ALGORITHMS
The speedup of vectorized algorithms, like parallel algorithms, is measured according
314
to the Amdahl's Law.
Let f represent the vectorized fraction of the computation, k the speed of vector
operations relative to the speed of scalar operations, and SP the speedup of the vectorized
algorithm over serial algorithm. It follows then that
1
sP =
- f) + f /k
In order to increase the speedup, f and k should be increased, k is a function of the size
of vectors and matrices as well as the type of matrix-vector operation. The computation
time, T, of a vector operation is given as:
T = r + nt (2)
where n and t stand for the size of the vectors and the clock time of the vector processor, r
represents the overhead of vector operation due to the loop setup, load and store operations,
etc. If n is large enough so nt >> r then the computation of vector operation is
dominated by nt. That is, k is maximized and the vector processor performs one operation
per clock cycle.
There is a vector size below which vector computation becomes less efficient than
scalar computation. This size is called breakeven point [7] and is designated as nB . The
value of nB depends on the type of operation. There is no information on n for the i860.
Although originally we suspected nB to be rather small [3], in practical implementation
and for various matrix-vector operations we found nB to be quite large (several times that
for Cray series), which indicates that only operations on very large matrices and vectors
can be efficiently implemented on the i860.
As a conclusion, in vectorizing the algorithm an attempt should be made to:
(1) increase the number of matrix-vector operations, and hence increase f; and
(2) increase the size of the vectors and matrices, n, so that n >> nB , and hence increase
k.
C. MEMORY BANDWIDTH AND DATA ORGANIZATION
For vector processing, the data movement may sometimes take more time than the
computation (see the example in Section 4). Therefore, the second issue in analyzing
the performance of vector supercomputers is the data structure. To efficiently use the
high speed floating-point units, data should be fed with adequate speed. In the pipelined
mode, the i860 can initiate two floating-point operations (one add and one multiply) per
clock. This requires fetching four operands and storing two results per clock which indeed
requires a very high bandwidth memory. To achieve such a high bandwidth, the vector
315
supercomputers use a hierarchical memory organization. However, in addition to the
memory organization, the data structuring is also neededfor achieving and maintaining
the high bandwidth. For example,while the i860canperform two floating-point operations
per clock cycle, fetching an operand from an arbitrary location in the main memory can
take severalclock cycles.
The memory organization of vector supercomputers usually includes a set of registers,
as a fast and limited size memory; a cache memory, as medium-size medium-speed memory;
and a main memory, as a large and slow memory. The i860 has a set of thirty two 32-bit
data registers and 8 Kbyte (1 K double-precision) data cache. The selected SKYbolt board
provides a 2-Mbyte fast SRAM memory as the main memory. Unlike the register-oriented
vector supercomputers, such as Cray series, which utilize a larger size register (in the order
of Kbyte), the i860 has a rather small size set of registers. However, it is claimed that the
cache memory can be used with the same performance as the registers for vector operations
[4].
To minimize the data movement overhead, the following issues need to be considered:
1. Data Contiguity:
The related data should be locatedl as much as possible, in the contiguous locations
in the cache and main memory. Obviously for vector operation the elements of the
vectors (and matrices) should be stored in contiguous locations, i.e., with unit vector
stride. The vector instructions that access memory have a known pattern and if the
elements of vectors (matrices) are all adjacent, then the maximum speed in data access
is achieved by pipelining.
2. Data Locality:
Given the slow speed of the main memory, the access to the main memory should be
minimized. This implies that the intermediate data should be kept in the registers and
cache memory. Also, once data is fetched from the main memory and loaded into
the cache, all of the operations that require the data should be performed before
the data is returned to the main memory, i.e., the vector touch should be minimized.
Given the limited size of the cache, this may even require reordering the computation.
As a conclusion on the design of efficient vector algorithms, we would like to quote
from [11, p. 47], "We have shown that the efficiency of a vector- pipeline matrix computa-
tion depends upon the vector length, the vector stride, the vector touch, and the data re-use
properties of the algorithm. Optimizing with respect to all these attributes is very compli-
cated and something of an art. A good compiler can of course do some of the thinking for
us, but do not count on it!"
Note that in [11] general matrix computations are considered which are much simpler
than a rather complex algorithm such as DARTS.
316
4. Development and Implementation of Vectorized /
Parallel DARTS
Based on the analysis of Sections 2 and 3, we first developed a parallel/vectorized
version of DARTS [3]. However, the practical implementation of this algorithm resulted in
an interactive vectorization process. Detailed timing was used to measure the computation
time of each subroutine and the overall computation. The data structure and operations
of the algorithm were then constantly modified to minimize the computation time. As a
result, the final implementation was different from the original algorithm in [3]. Two issues
made these modifications necessary. First, the algorithm in [3] was based on general and
theoretical assumptions regarding vector processing. Given the fact that this was our first
experimentation, many lessons were learned on detailed practical issues through actual
implementation. The second, and more important, issue was due to the shortcomings of
both hardware and software of the SKYbolt. Some of the necessary routines either were
not provided or were not functional. Also, no means was provided to control the cache
memory (see also Section 5). As a result, we were forced to develop our own subroutines
or to change the computation. Here, we discuss some of the implementation issues. Due
to the lack of space, only a few representative examples are given.
A. SCATTER OPERATIONS: VELOCITY PROPAGATION
The propagation of velocities is a simple, but representative, example that shows
how the topology of the spacecraft allows efficient global vectorization of computation,
which follows m and n that stand for the number of bus flexible mode and the number
of appendages. Here, the main computation is the evaluation of the deformation variables
for all the appendages:
Fori=l ton,
rgl
j----1
(3)
m
j=l
(4)
m
= =
j=-I
(5)
m
3_(i) = Z "TiJilJ = "r(i)il
j=l
(6)
317
A
where r/= col{r/j} and r) zx col{r_}c_mx, are the vectors of modal deformation coordinates
of the bus. Aij and ")*ijC_ TM are the rotational and translational displacement vectors
of the jth mode for the ith appendage, A(i) _ row{A/j} and 7(Q _ row{'/i_/} e_ax'' "
8r(i),/St(i), 6,,(0, and 6v(i)eR a×_ are the translational and rotational deformation and the
linear and angular deformation velocities of appendage i. Due to the star topology of
the spacecraft, the propagation for all appendages can be done simultaneously, i.e., the
computation in Eqs. (3)-(6) can be done in parallel for all i = 1 to n.
For serial computation, the two forms in Eqs. (3)-(6) have the same cost while the
second form is more efficient for vector computation. This efficiency for vectorization can
be further increased as follows. Define
zx = { A } e_6,x,,"7) = {r/ _}c_'_×2; A zx col{A(/)) and 3, = co1{3,(i)}_3"x"; 1;I z_ 3'
&- _ co1{6,-(i)}, & = col{6e(i)},G, = col{&,(/)}, and _5,, = col{&,(i)}eRa"×';
8t and 8,or = 8v '
The computation in Eqs. (3)-(6) can then be performed by a simple matrix-matrix multi-
plication as:
6 = I:I_ (7)
Note that the matrix fl is constant and can be precomputed. Also, the above computation
results in a certain arrangement of the vectors 8,-(i),St(i),6,,,(i), and _v(i), which affects
the rest of the computation and should be taken into account. However, because of the
structure of the matrix H and the vector _ is not efficient to use, the regular matrix-matrix
multiplication routine is based on the vector-dot operation (see below).
B. SCATTER AND GATHER OPERATIONS: FORCE AND ACCELERA-
TION PROPAGATION
Using similar technique as for the velocity propagation, the propagation of force and
acceleration can be globally vectorized and represented in terms of large matrix-vector
multiplications as [3]:
Z(B) = ( c* ) Z+ + K = (H*¢)Z+ + K (8)
c_+ = (II¢') cffB) = (YI¢*)cffB) (9)
where Z(B) and _(B)e_ ¢'+6)×_ are the vectors of residual force and acceleration of the
bus. Z+(i) and _+(i)c_ TM are the residual force and acceleration of appendage i, and
Z + _ col {Z+(i)} and o_+ _ col {al+(i)}c_}_ 6nxl" rl¢_}_ 6nxm is an appropriate combination
of A(i) and 3'(Q and can be precomputed. ¢¢IR 6x6" is a sparse matrix that needs to be
formed in real time. In Eqs. (8)-(9), H*¢eN (m+6)x6,' and H¢*eN 6"x('n+6)
318
The matrix-vector multiplication routine provided by the SKYbolts (and other vector
supercomputers) is based on the vector-dot operation. Consider a matrix-vector multipli-
cation as V -" MU where M is a PxQ matrix and let M (i) and M(i) denote the rows and
columns of matrix M, respectively. The vector-dot based routine is given:
Fori=l toP,
V(i) = M (i) . U (10)
However, another possible algorithm for matrix-vector multiplication is based on the
SAXPY (scalar-vector multiply plus vector) operation [11]:
Fori=l toQ,
Yi= Y i-1 + M(i)U(i) (11)
Both the vector-dot and SAXPY operations are highly suitable for vector computation.
The operation in Eq. (10) requires P vector-dot operations on vectors of dimension Q
while that in Eq. (11) requires Q SAXPY operations on vectors of dimension P. Based
on our discussion in Section 3.B, it then follows that the P < Q, the vector-dot based
routine, and the P > Q, the SAXPY based routine, are more efficient.
For our implementation, the values of n and rn were n = 13 and m = 10. Thus, the
vector-dot routine is highly optimal for matrix-vector multiplication in Eq. (8) because
it requires 16 vector-dot operations on vectors of dimension 78. However, it is highly
inefficient for Eq. (9) as it requires 78 vector-dot operations on vectors of dimension 16. If
the SAXPY based routine is used for Eq. (9), then it requires only 16 SAXPY operations
on the vectors of dimension 78.
The C language was used for the development of our vectorized code, which implied
that the matrices are stored by rows. However, for efficient implementation of the SAXPY
based routine, matrices need to be stored and fetched by columns. For Eq. (9), the need
for transposing the matrix He* can be simply eliminated by rewriting it as:
= (a(B))*(n¢*)* = (12)
Another advantage of Eq. (12) is that for both Eqs. (8) and (12), only the matrix H*¢
needs to be formed.
Our SAXPY based routine, though developed in C language, significantly increased
the computational efficiency and was used very frequently. Obviously, if this routine is
provided by the vendors and developed in assembly language, it can offer an even greater
computational efficiency. Note that, for the matrix-matrix multiplication in Eq. (7), we
also used a SAXPY based matrix-matrix multiplication routine that is more efficient than
the vector-dot based routine.
319
C. COMPUTATION OF ARTICULATED-BODY INERTIA AND ACCEL-
ERATION OF BUS
The computation of the articulated-body inertia, P(B)_(m+_)×(m+6), and accelera-
tion, c_(B), of the bus represents the major computation-intensive parts of the vectorized
algorithm (over 30% of the total computation time). As stated before, the computation
of P(B) represents a gather operation and was globally vectorized in a similar fashion as
the computation of Z(B). However, significant reduction (more than 50%) in computation
time was achieved by several changes in the data structure and the type of operations to
find the most optimal way for computation of P(B). In the final form, the symmetry
of P(B) was exploited and only the diagonal and lower triangular parts of P(B) were
computed, a(B) is computed as the solution of the system.
P(B)c_(B)=¢(B) (13)
We first used a Cholesky-based routine provided by the SKYbolt's library for the solution
of the symmetric, positive definite, system in Eq. (13). Later, we developed a routine based
on the LDL* decomposition [11] that did not require square-root operation. Although our
routine was developed in C and was not vectorized, it was significantly faster than the
SKYbolt's routine. However, the main motivation for and the advantage of this routine
was that, given the way the matrix P(B) was computed, it could easily be used for solution
of Eq. (13) without any need for data movement. Again, if this routine is developed by
the vendor in assembly language and in fully vectorized form, it can offer an even greater
computational efficiency.
D. DATA MOVEMENT MINIMIZATION
Major improvement in the efficiency of the vectorized algorithm was achieved by
minimizing the data movement overhead through modification of the data structure and
operations of the algorithm. Here, a few examples are discussed.
1. Matrix-Matrix Multiplication
The computation of vectorized DARTS involves many matrix-matrix multiplications
as A = BC for both small and very large matrices. A vector-dot based matrix-matrix
multiplication routine requires that first the matrix C be transposed. However, if matrix
C is symmetric, then it does not need to be transposed. The SKYbolts provided two
matrix-matrix multiplication routines for the general (nonsymmetric matrix) case and the
special case (symmetric matrix). However, even for the general case, whenever possible
we eliminated the need to transpose the matrix C by either forming C* (if it could be
precomputed) or directly computing C*.
Another frequently used operation was chained matrix-matrix multiplication, as
A = BCB*, for both small and very large matrices and with C being a symmetric ma-
trix. For example, this type of matrix-matrix multiplication occurs in projection of mass
320
matrices of the appendages onto the bus frame or in computation of P(B). This opera-
tion can be performed without any need for matrix transposition by simply rewriting it
as A = B(BC)* . This simple modification resulted in a significant reduction of the data
movement overhead particularly for computation of P(B) wherein the matrices involved
in the computation were very large.
2. Vector Touch Minimization
One of the widely used operations in the vector processing is a GAXPY (matrix-vector
multiply plus vector) operation [11] as V1 = MV2 + 1/3 wherein M is a matrix and I/"1, V2,
and V3 are vectors. In addition to the computational efficiency, a GAXPY routine reduces
the vector touch since the vector V4 = MV2 does not need to be explicitly computed,
stored, and reloaded. However, the SKYbolt's library did not provide such a routine and
we had to develop our own routine. Several other routines were also developed for other
operations with the purpose of minimizing the vector touch.
3. Data Structure Modification
Our major effort in reducing the data movement overhead was based on modifying
the data structure of the algorithm to find the most optimal form. Here, we give a simple
example that underlines the importance of the data movement overhead minimization. The
computation times of evaluating angular, awi, and linear, avi , gyroscopic acceleration
of appendages for i = 1 to n, were measured as 137 #s and 121 t_s. For the rest of the
computation, it wasthen required to merge the vectors a_,,i and a,,i and form a vector
a i = . However, it took 143 ps to form the vectors ai for i = 1 to n which was
avi
greater than the computation time for either awl or avi. The algorithm was then modified
to directly compute and form the vectors ai without any data movement. This simple
example clearly shows that for vector processing the data movement time can be even
greater than the computation time.
E. GLOBAL VECTORIZATION OF SMALL MATRIX-VECTOR OPERA-
TION LOOPS
A rather significant part of the DARTS algorithm, which seemed to be unvectorizable,
involved many Do-loops with small vectors and matrices. An example of such frequently
occurring Do-loops is:
Fori= 1 ton,
Iq, = MiV2i + V3i (14)
where Vli,r2i , and Vai are 3 x 1 vectors and Mi is 3 x 3 matrix. Due to the small
dimension of vectors and matrix, it is more efficient to use scalar (serial) routines for such
Do-loops. However, we developed a technique for global vectorization of such Do-loops.
321
Algorithm
Serial DARTS
on SUN SPARC II
Serial DARTS
on 1 SKY Bolts
Vectorized DARTS
on one SKY Bolt
Parallel/Vectorized DARTS
on two SKY Bolts
Computation Time
(in ms)
24.39 ms
12.37 ms
7.29 ms
4.82 ms
Speedup
DI
(Reference Time)
2
Faster Hardware
1.7
Vectorization
1.5
Parallelization & Vectorization
Table I. Comparison of different algorithms/architecture computational efficiency
To see this, let us define
V 1 _-_ CO1 {Eli}, V 2 _ c01 {V2i},V 3 Lx col {Vai}e:_ a"xl, and M diag {Mi}e_, a"×a"
The above loop can then be replaced by a single matrix-vector multiplication:
171 = MI/) + V3 (15)
Of course, due to the sparse structure of matrix M, it is highly inefficient to compute Eq.
(15) by performing a general matrix-vector multiplication. However, the matrix M is a
banded matrix with the nonzero elements only on its five leading diagonals. The compu-
tation in Eq. (15) can be efficiently done by performing the matrix-vector multiplication
by diagonals. To see this, let M j, j = -2 to 2, denote the diagonals of matrix M, where
M ° is the main diagonal.
The computation in Eq. (15) can then be done:
Set 171-a = V3
For j = -2 to 2,
V] = MJ (D V2 + V_ -' (16)
322
r
where ® indicates element-by-element multiplication of two vectors. The element-by-
element vector multiplication plus vector operation in Eq. (16) is highly efficient for
vector computation and it is also provided by the SKYbolt's library. The efficiency of
this technique for global vectorization results from the fact that it involves five such vector
operations on large vectors. However, we did not implement this technique since the routine
provided by the SKYbolt's library was not functional for double precision. Furthermore, its
efficient implementation requires the minimization of the data movement overhead which
may occur in forming the diagonals of matrix M. This requires a direct control of cache
memory, which was not possible on the SKYbolt. Nevertheless, for future applications,
this technique is very promising as it allows the seemingly strictly serial Do-loops to be
vectorized.
5. Discussion and Conclusion
Table I shows a comparison of the different implementations of DARTS for a 13-body
and 10-flexible modes model of Cassini spacecraft. As stated before, the speedup of the
vectorized algorithm increases with the increase in the number of flexible modes and/or
the number of bodies. For a 13-body and 20-flexible modes model, the vectorized algo-
rithm achieves a speedup of 2 over serial DARTS on one SKYbolt. We did not discuss
the algolithm's parallelization in detail. Suffice it to mention that, despite using several
strategies to overlap the computation and communication as much as possible, the com-
munication overhead from the slow VME bus remained a major bottleneck, which explains
the rather poor speedup of parallelization.
Here, we would like to summarize some of the shortcomings of the SKYbolt and to
discuss some desired features.
A. DOUBLE-PRECISION PERFORMANCE
The i860 is claimed to be a 64-bit vector processor [4]. However, it has a 128-bit wide
data path, which means that only two double-precision operands can be simultaneously
loaded from or stored to the cache memory. This significantly reduces the speed of the
processor for those vector operations that involve three operands. We implemented our
vectorized algorithm with double precision. Although, we did not try the single-precision
implementation of the algorithm, given the high vectorization degree of our algorithm, a
much greater speedup can be expected for single-precision implementation.
B. LIMITED CACHE MEMORY AND LACK OF CACHE MANAGEMENT
The SKYbolt did not provide a means for managing the cache memory. Thus, we
could not further reduce overhead caused by the data movement between the cache and
main memory by explicitly defining the physical location of data in the cache memory and,
hence, increasing the data re-use. As a result, most of the computation was performed
on the data located in the main memory which, in addition to increasing the overhead,
significantly reduced the computation speed of both scalar and vector operations. Given
323
this extensive useof the main memory, a DRAM main memory with a slower speedover
an SRAM main memory, would haveincreasedthe overheadby a factor of 2.
For double precision, the sizeof i860 cacheis 1 K. However,the vectorized algorithm
involved the operations on matrices larger than the size of the cache. For example, the
matrix II*¢ in Eqs. (8) and (12) is of dimension 16 x 78 and the computation of P(B)
involves even bigger matrices. An efficient technique for handling such cases is the seg-
mentation of the computation [10]. However, this requires a direct control of the cache
memory which, as stated before, was not possible.
C. SKYBOLT'S LIBRARY
As stated before, the SKYbolt's library did not provide some of the useful routines
that were frequently used in our implementation. Also, some of the routines provided were
not functional either at all or for double precision.
Despite all the above shortcomings, the SKYbolt was highly cost effective and allowed
us to meet our goal (see Table I) with a relatively short development time. As stated
before, the SKYbolt represents the first generation of low-cost vector processors. The new
generations not only provide a drastic reduction in the cost over performance ratio, but also
significant improvements in both hardware and software. The size of cache memory in the
new versions of i860 has increased by a factor of 2. Single board multiple i860 processor-
based architectures [13] are now offered that present a much more balanced system for
parallel computation since the communication between processors can be performed on
board and via a fast interconnection network. The library routines are also improved.
In particular, based on our suggestions to the vendors, new routines including some of
the routines developed by us, e.g., the SAXPY-based matrix-vector and matrix-matrix
multiplication routines and LDL* routine for linear system solution,were added to the
library.
The results of our work, along with the significant improvements in both the price
and performance of these architectures, clearly suggest that the parallel/vector algorithms
and architectures present a highly efficient and low-cost approach for achieving real-time
simulation capability for even more complex and computationally demanding multibody
systems, such as the Space Station. In particular, it should be mentioned that the Space
Station has a star topology that allows the application of a similar global vectorization
strategy as for Cassini spacecraft. Also, due to the flexibility of Space Station appendages,
not only the computation for appendages can be vectorized but also, based on our analysis
in Section 2, more vector processors can be used to increase the speedup to the paralleliza-
tion.
Acknowledgments
The research described in this paper was performed at the Jet Propulsion Labora-
tory, California Institute of Technology, under contract with the National Aeronautics and
324
SpaceAdministration. The authors gratefully acknowledgethe assistanceof R. Graves,
K. Pendergast,and J. Hernandezin implementing the hardware and softwarediscussedin
Part II of thispaper.
References
1. A. Jain and G. Rodriguez,"Recursive Flexible Multibody System Dynamics using
Spatial Operators," J. Guidance, Control, and Dynamics, Nov. 1992. In Press.
, A. Fijany and J.A. Roberts,"Dedicated Computer Architecture for Real-Time Dy-
namic Simulation of CRAF/Cassini Spacecraft: Requirements and Specifications,"
JPL Engineering Memorandum EM 343-91-236 (internal document), Jet Propulsion Labora-
tory, Pasadena, California, March 1991.
. A. Fijany,"Parallel/Vectorized Algorithms for Real-Time Dynamic Simulation of
CRAF/Cassini Spacecraft," JPL Engineering Memorandum EM 343-91-1230 (in-
ternal document), Jet Propulsion Laboratory, Pasadena, California, June 1991.
4. L. Kohn and N. Margulis,"The i860 64-Bit Supercomputing Microprocessor," Proc.
Supercomputing 89, pp. 450-546, 1989.
. H. Cheng and K.C. Gupta,"Vectorization of Robot Dynamics on a Pipelined Vector
Processor," Proc. IEEE Int. Conf. on Robotics and Automation, pp. 96-101, April
1991.
° S. McMillan, D.E. Orin, and P. Sadayappan,"Real-Time Robot Dynamic Simulation
on a Vector/Parallel Supercomputer," Proc. of IEEE Int. Conf. on Robotics and
Automation, pp. 1836-1841, April 1991.
7. B.L. Buzbee,"A Strategy for Vectorization," Parallel Computing, Vol. 3, pp. 187-192,
1986.
8. I(.. Hwang and F.A. Briggs, Computer Architecture and Parallel Processing, McGraw-
Hill Inc., New York, New York, 1984.
9. K. Hwang,"Advanced Parallel Processing with Supercomputer Architectures," Proc.
of IEEE, Vol. 75(10), pp. 1348-1379, 1987.
10. K. Hwang and S-P Su,"Vector Computer Architecture and Processing Techniques,"
Advances in Computers, Vol. 20, pp. 115-197, 1981.
11. G.H. Golub and C.F. Van Loan, Matrix Computation, second edition, Johns Hopkins
University Press, Baltimore, Maryland, 1989.
12. D.E. Rosental and M.A. Sherman, "High Performance Multibody Simulation via Sym-
325
13.
bolic Equation Manipulation and Kane's Method," d. Astro. Sciences, Vol. 34(3), pp.
223-239, 1986.
SKY Computer Product Summary, Sky Computers Inc,. Chelmsford, Massachusetts,
April 1992.
326
327
0r--
o
mm
('3 _)'(n o
.c: "_ .--
0 .-,-, _ _. >,
(/) , _. _..'_'--
m
I==
°"_A" _ _o=,,, 8
328
c:E
OQ-
_0
m
0_ r-_
"_o__o .! ,-.o "o,,,•__,-_
_0 _D_- 0"__
•-0 LLI ,._.,,
_-uu
 .oo
329
l.O
@.a
m
330_i
03
ll
ll
(/)
a)
E
331
I\ \ \\_
\\\\_\\\\ ×_J_C
_O. uJ
rn
k-
\\_ _ o
\ \\\ _" "'
' t/) C_
332
OIlII
CO
%
m _
I--, In
333
¢tl
"O
O
>,,
m
E
tll
3
G)
t'll
ill
lib
"O
t"
_, r,,
o0')
rr
>.
W
_J
I
O-J
w_
0
bU
n_
0
o
0
_J
>.
n_
Z
m 0
LU
0
0
Z
0
m
<_
Z
=
l I I I I I I
334

336


®339
o_m
3
u G)QE_
3
(n
r-
4)
tn
C
0
u)
.u
r-l=
0; o
_E
E
UJ(_
34O
el m
3
°_
E
0
O o
t-. t::
"0
it..i.
tt_
ILl
=.E
w
e" "" t_
341
_e
U
t_
e_
"0
GI
0
e,
_o
_o
f_
t_
0342
==
/
343
_=
C/)
6 6
r- c
D. D.
0 (J
_ m
E _
-_ E
-- E
m o
U "0
m r-
U
¢'t
m _
344
l345



349

