3 research outputs found
Exploiting Fine-Grain Concurrency Analytical Insights in Superscalar Processor Design
This dissertation develops analytical models to provide insight into various design issues associated with superscalar-type processors, i.e., the processors capable of executing multiple instructions per cycle. A survey of the existing machines and literature has been completed with a proposed classification of various approaches for exploiting fine-grain concurrency. Optimization of a single pipeline is discussed based on an analytical model. The model-predicted performance curves are found to be in close proximity to published results using simulation techniques. A model is also developed for comparing different branch strategies for single-pipeline processors in terms of their effectiveness in reducing branch delay. The additional instruction fetch traffic generated by certain branch strategies is also studied and is shown to be a useful criterion for choosing between equally well performing strategies. Next, processors with multiple pipelines are modelled to study the tradeoffs associated with deeper pipelines versus multiple pipelines. The model developed can reveal the cause of performance bottleneck: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative (i.e., beyond basic block) execution is examined via probability distributions that characterize the inherent parallelism in the instruction stream. The throughput prediction of the analytic model is shown, using a variety of benchmarks, to be close to the measured static throughput of the compiler output, under resource and scope constraints. Further experiments provide misprediction delay estimates for these benchmarks under scope constraints, assuming beyond-basic-block, out-of-order execution and run-time scheduling. These results were derived using traces generated by the Multiflow TRACE SCHEDULING™(*) compacting C and FORTRAN 77 compilers. A simplified extension to the model to include multiprocessors is also proposed. The extended model is used to analyze combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, both with shared memory. It is shown that the number of pipelines (or processors) at which the maximum throughput is obtained is increasingly sensitive to the ratio of memory access time to network access delay, as memory access time increases. Further, as a function of inter-iteration dependency distance, optimum throughput is shown to vary nonlinearly, whereas the corresponding Optimum number of processors varies linearly. The predictions from the analytical model agree with published results based on simulations. (*)TRACE SCHEDULING is a trademark of Multiflow Computer, Inc
Parallel alogorithms for MIMD parallel computers
This thesis mainly covers the design and analysis of asynchronous
parallel algorithms that can be run on MIMD (Multiple Instruction
Multiple Data) parallel computers, in particular the NEPTUNE system at
Loughborough University. Initially the fundamentals of parallel computer
architectures are introduced with different parallel architectures being
described and compared. The principles of parallel programming and the
design of parallel algorithms are also outlined. Also the main
characteristics of the 4 processor MIMD NEPTUNE system are presented,
and performance indicators, i.e. the speed-up and the efficiency factors
are defined for the measurement of parallelism in a given system.
Both numerical and non-numerical algorithms are covered in the
thesis. In the numerical solution of partial differential equations,
a new parallel 9-point block iterative method is developed. Here, the
organization of the blocks is done in such a way that each process
contains its own group of 9 points on the network, therefore, they can
be run in parallel. The parallel implementation of both 9-point and 4-
point block iterative methods were programmed using natural and redblack
ordering with synchronous and asynchronous approaches. The
results obtained for these different implementations were compared and
analysed.
Next the parallel version of the A.G.E. (Alternating Group Explicit)
method is developed in which the explicit nature of the difference
equation is revealed and exploited when applied to derive the solution
of both linear and non-linear 2-point boundary value problems. Two
strategies have been used in the implementation of the parallel A.G.E.
method using the synchronous and asynchronous approaches. The results
from these implementations were compared. Also for comparison reasons
the results obtained from the parallel A.G.E. were compared with the ~
corresponding results obtained from the parallel versions of the Jacobi,
Gauss-Seidel and S.O.R. methods. Finally, a computational complexity
analysis of the parallel A.G.E. algorithms is included.
In the area of non-numeric algorithms, the problems of sorting and
searching were studied. The sorting methods which were investigated
was the shell and the digit sort methods. with each method different
parallel strategies and approaches were used and compared to find the
best results which can be obtained on the parallel machine.
In the searching methods, the sequential search algorithm in an
unordered table and the binary search algorithms were investigated and
implemented in parallel with a presentation of the results. Finally,
a complexity analysis of these methods is presented.
The thesis concludes with a chapter summarizing the main results