GPRM: a high performance programming framework for manycore processors
Processors with large numbers of cores are becoming commonplace. In order to utilise the
available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining
parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread
and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge.
In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks, their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads.
We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, solely by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations.
We use OpenMP, the most popular model for shared-memory parallel programming, as the main GPRM competitor for solving three well-known problems on both platforms: LU Factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in a notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List Processing and performs better than OpenMP implementations on the Xeon Phi.
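The idea of combining smaller tasks into larger ones can be illustrated with a minimal Python sketch. This is not GPRM code: the names `chunk_ranges`, `smooth_rows` and `run_coarsened`, the 3-point row filter standing in for a convolution, and the use of a thread pool are all assumptions made for illustration. The `n_tasks` argument plays the role of the number of tasks that the abstract says is controlled.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n_items, n_tasks):
    """Split range(n_items) into at most n_tasks contiguous (start, stop) chunks."""
    n_tasks = max(1, min(n_tasks, n_items))
    base, extra = divmod(n_items, n_tasks)
    ranges, start = [], 0
    for t in range(n_tasks):
        # The first `extra` chunks take one additional item each.
        stop = start + base + (1 if t < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def smooth_rows(image, start, stop):
    """Hypothetical per-task work: 3-point horizontal mean filter on rows start..stop-1."""
    out = []
    for row in image[start:stop]:
        padded = [row[0]] + row + [row[-1]]  # replicate the border pixels
        out.append([(padded[i] + padded[i + 1] + padded[i + 2]) / 3
                    for i in range(len(row))])
    return out


def run_coarsened(image, n_tasks):
    """Process the image with n_tasks coarse tasks instead of one task per row."""
    ranges = chunk_ranges(len(image), n_tasks)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: smooth_rows(image, *r), ranges)
        return [row for part in parts for row in part]
```

Calling `run_coarsened(image, 1)` and `run_coarsened(image, len(image))` gives the same result; only the number of tasks created, and therefore the per-task overhead, changes.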
The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without
sacrificing productivity.
Performance counter-based strategies to improve data locality on multiprocessor systems: reordering and page migration techniques
In this dissertation we study Precise Event-Based Sampling (PEBS) techniques to improve the performance of applications on a NUMA, Itanium2-based system. We demonstrate that low-cost PEBS profiling can support strategies that improve, at runtime, the performance of an important group of computational and scientific codes. In addition, the accurate information provided by the new Event Address Registers (EAR) of the Intel Itanium architecture helps foster the development of new data allocation strategies. Following this line, we have also developed a series of dynamic page migration strategies based on PEBS. Specifically, two problems are addressed: how to improve the performance of locality optimisation techniques for irregular codes at runtime, focusing on the Sparse Matrix-Vector product kernel, and how to develop strategies for dynamic page migration.
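To make the notion of an irregular code concrete, the following is a minimal Python sketch of a Sparse Matrix-Vector product. The CSR storage format and the names `spmv_csr`, `row_ptr`, `col_idx` and `values` are assumptions for illustration; the abstract does not state which format the dissertation uses. The indirect read `x[col_idx[k]]` is the data-dependent access pattern that locality techniques such as reordering try to improve.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect, data-dependent access to x: the source of irregularity.
            y[i] += values[k] * x[col_idx[k]]
    return y


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```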
To summarise, the main contributions of this dissertation are:
1. A study of the different factors that affect performance, as well as of data and thread allocation policies, on the FinisTerrae supercomputer, the target platform on which this thesis relies.
2. The implementation of a performance model for FinisTerrae.
3. The development of hardware counter-based strategies to assist reordering techniques
for irregular codes in order to reduce their cost and improve their behaviour.
4. The development of novel hardware counter-guided, dynamic page migration
algorithms that take advantage of the new features provided by the PEBS.
As a software contribution, we present a user-level page-migration framework to monitor,
sample and control an application at runtime.
The evaluation of computer performance by means of state-dependent queueing network models
Sharing GPUs for Real-Time Autonomous-Driving Systems
Autonomous vehicles at mass-market scales are on the horizon. Cameras are the least expensive among common sensor types and can preserve features such as color and texture that other sensors cannot. Therefore, realizing full autonomy in vehicles at a reasonable cost is expected to entail computer-vision techniques. These computer-vision applications require massive parallelism provided by the underlying shared accelerators, such as graphics processing units, or GPUs, to function “in real time.” However, when computer-vision researchers and GPU vendors refer to “real time,” they usually mean “real fast”; in contrast, certifiable automotive systems must be “real time” in the sense of being predictable. This dissertation addresses the challenging problem of how GPUs can be shared predictably and efficiently for real-time autonomous-driving systems. We tackle this challenge in four steps.
First, we investigate NVIDIA GPUs with respect to scheduling, synchronization, and execution. We conduct an extensive set of experiments to infer NVIDIA GPU scheduling rules, which are unfortunately undisclosed by NVIDIA and are inaccessible owing to the closed-source software stack. We also expose a list of pitfalls pertaining to CPU-GPU synchronization that can result in unbounded response times of GPU-using applications. Lastly, we examine a fundamental trade-off for designing real-time tasks under different execution options. Overall, our investigation provides an essential understanding of NVIDIA GPUs, allowing us to further model and analyze GPU tasks.
Second, we develop a new model and conduct schedulability analysis for GPU tasks. We extend the well-studied sporadic task model with additional parameters that characterize the parallel execution of GPU tasks. We show that NVIDIA scheduling rules are subject to fundamental capacity loss, which implies a necessary total utilization bound. We derive response-time bounds for GPU task systems that satisfy our schedulability conditions.
Third, we address an industrial challenge of supplying the throughput performance that computer-vision frameworks need to support the adequate coverage and redundancy offered by an array of cameras. We re-think the design of convolutional neural network (CNN) software to better utilize hardware resources and achieve increased throughput (number of simultaneous camera streams) without any appreciable increase in per-frame latency (camera to CNN output) or reduction of per-stream accuracy.
Fourth, we apply our analysis to a finer-grained graph scheduling of a computer-vision standard, OpenVX, which explicitly targets embedded and real-time systems. We evaluate both the analytical and empirical real-time performance of our approach.
Doctor of Philosoph
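As a rough illustration of the sporadic task model and a total utilization test mentioned in the abstract above, here is a minimal Python sketch. It covers only the classic model, in which each task has a worst-case execution time and a minimum separation between releases. The GPU-specific parameters and the actual bound derived in the dissertation are not given in the abstract, so the class `SporadicTask` and the `bound` argument are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SporadicTask:
    wcet: float    # worst-case execution time of one job
    period: float  # minimum separation between consecutive job releases

    @property
    def utilization(self):
        # Long-run fraction of one processor this task can demand.
        return self.wcet / self.period


def total_utilization(tasks):
    """Sum of per-task utilizations for the whole task system."""
    return sum(t.utilization for t in tasks)


def passes_utilization_test(tasks, bound):
    """Necessary-condition check: total utilization must not exceed `bound`.

    `bound` is a hypothetical parameter; passing this test alone does not
    establish schedulability.
    """
    return total_utilization(tasks) <= bound


tasks = [SporadicTask(wcet=1.0, period=4.0), SporadicTask(wcet=2.0, period=8.0)]
```

With these two example tasks the total utilization is 0.5, so the check succeeds for any bound of at least 0.5 and fails below it.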
SIMULATION OF A MULTIPROCESSOR COMPUTER SYSTEM
The introduction of computers and software engineering in telephone
switching systems has dictated the need for powerful design aids
for such complex systems. Among these design aids, simulators -
real-time environment simulators and flat-level simulators - have
been found particularly useful in stored program controlled switching
systems design and evaluation. However, both types of simulators
suffer from certain disadvantages.
An alternative methodology to the simulation of stored program
controlled switching systems is proposed in this research. The
methodology is based on the development of a process-based multilevel
hierarchically structured software simulator. This methodology
eliminates the disadvantages of environment and flat-level simulators.
It enables the modelling of the system in a one-to-one transformation
process retaining the sub-system interfaces and, hence, making it
easier to see the resemblance between the model and modelled system
and to incorporate design modifications and/or additions in the
simulator.
This methodology has been applied in building a simulation package
for the System X family of exchanges. The Processor Utility Sub-system
used to control the exchanges is first simulated, verified and validated.
The application sub-systems models are then added one level higher,
resulting in an open-ended simulator having sub-systems models at
different levels of detail and capable of simulating any member of the
System X family of exchanges. The viability of the methodology is
demonstrated by conducting experiments to tune the real-time operating
system and by simulating a particular exchange - The Digital Main
Network Switching Centre - in order to determine its performance
characteristics.
The General Electric Company Ltd,
GEC Hirst Research Cent,
Wemble
Performance measurement and evaluation of time-shared operating systems
Time-shared, virtual memory systems
are very complex and changes in their performance may
be caused by many factors - by variations in the
workload as well as changes in system configuration.
The evaluation of these systems can thus best be
carried out by linking results obtained from a
planned programme of measurements, taken on the
system, to some model of it. Such a programme of
measurements is best carried out under conditions in
which all the parameters likely to affect the system's
performance are reproducible, and under the control of
the experimenter. In order that this be possible the
workload used must be simulated and presented to the
target system through some form of automatic
workload driver. A case study of such a methodology
is presented in which the system (in this case the
Edinburgh Multi-Access System) is monitored during a
controlled experiment (designed and analysed using
standard techniques in common use in many other branches
of experimental science) and the results so obtained
used to calibrate and validate a simple simulation
model of the system. This model is then used in
further investigation of the effect of certain system parameters upon the system performance. The
factors covered by this exercise include the effect
of varying: main memory size, process loading
algorithm and secondary memory characteristics.