
    GPRM: a high performance programming framework for manycore processors

    Processors with large numbers of cores are becoming commonplace. In order to utilise the available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge. In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks and their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads. We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, simply by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations. We use OpenMP, the most popular model for shared-memory parallel programming, as the main competitor to GPRM for solving three well-known problems on both platforms: LU factorisation of sparse matrices, image convolution, and linked list processing. We focus on proposing solutions that best fit GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for LU factorisation results in a notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the image convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for linked list processing and performs better than OpenMP implementations on the Xeon Phi. The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without sacrificing productivity.
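
The abstract does not give the GPRM API, so the following is only a minimal C++ sketch of the granularity idea it describes: the per-task overhead of a 2D convolution is amortised by fusing several rows into each task, with `rows_per_task` playing the role of the task-count knob. All names here are illustrative, not GPRM constructs.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Convolve rows [r0, r1) of 'in' with a 3x3 kernel into 'out' (image borders skipped).
void convolve_rows(const std::vector<float>& in, std::vector<float>& out,
                   const float (*kernel)[3], int width, int height,
                   int r0, int r1) {
    for (int r = std::max(r0, 1); r < std::min(r1, height - 1); ++r)
        for (int c = 1; c < width - 1; ++c) {
            float acc = 0.0f;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    acc += kernel[dr + 1][dc + 1] * in[(r + dr) * width + (c + dc)];
            out[r * width + c] = acc;
        }
}

// Fuse 'rows_per_task' rows into each task: larger values amortise task
// creation/scheduling overhead, smaller values expose more parallelism.
void parallel_convolve(const std::vector<float>& in, std::vector<float>& out,
                       const float (*kernel)[3], int width, int height,
                       int rows_per_task) {
    std::vector<std::thread> tasks;
    for (int r = 0; r < height; r += rows_per_task)
        tasks.emplace_back(convolve_rows, std::cref(in), std::ref(out),
                           kernel, width, height,
                           r, std::min(r + rows_per_task, height));
    for (auto& t : tasks) t.join();
}
```

Sweeping `rows_per_task` is, in the abstract's terms, the same experiment as controlling the number of tasks.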

    Sensor web geoprocessing on the grid

    Recent standardisation initiatives in the fields of grid computing and geospatial sensor middleware provide an exciting opportunity for the composition of large scale geospatial monitoring and prediction systems from existing components. Sensor middleware standards are paving the way for the emerging sensor web, which is envisioned to make millions of geospatial sensors and their data publicly accessible by providing discovery, tasking and query functionality over the internet. In a similar fashion, concurrent development is taking place in the field of grid computing, whereby the virtualisation of computational and data storage resources using middleware abstraction provides a framework to share computing resources. Sensor web and grid computing share a common vision of world-wide connectivity, and in their current form they are both realised using web services as the underlying technological framework. The integration of sensor web and grid computing middleware using open standards is expected to facilitate interoperability and scalability in near real-time geoprocessing systems. The aim of this thesis is to develop an appropriate conceptual and practical framework in which open standards in grid computing, sensor web and geospatial web services can be combined as a technological basis for the monitoring and prediction of geospatial phenomena in the earth systems domain, to facilitate real-time decision support. The primary topic of interest is how real-time sensor data can be processed on a grid computing architecture. This is addressed by creating a simple typology of real-time geoprocessing operations with respect to grid computing architectures. A geoprocessing system exemplar of each geoprocessing operation in the typology is implemented using contemporary tools and techniques, which provides a basis from which to validate the standards frameworks and highlight issues of scalability and interoperability. It was found that it is possible to combine standardised web services from each of these domains, despite interoperability issues resulting from differences in web service style and security between specifications. A novel integration method for the continuous processing of a sensor observation stream is suggested, in which a perpetual processing job is submitted as a single continuous compute job. Although this method was found to be successful, two key challenges remain: a mechanism for consistently scheduling real-time jobs within an acceptable time-frame must be devised, and the trade-off between efficient grid resource utilisation and processing latency must be balanced. The lack of actual implementations of distributed geoprocessing systems built using sensor web and grid computing has hindered the development of standards, tools and frameworks in this area. This work provides a contribution to the small number of existing implementations in this field by identifying potential workflow bottlenecks in such systems and gaps in the existing specifications. Furthermore, it sets out a typology of real-time geoprocessing operations which is anticipated to facilitate the development of real-time geoprocessing software. EThOS - Electronic Theses Online Service; Engineering and Physical Sciences Research Council (EPSRC); School of Civil Engineering & Geosciences, Newcastle University
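
As a rough illustration of the "perpetual processing job" integration method mentioned above, the sketch below shows a single long-running job that repeatedly pulls new observations and processes them; `fetch_new_observations` and `process` are hypothetical stand-ins for a sensor-observation client and a geoprocessing routine, not real middleware APIs.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

struct Observation { long long timestamp_ms; double value; };

// Hypothetical stand-in for a query against a sensor observation stream.
std::vector<Observation> fetch_new_observations(long long since_ms) {
    (void)since_ms;
    return {};  // a real client would return observations newer than since_ms
}

// Hypothetical stand-in for the geoprocessing applied to each batch.
void process(const std::vector<Observation>& batch) {
    std::printf("processing %zu observations\n", batch.size());
}

int main() {
    long long last_seen_ms = 0;
    const auto poll_interval = std::chrono::seconds(10);
    for (;;) {  // the job never terminates: it is submitted once as a single compute job
        auto batch = fetch_new_observations(last_seen_ms);
        if (!batch.empty()) {
            last_seen_ms = batch.back().timestamp_ms;
            process(batch);
        }
        // Longer intervals use grid resources more efficiently; shorter ones
        // reduce processing latency (the trade-off noted in the abstract).
        std::this_thread::sleep_for(poll_interval);
    }
}
```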

    Interactive parallel fluid solver using the Lattice-Boltzmann method and CUDA

    Particle methods have been gaining momentum in the Computational Fluid Dynamics world for their computational simplicity and inherent parallelism. This has led the field to jump to new hardware technologies that exploit this parallelism in a whole new way. This project is an interactive parallel implementation of the Lattice Boltzmann Method that runs on a highly parallel architecture, a modern GPU, and uses CUDA technology for its implementation. First the mathematical concepts of the method are introduced, then the implementation is described using the SIMD paradigm, and finally the visualization and interactivity are addressed in a simple way.
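
For orientation, here is a minimal serial C++ sketch of the D2Q9 BGK collision step at the core of the Lattice Boltzmann Method; in a CUDA implementation such as the one described above, the same per-cell arithmetic would typically be executed by one GPU thread per lattice cell, SIMD-style. This is textbook LBM, not code from the project.

```cpp
#include <cstdio>

constexpr int Q = 9;                                   // D2Q9 lattice
constexpr double w[Q]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                          1.0/36, 1.0/36, 1.0/36, 1.0/36};
constexpr int ex[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};  // lattice velocities
constexpr int ey[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

// BGK collision for one cell: relax distributions f[0..8] towards equilibrium.
void collide_cell(double f[Q], double tau) {
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < Q; ++i) {                      // macroscopic moments
        rho += f[i];
        ux  += f[i] * ex[i];
        uy  += f[i] * ey[i];
    }
    ux /= rho; uy /= rho;
    const double usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; ++i) {
        const double eu  = ex[i] * ux + ey[i] * uy;
        const double feq = w[i] * rho * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * usq);
        f[i] -= (f[i] - feq) / tau;                    // single-relaxation-time step
    }
}

int main() {
    double f[Q];
    for (int i = 0; i < Q; ++i) f[i] = w[i];           // fluid at rest, rho = 1
    collide_cell(f, 0.6);
    std::printf("f0 after collision = %f\n", f[0]);    // unchanged at equilibrium
}
```

The full method alternates this collision with a streaming step that shifts each f[i] to the neighbouring cell in direction (ex[i], ey[i]).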

    Running parallel applications on a heterogeneous environment with accessible development practices and automatic scalability

    Grid computing makes it possible to gather large quantities of resources to work on a problem. In order to exploit this potential, a framework that presents the resources to the user programmer in a form that maintains productivity is necessary. The framework must not only provide accessible development, but it must also make efficient use of the resources. The Seeds framework is proposed. It uses the current Grid and distributed computing middleware to provide a parallel programming environment to a wider community of programmers. The framework was used to investigate the feasibility of scaling skeleton/pattern parallel programming into Grid computing. The research accomplished two goals: it made parallel programming on the Grid more accessible to domain-specific programmers, and it made parallel programs scale on a heterogeneous resource environment. Programming is made easier for the programmer by using skeleton- and pattern-based programming approaches that effectively isolate the program from the environment. To extend the pattern approach, the pattern adder operator is proposed, implemented and tested. The results show the pattern operator can reduce the number of lines of code compared with an MPJ-Express implementation of a stencil algorithm, while having an overhead of at most ten microseconds per iteration. The research in scalability involved adapting existing load-balancing techniques to skeletons and patterns, requiring little additional configuration on the part of the programmer. The hierarchical dependency concept is proposed as well, which uses a streamed data flow programming model. The concept introduces data flow computation hibernation and dependencies that can split to accommodate additional processors. The results from implementing skeletons/patterns on hierarchical dependencies show that an 18.23% increase in code is necessary to enable automatic scalability. The concept can increase speedup depending on the algorithm and grain size.
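
The Seeds framework itself is Java-based and its exact interface is not given in this abstract; the C++ sketch below only illustrates the skeleton/pattern idea it relies on: the programmer fills in problem-specific hooks for a workpool-style pattern, while data distribution, execution and merging stay inside the framework (local threads here, Grid middleware in the Seeds case). The hook names are illustrative.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// The user implements only these hooks; threads, sockets and middleware
// remain hidden behind the framework.
template <typename Data>
struct Pattern {
    virtual ~Pattern() = default;
    virtual std::vector<Data> diffuse(const Data& global, int workers) = 0; // split work
    virtual Data compute(const Data& chunk) = 0;                            // local step
    virtual Data gather(const std::vector<Data>& partials) = 0;             // merge
};

// Framework-side driver: farm the chunks out to workers and merge the results.
template <typename Data>
Data run_pattern(Pattern<Data>& p, const Data& global, int workers) {
    std::vector<Data> chunks = p.diffuse(global, workers);
    std::vector<Data> partials(chunks.size());
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        pool.emplace_back([&, i] { partials[i] = p.compute(chunks[i]); });
    for (auto& t : pool) t.join();
    return p.gather(partials);
}
```

A stencil pattern would additionally exchange halo regions between iterations, and a hierarchical-dependency scheme like the one described above would let the framework hibernate or split such data flows as processors come and go.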

    Reasoning about correctness properties of a coordination programming language

    Safety critical systems place additional requirements on the programming language used to implement them compared with traditional environments. Examples of features that influence the suitability of a programming language in such environments include complexity of definitions, expressive power, bounded space and time, and verifiability. Hume is a novel programming language whose design targets the first three of these in some ways contradictory features: fully expressive languages cannot guarantee bounds on time and space, and low-level languages which can guarantee space and time bounds are often complex and thus error-prone. In Hume, this contradiction is resolved by a two-layered architecture: a high-level, fully expressive language is built on top of a low-level coordination language which can guarantee space and time bounds. This thesis explores the verification of Hume programs. It targets safety properties, which are the most important type of correctness properties, of the low-level coordination language, which is believed to be the most error-prone. Deductive verification in Lamport's temporal logic of actions (TLA) is utilised, in turn validated through algorithmic experiments. This deductive verification is mechanised by first embedding TLA in the Isabelle theorem prover, and then embedding Hume on top of this. Verification of temporal invariants is explored in this setting. In Hume, program transformation is a key feature, often required to guarantee space and time bounds of high-level constructs. Verification of transformations is thus an integral part of this thesis. The work with both invariant verification and, in particular, transformation verification has pinpointed several weaknesses of the Hume language. Motivated and influenced by this, an extension to Hume, called Hierarchical Hume, is developed and embedded in TLA. Several case studies of transformation and invariant verification of Hierarchical Hume in Isabelle are conducted, and an approach towards a calculus for transformations is examined. James Watt Scholarship; Engineering and Physical Sciences Research Council (EPSRC) Platform grant GR/SO177
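
For readers unfamiliar with TLA, the temporal invariant verification described here ultimately rests on the standard invariance rule, shown below in LaTeX: an invariant that holds in the initial state and is preserved by every step (including stuttering steps) holds throughout the computation.

```latex
% Standard TLA invariance rule (Lamport): proves a temporal invariant \Box Inv
% of a specification of the form Init /\ \Box[Next]_vars.
\[
\frac{\mathit{Init} \Rightarrow \mathit{Inv}
      \qquad
      \mathit{Inv} \land [\mathit{Next}]_{\mathit{vars}} \Rightarrow \mathit{Inv}'}
     {\mathit{Init} \land \Box[\mathit{Next}]_{\mathit{vars}} \Rightarrow \Box\,\mathit{Inv}}
\]
```

In a mechanised setting such as the Isabelle embedding described above, proving an invariant amounts to discharging the two premises for the initial condition and next-state relation produced by the embedding.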

    Runtime Management of Multiprocessor Systems for Fault Tolerance, Energy Efficiency and Load Balancing

    Efficiency of modern multiprocessor systems is hurt by unpredictable events: aging causes permanent faults that disable components; application spawnings and terminations, taking place at arbitrary times, affect energy proportionality and cause energy waste; load imbalances reduce resource utilization, penalizing performance. This thesis demonstrates how runtime management can mitigate the negative effects of unpredictable events, making decisions guided by a combination of static information known in advance and parameters that only become known at runtime. We propose techniques for three different objectives: graceful degradation of aging-prone systems; energy efficiency of heterogeneous adaptive systems; and load balancing by means of work stealing. Managing aging-prone systems for graceful efficiency degradation is based on a high-level system description that encapsulates hardware reconfigurability and workload flexibility and allows system efficiency to be quantified and used as an objective function. Different custom heuristics, as well as simulated annealing and a genetic algorithm, are proposed to optimize this objective function in response to component failures. The custom heuristics are one to two orders of magnitude faster, provide better efficiency for the first 20% of system lifetime and are less than 13% worse than the genetic algorithm at the end of this lifetime. The custom heuristics occasionally fail to satisfy reconfiguration cost constraints; as all algorithms' execution times scale well with system size, the genetic algorithm can be used as a backup in these cases. Managing heterogeneous multiprocessors capable of Dynamic Voltage and Frequency Scaling is based on a model that accurately predicts performance and power: performance is predicted by combining static, application-specific profiling information and dynamic, runtime performance monitoring data; power is predicted using these performance estimations and a set of platform-specific, static parameters, determined only once and used for every application mix. Three runtime heuristics are proposed that use this model to perform a partial search of the configuration space, evaluating a small set of configurations and selecting the best one. When best-effort performance is adequate, the proposed approach achieves 3% higher energy efficiency than the powersave governor and 2x higher than the interactive and ondemand governors. When individual applications' performance requirements are considered, the proposed approach is able to satisfy them, giving away 18% of the system's energy efficiency compared to powersave, which however misses the performance targets by 23%; at the same time, the proposed approach maintains an efficiency advantage of about 55% over the other governors, which also satisfy the requirements. Lastly, to improve load balancing of multiprocessors, a partial and approximate view of the current load distribution among system cores is proposed, which consists of lightweight data structures and is maintained by each core through cheap operations. A runtime algorithm is developed that uses this view, whenever a core becomes idle, to select a victim core for work stealing, also considering system topology and memory hierarchy. Among 12 diverse imbalanced workloads, the proposed approach achieves better performance than random, hierarchical and local stealing for six workloads. Furthermore, it is at most 8% slower on the other six workloads, while each competing strategy incurs a penalty of at least 89% on some workload.
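
The abstract does not give the victim-selection algorithm itself, so the C++ sketch below only illustrates the general idea: each core publishes an approximate, possibly stale load estimate through cheap atomic operations, and an idle core combines these estimates with a topology distance to pick a victim. The scoring formula and constants are invented for illustration.

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <cstdio>

constexpr int kCores = 16;

// Approximate load view: each core occasionally publishes its queue length;
// readers tolerate stale values, so no locks are involved.
std::array<std::atomic<uint32_t>, kCores> approx_load;

// Toy topology distance: cores in the same group of four are "near".
int distance(int a, int b) { return (a / 4 == b / 4) ? 0 : 1; }

// Pick the victim with the best trade-off of estimated load vs. distance.
int select_victim(int thief) {
    int best = -1;
    long best_score = -1;
    for (int c = 0; c < kCores; ++c) {
        if (c == thief) continue;
        long load = approx_load[c].load(std::memory_order_relaxed);
        if (load == 0) continue;                     // nothing worth stealing
        long score = 4 * load - distance(thief, c);  // prefer loaded, nearby cores
        if (score > best_score) { best_score = score; best = c; }
    }
    return best;  // -1: no promising victim; fall back to e.g. random stealing
}

int main() {
    for (auto& a : approx_load) a.store(0);
    approx_load[2].store(5);    // nearby core with some queued work
    approx_load[9].store(30);   // remote core with much more queued work
    std::printf("core 0 steals from core %d\n", select_victim(0));
}
```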

    Fault-tolerant parallel applications using a network of workstations

    It is becoming common to employ a Network Of Workstations, often referred to as a NOW, for general purpose computing, since the allocation of an individual workstation offers good interactive response. However, there may still be a need to perform very large scale computations which exceed the resources of a single workstation. It may be that the amount of processing implies an inconveniently long duration or that the data manipulated exceeds available storage. One possibility is to employ a more powerful single machine for such computations. However, there is growing interest in seeking a cheaper alternative by harnessing the significant idle time often observed in a NOW and also possibly employing a number of workstations in parallel on a single problem. Parallelisation permits use of the combined memories of all participating workstations, but also introduces a need for communication, and success in any hardware environment depends on the amount of communication relative to the amount of computation required. In the context of a NOW, much success is reported with applications which have low communication requirements relative to computation requirements. Here it is claimed that there is reason to investigate the use of a NOW for parallel execution of computations which are demanding in storage, potentially even exceeding the sum of memory in all available workstations. Another consideration is that where a computation is of sufficient scale, some provision for tolerating partial failures may be desirable. However, generic support for storage management and fault-tolerance in computations of this scale for a NOW is not currently available, and the suitability of a NOW for solving such computations has not been investigated to any large extent. The work described here is concerned with these issues. The approach employed is to make use of an existing distributed system which supports nested atomic actions (atomic transactions) to structure fault-tolerant computations with persistent objects. This system is used to develop a fault-tolerant "bag of tasks" computation model, where the bag and shared objects are located on secondary storage. In order to understand the factors that affect the performance of large parallel computations on a NOW, a number of specific applications are developed. The performance of these applications is analysed using a semi-empirical model. The same measurements underlying these performance predictions may be employed in estimating the performance of alternative application structures. Using services provided by the distributed system referred to above, each application is implemented. The implementation allows verification of predicted performance and also permits identification of issues regarding construction of components required to support the chosen application structuring technique. The work demonstrates that a NOW certainly offers some potential for gain through parallelisation and that, for large grain computations, the cost of implementing fault tolerance is low. Engineering and Physical Sciences Research Council
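
As a point of reference, here is a minimal shared-memory C++ sketch of the "bag of tasks" structure named above. In the thesis the bag and the shared result objects are persistent (on secondary storage) and each take-compute-update step runs inside a nested atomic action, so a workstation failure simply aborts the action and returns the task to the bag; that transactional machinery is only hinted at in the comments here.

```cpp
#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

struct Task { int lo, hi; };                       // e.g. a slice of the work

class Bag {
    std::deque<Task> q_;
    std::mutex m_;
public:
    void put(Task t) { std::lock_guard<std::mutex> g(m_); q_.push_back(t); }
    std::optional<Task> take() {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return std::nullopt;
        Task t = q_.front(); q_.pop_front();
        return t;
    }
};

int main() {
    Bag bag;
    for (int i = 0; i < 64; ++i) bag.put({i * 1000, (i + 1) * 1000});
    std::atomic<long long> total{0};

    auto worker = [&] {
        while (auto t = bag.take()) {              // begin atomic action here
            long long s = 0;
            for (int x = t->lo; x < t->hi; ++x) s += x;
            total += s;                            // commit (abort on failure would
        }                                          // restore the task to the bag)
    };
    std::vector<std::thread> ws;
    for (unsigned i = 0; i < 4; ++i) ws.emplace_back(worker);
    for (auto& w : ws) w.join();
    std::printf("total = %lld\n", total.load());
}
```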