    Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications

    Hybrid computer systems combine compute units (CUs) of different kinds, such as CPUs, GPUs and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of applications into balanced parallel tasks according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for hybrid GPU-CPU OpenMP applications mixed with GPU-oriented programming models (e.g. CUDA/HIP). It presents the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different kinds (GPUs and CPUs). The runtime support bridges OpenMP and HIP by introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task scheduling. We then extend the set of schedulers with two additional variants: a memorizing-dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and 2 × AMD Radeon Instinct MI50 GPUs with 32 GB, hybrid executions present speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs. This work was supported by the Spanish Ministry of Science and Technology (PID2019-107255GB). Peer Reviewed. Postprint (author's final draft).
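
    The idea behind a profile-based static task scheduling can be illustrated with a small sketch. This is not the paper's runtime: it is a generic greedy (longest-task-first) assignment, and the function name `profile_based_split`, the task costs and the per-unit throughputs are all hypothetical values assumed for illustration. It only shows how throughputs measured in a profiling run could be used to fix a balanced CPU/GPU task split before execution.

```python
def profile_based_split(task_costs, unit_throughputs):
    """Statically assign tasks to compute units using profiled throughputs.

    task_costs: estimated work per task (hypothetical, arbitrary work units).
    unit_throughputs: unit name -> work units per second, assumed to come
        from an earlier profiling run.
    Returns (assignment, finish_time): task indices per unit and the
    predicted busy time of each unit.
    """
    finish_time = {unit: 0.0 for unit in unit_throughputs}
    assignment = {unit: [] for unit in unit_throughputs}

    # Greedy longest-task-first: place the most expensive tasks first.
    order = sorted(range(len(task_costs)),
                   key=lambda i: task_costs[i], reverse=True)
    for i in order:
        # Pick the unit that would finish this task earliest.
        best = min(
            unit_throughputs,
            key=lambda u: finish_time[u] + task_costs[i] / unit_throughputs[u],
        )
        finish_time[best] += task_costs[i] / unit_throughputs[best]
        assignment[best].append(i)
    return assignment, finish_time


if __name__ == "__main__":
    # Hypothetical profile: the GPU is four times faster than the CPU pool.
    costs = [8, 4, 4, 2, 2]
    throughputs = {"gpu0": 4.0, "cpu": 1.0}
    assignment, finish = profile_based_split(costs, throughputs)
    print(assignment)
    print("predicted makespan:", max(finish.values()))
```

    With these made-up numbers the predicted makespan is 4.0 time units, against 5.0 if the GPU alone processed all 20 work units. The figures describe only this toy input and say nothing about the speedups measured in the paper.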

    Realizing a new paradigm in radiation therapy treatment planning

    This thesis investigates the feasibility of a new IMRT planning paradigm called Interactive Dose Shaping (IDS). The IDS paradigm enables the therapist to directly impose local dose features into the therapy plan. In contrast to the conventional IMRT planning approach, IDS does not employ an objective function to drive an iterative optimization procedure. In the first part of this work, the conventional IMRT plan optimization method is investigated. Concepts for a near-optimal implementation of the planning problem are provided. The second part of this work introduces the IDS concept. It is designed to overcome clinical drawbacks of the conventional method on the one hand and to provide interactive planning strategies which exploit the full potential of modern high-performance computer hardware on the other hand. The realization of the IDS concept consists of three main parts: (1) a two-step Dose Variation and Recovery (DVR) strategy which imposes localized plan features and recovers from unintentional plan modifications elsewhere; (2) a new dose calculation method; (3) the design of an IDS planning framework which provides a powerful graphical user interface. It could be shown that the IDS paradigm is able to reproduce conventionally optimized therapy plans and that the IDS concepts can be realized in real-time.

    Obtaining performance and programmability using reconfigurable hardware for media processing

    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2002. Includes bibliographical references (p. 127-132). An imperative requirement in the design of a reconfigurable computing system or in the development of a new application on such a system is performance gains. However, such developments suffer from a long and difficult programming process, hard-to-predict performance gains, and a limited scope of applications. To address these problems, we need to understand reconfigurable hardware's capabilities and limitations and its performance advantages and disadvantages, re-think reconfigurable system architectures, and develop new tools to explore its utility. We begin by examining performance contributors at the system level. We identify those from general-purpose and those from dedicated components. We propose an architecture that integrates reconfigurable hardware within the general-purpose framework. This is to avoid and minimize dedicated hardware and organization for programmability. We analyze reconfigurable logic architectures and their performance limitations. This analysis leads to a theory that reconfigurable logic can never be clocked faster than a fixed-logic design based on the same fabrication technology. Though the clock speed is highly unpredictable, we can obtain a quick upper bound estimate on it based on a few parameters. We also analyze microprocessor architectures and establish an analytical performance model. We use this model to estimate performance bounds using very little information on task properties. These bounds help us to detect potential memory-bound tasks. For a compute-bound task, we compare its performance upper bound with the upper bound on reconfigurable clock speed to further rule out unlikely speedup candidates. These performance estimates require very few parameters, and can be quickly obtained without writing software or hardware codes. They can be integrated with design tools as front end tools to explore speedup opportunities without costly trials. We believe this will broaden the applicability of reconfigurable computing. by Ling-Pei Kung. Ph.D.

    Thermal Management for Dependable On-Chip Systems

    This thesis addresses the dependability issues in on-chip systems from a thermal perspective. This includes an explanation and analysis of models to show the relationship between dependability and temperature. Additionally, multiple novel methods for on-chip thermal management are introduced, aiming to optimize thermal properties. Analysis of the methods is done through simulation and through infrared thermal camera measurements.

    A shared memory multi-microprocessor system with hardware supported message passing mechanisms.

    by Lam Chin Hung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1990. Bibliography: leaves 167-174.
    Contents: Abstract; Acknowledgements.
    Chapter 1. INTRODUCTION: 1.1 Gaining performance with multiprocessing; 1.1.1 Software approach; 1.1.2 Hardware approach; 1.2 Parallel processing; 1.3 Gaining performance with multiprocessing; 1.3.1 Multiprocessor configurations; 1.3.2 Multiprocessor design issues; 1.3.3 Using microprocessors; 1.3.4 Bus based systems; 1.4 Shared memory and message passing; 1.4.1 Shared memory; 1.4.2 Message passing; 1.4.3 Comparisons of the two paradigms; 1.5 Summary and comment.
    Chapter 2. AN OVERVIEW OF COMMON APPROACHES: 2.1 SUPRENUM; 2.2 MEMSY; 2.3 ELXSI; 2.4 Sequent; 2.5 YACKOS; 2.6 Summary.
    Chapter 3. THE MPC APPROACH: 3.1 A shared memory multiprocessor architecture; 3.2 Message passer for inter-process communication; 3.2.1 A review of the message passer approach; 3.2.2 Pit-falls of the message passer approach; 3.3 The role of the MPC; 3.3.1 The quest for the MPC; 3.3.2 Duties of the MPC; 3.3.2.1 Software aspects; 3.3.2.2 Hardware aspects; 3.4 Advantages and disadvantages; 3.4.1 Advantages; 3.4.2 Disadvantages; 3.4.3 Other discussions; 3.5 Summary.
    Chapter 4. THE DESIGN OF SM3: 4.1 Introduction to SM3; 4.2 Software aspects; 4.2.1 Programming model; 4.2.1.1 Logical entities; 4.2.1.2 Communication procedure; 4.2.2 Message structure; 4.2.2.1 Broadcast versus point-to-point messages; 4.2.2.2 Message priority; 4.2.2.3 Blocking versus non-blocking; 4.3 Hardware aspects; 4.3.1 Overall architecture; 4.3.2 The host machine; 4.3.3 Slave processor nodes; 4.3.4 The MPC; 4.4 Communication protocols; 4.4.1 Short and long messages; 4.4.2 Point-to-point messages; 4.4.3 1-to-N DMA for broadcast messages; 4.4.3.1 Introducing 1-to-N DMA; 4.4.3.2 1-to-N DMA operation; 4.4.3.3 Merits and demerits of 1-to-N DMA; 4.5 Summary.
    Chapter 5. IMPLEMENTATION ISSUES OF SM3: 5.1 The shared bus - VMEbus; 5.1.1 Why VMEbus; 5.1.2 Customizing the VMEbus; 5.2 The host machine; 5.3 Slave processor nodes; 5.3.1 Overview of a PN; 5.3.2 The MC68030 microprocessor; 5.3.3 The DMAC M68442; 5.3.4 Registers; 5.3.5 Shared-bus interface; 5.3.6 Communication logic; 5.4 The MPC; 5.4.1 Overview of the MPC; 5.4.2 Registers; 5.4.3 Communication logic; 5.5 Protocol implementation; 5.5.1 Point-to-point messages; 5.5.2 Broadcast messages; 5.5.2.1 Circular buffer queue; 5.5.2.2 Participating entities; 5.5.2.3 Protocol details; 5.6 System start-up procedure; 5.6.1 Power up reset of PNs; 5.6.2 Initialization of the processor pool; 5.7 Summary.
    Chapter 6. APPLICATION EXAMPLES: 6.1 Introduction; 6.2 Matrix Multiplication; 6.3 Parallel Quicksort; 6.4 Pipeline Problems.
    Chapter 7. UNSOLVED PROBLEMS AND FUTURE DEVELOPMENT: 7.1 Current Status; 7.2 Possible immediate enhancements; 7.2.1 Enhancement to the PNs; 7.2.2 Enhancement of the MPC; 7.2.3 Communication kernel enhancement; 7.3 Limitation of a shared bus; 7.4 Number crunching capability; 7.5 Parallel programming environment; 7.5.1 Conform to serial language; 7.5.2 Moving to parallel programming languages; 7.5.2.1 Uni-processor Unix; 7.5.2.2 Porting Unix; 7.5.2.3 Multiprocessor Unix; 7.5.3 Object-oriented approach; 7.6 Summary.
    Chapter 8. CONCLUSION: 8.1 Thesis summary; 8.2 Author's comment; 8.3 Looking into the future.
    Appendices: A. BLOCK DIAGRAM; B. CIRCUIT DIAGRAMS; C. PCB LAYOUT; D. VMEBUS ADDRESS MAP; E. PROCESSOR NODE ADDRESS MAP; F. REGISTER LAYOUT (F.1 Registers on a PN; F.2 Registers on the MPC); G. PAL DESIGN; H. COMMUNICATION SUB-BUS (H.1 Signal definition; H.2 Pin assignment); I. FEASIBILITY OF TASK DISTRIBUTION PLAN; J. COMMUNICATION PRIMITIVES; K. PHOTOGRAPHS OF SM3; L. PROTOCOL STATE DIAGRAMS (L.1 Predefined partial state diagrams; L.2 Point-to-point messages; L.3 Broadcast messages); M. BOOT-UP PROCEDURE OF SM3.
    Publications. References.

    On-demand distributed image processing over an adaptive Campus-Grid

    This thesis explores how scientific applications that are based upon short jobs (seconds and minutes) can capitalize upon the idle workstations of a Campus-Grid. These resources are donated on a voluntary basis, and consequently, the Campus-Grid is constantly adapting and the availability of workstations changes. Typically, to utilize these resources a Condor system or equivalent would be used. However, such systems are designed with different trade-offs and incentives in mind and therefore do not provide intrinsic support for short jobs. The motivation for creating a provisioning scenario for short jobs is that Image Processing, as well as other areas of scientific analysis, is typically composed of short running jobs, but still requires parallel solutions. Much of the literature in this area comments on the challenges of performing such analysis efficiently and effectively even when dedicated resources are in use. The main challenges are: latency and scheduling penalties, granularity and the potential for very short jobs. A volunteer Grid retains these challenges but also adds further ones. These can be summarized as: unpredictable resource availability and longevity, and multiple machine owners and administrators who directly affect the operating environment. Ultimately, this creates the requirement for well conceived and effective fault management strategies. However, these are typically not in place to enable transparent fault-free job administration for the user. This research demonstrates that these challenges are answerable, and that in doing so opportunistically sourced Campus-Grid resources can host disparate applications constituted of short running jobs, of as little as one second in length. This is demonstrated by the significant improvements in performance when the system presented here was compared to a well established Condor system. Here, improvements are increased job efficiency from 60–70% to 95–100%, up to a 99% reduction in application makespan and up to a 13000% increase in the efficiency of resource utilization. The Condor pool in use is approximately 1,600 workstations distributed across 27 administrative domains of Cardiff University. The application domain of this research is Matlab-based image processing, and the application area used to demonstrate the approach is the analysis of Magnetic Resonance Imagery (MRI). However, the presented approach is generalizable to any application domain with similar characteristics.

    Ad hoc cloud computing

    Commercial and private cloud providers offer virtualized resources via a set of co-located and dedicated hosts that are exclusively reserved for the purpose of offering a cloud service. While both cloud models appeal to the mass market, there are many cases where outsourcing to a remote platform or procuring an in-house infrastructure may not be ideal or even possible. To offer an attractive alternative, we introduce and develop an ad hoc cloud computing platform to transform spare resource capacity from an infrastructure owner’s locally available, but non-exclusive and unreliable infrastructure, into an overlay cloud platform. The foundation of the ad hoc cloud relies on transferring and instantiating lightweight virtual machines on-demand upon near-optimal hosts while virtual machine checkpoints are distributed in a P2P fashion to other members of the ad hoc cloud. Virtual machines found to be non-operational are restored elsewhere ensuring the continuity of cloud jobs. In this thesis we investigate the feasibility, reliability and performance of ad hoc cloud computing infrastructures. We firstly show that the combination of both volunteer computing and virtualization is the backbone of the ad hoc cloud. We outline the process of virtualizing the volunteer system BOINC to create V-BOINC. V-BOINC distributes virtual machines to volunteer hosts allowing volunteer applications to be executed in the sandbox environment to solve many of the downfalls of BOINC; this however also provides the basis for an ad hoc cloud computing platform to be developed. We detail the challenges of transforming V-BOINC into an ad hoc cloud and outline the transformational process and integrated extensions. These include a BOINC job submission system, cloud job and virtual machine restoration schedulers and a periodic P2P checkpoint distribution component. Furthermore, as current monitoring tools are unable to cope with the dynamic nature of ad hoc clouds, a dynamic infrastructure monitoring and management tool called the Cloudlet Control Monitoring System is developed and presented. We evaluate each of our individual contributions as well as the reliability, performance and overheads associated with an ad hoc cloud deployed on a realistically simulated unreliable infrastructure. We conclude that the ad hoc cloud is not only a feasible concept but also a viable computational alternative that offers high levels of reliability and can at least offer reasonable performance, which at times may exceed the performance of a commercial cloud infrastructure.