6 research outputs found

    Performance of FORTRAN floating-point operations on the Flex/32 multicomputer

    A series of experiments has been run to examine the floating-point performance of FORTRAN programs on the Flex/32 (Trademark) computer. The experiments are described, and the timing results are presented. The time required to execute a floating-point operation is found to vary considerably depending on a number of factors. One factor of particular interest from an algorithm design standpoint is the difference in speed between common memory accesses and local memory accesses. Common memory accesses were found to be slower, and guidelines are given for determining when it may be cost effective to copy data from common to local memory.

    Automated problem scheduling and reduction of synchronization delay effects

    It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable performance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because: (1) they provide a useful model problem for use in exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acyclic graphs, and (2) such efficient methods can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs.
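    The dependency structure this abstract refers to can be illustrated with level scheduling, a standard technique for sparse triangular solves; it is not the partitioning framework of the paper itself. The sketch below assumes a lower-triangular system stored row by row, and the names `level_schedule`, `solve_lower`, `rows` and `diag` are hypothetical. Rows placed in the same level have no dependencies on each other, so each level is a unit of work that could be solved in parallel between synchronization points.

```python
from collections import defaultdict


def level_schedule(deps):
    """Group rows of a lower-triangular system into dependency levels.

    deps[i] lists the rows j < i whose solution row i needs.
    Rows in the same level are mutually independent.
    """
    level = [0] * len(deps)
    for i, row_deps in enumerate(deps):
        for j in row_deps:
            # Row i must wait one level after its latest dependency.
            level[i] = max(level[i], level[j] + 1)
    groups = defaultdict(list)
    for i, lv in enumerate(level):
        groups[lv].append(i)
    return [groups[lv] for lv in sorted(groups)]


def solve_lower(rows, diag, b):
    """Solve L x = b, visiting rows level by level.

    rows[i] maps column j < i to the off-diagonal entry L[i][j];
    diag[i] is L[i][i]. Both layouts are assumptions of this sketch.
    """
    levels = level_schedule([list(r) for r in rows])
    x = [0.0] * len(b)
    for group in levels:
        # Every row in `group` could be handled by a different processor;
        # a barrier would separate consecutive levels.
        for i in group:
            s = sum(v * x[j] for j, v in rows[i].items())
            x[i] = (b[i] - s) / diag[i]
    return x, levels


# Small hypothetical system with four unknowns.
rows = [{}, {}, {0: 1.0}, {1: 2.0, 2: 1.0}]
diag = [2.0, 1.0, 1.0, 4.0]
b = [2.0, 3.0, 4.0, 10.0]
x, levels = solve_lower(rows, diag, b)
```

    Coarser granularity corresponds to merging several rows, or several levels, into one scheduled unit, which trades load balance against the number of synchronization points.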

    Principles for problem aggregation and assignment in medium scale multiprocessors

    One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable or too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs between balanced workload and the communication/synchronization costs are studied. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior.

    A Parallel Processor System for Nuclear Shell-Model Calculations

    This thesis describes the design and implementation of a dedicated parallel processor system for nuclear shell-model calculations. The purpose of these calculations is to determine nuclear energy eigenvalues by the tridiagonalisation of the nuclear Hamiltonian matrix using the Lanczos method. The Theoretical Nuclear Structure group at Glasgow University's Physics Department would normally perform this type of calculation on a high-performance mainframe computer. However, these machines have limitations which restrict the number and scope of the calculations that can be performed. The Shell Model Processor system consists of a Multiple Microprocessor Unit (MMPU) driven by a highly pipelined dedicated front-end processor. The MMPU has a modular, moderately coupled, MIMD architecture based on autonomous processing modules. The elements within the system communicate via three shared buses. The front-end is responsible for determining the position of non-zero elements within the Hamiltonian matrix. Once the position of an element has been found, it is passed to one of the free processing modules within the MMPU. The processing module then determines the value of the matrix element and performs the appropriate arithmetic to accumulate the resultant Lanczos vector. Two such processing modules have been developed. The most recently developed module is based on two MC68000 16/32-bit microprocessors. In addition, there are two supervisory processor modules, one of which controls the front-end and also assists it in its function. The other module has privileged system capabilities and is responsible for supervising the system as a whole. The system has been successfully tested and performance figures are presented. The future expansion of the system to allow it to perform larger calculations is also discussed.
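    For readers unfamiliar with the Lanczos method named in this abstract, the following is a minimal textbook sketch of Lanczos tridiagonalisation in plain Python. It is not the thesis's implementation: the function names `lanczos` and `dense_matvec`, the tolerance, and the small test matrix are assumptions made for illustration. The matrix enters only through a matrix-vector product, which is the step the abstract describes as accumulating the resultant Lanczos vector.

```python
import math


def dense_matvec(H):
    """Return a function computing H @ v for a dense symmetric matrix H."""
    return lambda v: [sum(h * x for h, x in zip(row, v)) for row in H]


def lanczos(matvec, v0, m, tol=1e-12):
    """Run up to m Lanczos steps on a symmetric operator.

    Returns (alpha, beta): the diagonal and off-diagonal entries of the
    tridiagonal matrix T. No reorthogonalisation is done, so this is
    only suitable for small, well-behaved examples.
    """
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * len(v)
    alpha, beta = [], []
    b = 0.0
    for _ in range(m):
        w = matvec(v)
        a = sum(wi * vi for wi, vi in zip(w, v))
        # Remove the components along the current and previous vectors.
        w = [wi - a * vi - b * pi for wi, vi, pi in zip(w, v, v_prev)]
        alpha.append(a)
        b = math.sqrt(sum(x * x for x in w))
        if b < tol:
            # The Krylov subspace is exhausted; T is complete.
            break
        beta.append(b)
        v_prev, v = v, [x / b for x in w]
    # T has one fewer off-diagonal entry than diagonal entries.
    return alpha, beta[: len(alpha) - 1]


# Hypothetical 2x2 symmetric matrix standing in for a Hamiltonian.
H = [[2.0, 1.0], [1.0, 2.0]]
alpha, beta = lanczos(dense_matvec(H), [1.0, 0.0], m=2)
```

    The eigenvalues of the resulting tridiagonal matrix approximate those of the original matrix; here the 2x2 case is reproduced exactly, giving eigenvalues 1 and 3.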

    A shared memory multi-microprocessor system with hardware supported message passing mechanisms.

    by Lam Chin Hung.
    Thesis (M.Phil.)--Chinese University of Hong Kong, 1990.
    Bibliography: leaves 167-174.

    ABSTRACT
    ACKNOWLEDGEMENTS
    TABLE OF CONTENTS
    CHAPTER 1: INTRODUCTION
      1.1 Gaining performance with multiprocessing
        1.1.1 Software approach
        1.1.2 Hardware approach
      1.2 Parallel processing
      1.3 Gaining performance with multiprocessing
        1.3.1 Multiprocessor configurations
        1.3.2 Multiprocessor design issues
        1.3.3 Using microprocessors
        1.3.4 Bus based systems
      1.4 Shared memory and message passing
        1.4.1 Shared memory
        1.4.2 Message passing
        1.4.3 Comparisons of the two paradigms
      1.5 Summary and comment
    CHAPTER 2: AN OVERVIEW OF COMMON APPROACHES
      2.1 SUPRENUM
      2.2 MEMSY
      2.3 ELXSI
      2.4 Sequent
      2.5 YACKOS
      2.6 Summary
    CHAPTER 3: THE MPC APPROACH
      3.1 A shared memory multiprocessor architecture
      3.2 Message passer for inter-process communication
        3.2.1 A review of the message passer approach
        3.2.2 Pit-falls of the message passer approach
      3.3 The role of the MPC
        3.3.1 The quest for the MPC
        3.3.2 Duties of the MPC
          3.3.2.1 Software aspects
          3.3.2.2 Hardware aspects
      3.4 Advantages and disadvantages
        3.4.1 Advantages
        3.4.2 Disadvantages
        3.4.3 Other discussions
      3.5 Summary
    CHAPTER 4: THE DESIGN OF SM3
      4.1 Introduction to SM3
      4.2 Software aspects
        4.2.1 Programming model
          4.2.1.1 Logical entities
          4.2.1.2 Communication procedure
        4.2.2 Message structure
          4.2.2.1 Broadcast versus point-to-point messages
          4.2.2.2 Message priority
          4.2.2.3 Blocking versus non-blocking
      4.3 Hardware aspects
        4.3.1 Overall architecture
        4.3.2 The host machine
        4.3.3 Slave processor nodes
        4.3.4 The MPC
      4.4 Communication protocols
        4.4.1 Short and long messages
        4.4.2 Point-to-point messages
        4.4.3 1-to-N DMA for broadcast messages
          4.4.3.1 Introducing 1-to-N DMA
          4.4.3.2 1-to-N DMA operation
          4.4.3.3 Merits and demerits of 1-to-N DMA
      4.5 Summary
    CHAPTER 5: IMPLEMENTATION ISSUES OF SM3
      5.1 The shared bus - VMEbus
        5.1.1 Why VMEbus
        5.1.2 Customizing the VMEbus
      5.2 The host machine
      5.3 Slave processor nodes
        5.3.1 Overview of a PN
        5.3.2 The MC68030 microprocessor
        5.3.3 The DMAC M68442
        5.3.4 Registers
        5.3.5 Shared-bus interface
        5.3.6 Communication logic
      5.4 The MPC
        5.4.1 Overview of the MPC
        5.4.2 Registers
        5.4.3 Communication logic
      5.5 Protocol implementation
        5.5.1 Point-to-point messages
        5.5.2 Broadcast messages
          5.5.2.1 Circular buffer queue
          5.5.2.2 Participating entities
          5.5.2.3 Protocol details
      5.6 System start-up procedure
        5.6.1 Power up reset of PNs
        5.6.2 Initialization of the processor pool
      5.7 Summary
    CHAPTER 6: APPLICATION EXAMPLES
      6.1 Introduction
      6.2 Matrix Multiplication
      6.3 Parallel Quicksort
      6.4 Pipeline Problems
    CHAPTER 7: UNSOLVED PROBLEMS AND FUTURE DEVELOPMENT
      7.1 Current Status
      7.2 Possible immediate enhancements
        7.2.1 Enhancement to the PNs
        7.2.2 Enhancement of the MPC
        7.2.3 Communication kernel enhancement
      7.3 Limitation of a shared bus
      7.4 Number crunching capability
      7.5 Parallel programming environment
        7.5.1 Conform to serial language
        7.5.2 Moving to parallel programming languages
          7.5.2.1 Uni-processor Unix
          7.5.2.2 Porting Unix
          7.5.2.3 Multiprocessor Unix
        7.5.3 Object-oriented approach
      7.6 Summary
    CHAPTER 8: CONCLUSION
      8.1 Thesis summary
      8.2 Author's comment
      8.3 Looking into the future
    APPENDIX A: BLOCK DIAGRAM
    APPENDIX B: CIRCUIT DIAGRAMS
    APPENDIX C: PCB LAYOUT
    APPENDIX D: VMEBUS ADDRESS MAP
    APPENDIX E: PROCESSOR NODE ADDRESS MAP
    APPENDIX F: REGISTER LAYOUT
      F.1 Registers on a PN
      F.2 Registers on the MPC
    APPENDIX G: PAL DESIGN
    APPENDIX H: COMMUNICATION SUB-BUS
      H.1 Signal definition
      H.2 Pin assignment
    APPENDIX I: FEASIBILITY OF TASK DISTRIBUTION PLAN
    APPENDIX J: COMMUNICATION PRIMITIVES
    APPENDIX K: PHOTOGRAPHS OF SM3
    APPENDIX L: PROTOCOL STATE DIAGRAMS
      L.1 Predefined partial state diagrams
      L.2 Point-to-point messages
      L.3 Broadcast messages
    APPENDIX M: BOOT-UP PROCEDURE OF SM3
    PUBLICATIONS
    REFERENCES