
    Shared versus distributed memory multiprocessors

    The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation-intensive tasks, and research and implementation trends are considered. It appears that the two types of systems will likely converge to a common form for large-scale multiprocessors.
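
    As a concrete illustration of the two programming models (not taken from the paper), the sketch below computes the same reduction twice using only the Python standard library: once in a shared-memory style, where every worker reads a common array directly, and once in a distributed-memory style simulated with threads that each own a slice of the data and pass partial results as explicit messages through a queue.

```python
import threading
import queue

data = list(range(16))
P = 4                       # number of "processors"
chunk = len(data) // P

# Shared-memory style: every worker addresses the whole array directly.
partial = [0] * P
def shared_worker(rank):
    partial[rank] = sum(data[rank * chunk:(rank + 1) * chunk])   # direct global access

threads = [threading.Thread(target=shared_worker, args=(r,)) for r in range(P)]
for t in threads: t.start()
for t in threads: t.join()
print("shared-memory total:", sum(partial))

# Distributed-memory style: each worker sees only its own slice and must
# send its partial sum to a collector as an explicit message.
inbox = queue.Queue()
def distributed_worker(rank, local_data):
    inbox.put((rank, sum(local_data)))        # explicit "message" to rank 0

threads = [threading.Thread(target=distributed_worker,
                            args=(r, data[r * chunk:(r + 1) * chunk])) for r in range(P)]
for t in threads: t.start()
for t in threads: t.join()
print("distributed-memory total:", sum(inbox.get()[1] for _ in range(P)))
```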

    Compiling global name-space programs for distributed execution

    Distributed memory machines do not provide hardware support for a global address space. Programmers are therefore forced to partition data across the memories of the architecture and use explicit message passing to communicate data between processors. The compiler support required to allow programmers to express their algorithms using a global name-space is examined. A general method is presented for the analysis of a high-level source program and its translation to a set of independently executing tasks communicating via messages. If the compiler has enough information, this translation can be carried out at compile time; otherwise, run-time code is generated to implement the required data movement. The analysis required in both situations is described, and the performance of the generated code on the Intel iPSC/2 is presented.
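
    A minimal sketch of the kind of translation described above, written as illustrative Python rather than the paper's compiler: a global-name-space stencil b[i] = a[i-1] + a[i+1] is turned, by hand here, into block-distributed local arrays, an ownership mapping, and a table of halo values that neighbouring processors must exchange before each block is computed purely locally.

```python
N, P = 16, 4
block = N // P

def owner(i):            # block distribution: processor that owns global index i
    return i // block

def to_local(i):         # index of global element i within its owner's local array
    return i % block

a = list(range(N))                                        # conceptual global array
local_a = [a[p * block:(p + 1) * block] for p in range(P)]  # what each processor stores

# "Compile-time" analysis: to compute b[i] = a[i-1] + a[i+1] over its own block,
# processor p needs the global elements just outside that block; their owners
# must send them (modelled here as a table of messages).
messages = {}                                   # (src, dst) -> value sent
for p in range(P):
    for g in (p * block - 1, (p + 1) * block):  # halo indices needed by p
        if 0 <= g < N:
            src = owner(g)
            messages[(src, p)] = local_a[src][to_local(g)]

# Each processor computes its block using only local data plus received halos.
local_b = []
for p in range(P):
    left = messages.get((p - 1, p))             # None at the array boundary
    right = messages.get((p + 1, p))
    ext = [left] + local_a[p] + [right]
    local_b.append([ext[j - 1] + ext[j + 1] for j in range(1, block + 1)
                    if ext[j - 1] is not None and ext[j + 1] is not None])

print(local_b)   # interior results of b[i] = a[i-1] + a[i+1], block by block
```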

    Rectilinear partitioning of irregular data parallel computations

    New mapping algorithms are described for domain-oriented data-parallel computations in which the workload is distributed irregularly throughout the domain but exhibits localized communication patterns. The problem considered is that of partitioning the domain for parallel processing so that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network. Discussed here are an improved algorithm for finding the optimal partitioning in one dimension, new algorithms for partitioning in two dimensions, and optimal partitioning in three dimensions. The application of these algorithms to real problems is discussed.
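
    The one-dimensional subproblem can be made concrete with a short sketch. The code below uses a standard binary search on the bottleneck value with a greedy feasibility check to split a row of per-cell workloads into p contiguous intervals whose heaviest interval is as light as possible; it illustrates the problem being solved, not necessarily the paper's improved algorithm.

```python
def min_bottleneck_partition(work, p):
    """Smallest possible load on the most heavily loaded of p contiguous intervals."""
    def feasible(cap):
        parts, current = 1, 0
        for w in work:
            if w > cap:
                return False              # a single cell already exceeds the cap
            if current + w > cap:
                parts += 1                # start a new interval
                current = w
            else:
                current += w
        return parts <= p

    lo, hi = max(work), sum(work)         # bottleneck lies between these bounds
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

workload = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(min_bottleneck_partition(workload, 3))   # -> 14, e.g. [3,1,4,1,5] [9,2] [6,5,3]
```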

    Probabilistic structural mechanics research for parallel processing computers

    Aerospace structures and spacecraft are a complex assemblage of structural components subjected to a variety of complex, cyclic, and transient loading conditions. Significant modeling uncertainties are present in these structures, in addition to the inherent randomness of material properties and loads. To properly account for these uncertainties in evaluating and assessing the reliability of these components and structures, probabilistic structural mechanics (PSM) procedures must be used. Much research has focused on basic theory development and on approximate analytic solution methods in random vibrations and structural reliability. Practical application of PSM methods has been hampered by their computationally intensive nature: solution of PSM problems requires repeated analyses of structures that are often large and exhibit nonlinear and/or dynamic response behavior. These methods are inherently parallel and ideally suited to implementation on parallel processing computers. New hardware architectures, innovative control software, and solution methodologies are needed to make solution of large-scale PSM problems practical.
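
    A minimal sketch of why such methods parallelize so naturally: a Monte Carlo estimate of a failure probability repeats the same structural analysis for many independent random samples, so the samples can be distributed across worker processes with no coordination between them. The limit-state function below is a toy stand-in, not taken from the paper.

```python
import random
from multiprocessing import Pool

def limit_state(sample_index):
    """One toy 'structural analysis': does a random load exceed a random capacity?"""
    random.seed(sample_index)             # independent, reproducible samples
    capacity = random.gauss(10.0, 1.5)    # random material strength
    load = random.gauss(7.0, 2.0)         # random applied load
    return 1 if load > capacity else 0    # 1 = failure for this realization

if __name__ == "__main__":
    n_samples = 100_000
    with Pool() as pool:                  # each analysis is independent, so farm them out
        failures = sum(pool.map(limit_state, range(n_samples), chunksize=1000))
    print("estimated failure probability:", failures / n_samples)
```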

    Virtual memory on a many-core NoC

    Many-core devices are likely to become increasingly common in real-time and embedded systems as computational demands grow and as expectations for higher performance can generally only be met by increasing core numbers rather than relying on higher clock speeds. Network-on-chip (NoC) devices, where multiple cores share a single slice of silicon and employ packetised communications, are a widely deployed many-core option for system designers. As NoCs are expected to run larger and more complex programs, the small amount of fast, on-chip memory available to each core is unlikely to be sufficient for all but the simplest of tasks, and it is necessary to find an efficient, effective, and time-bounded means of accessing resources stored in off-chip memory, such as DRAM or Flash storage. The abstraction of paged virtual memory is a familiar technique for managing similar tasks in general computing but has often been shunned by real-time developers because of concern about time predictability. We show it can be a poor choice for a many-core NoC system as, unmodified, it typically uses page sizes optimised for interaction with spinning disks rather than solid-state media, and transports significant volumes of subsequently unused data across already congested links. In this work we outline and simulate an efficient partial paging algorithm in which only those memory resources that are locally accessed are transported between global and local storage. We further show that smaller page sizes add to efficiency. We examine the factors that lead to timing delays in such systems, and show that we can predict worst-case execution times at even safety-critical thresholds by using statistical methods from extreme value theory. We also show these results are applicable to systems with a variety of connections to memory.
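
    A small, self-contained simulation (our own illustration, not the thesis's simulator) of the core idea of partial paging: on a miss, only the cache-line-sized block actually touched is transported across the NoC rather than a whole disk-oriented page, so sparse access patterns move far less data over congested links. The page and block sizes below are assumptions chosen for the example.

```python
PAGE_SIZE = 4096    # bytes: a disk-oriented page
BLOCK_SIZE = 64     # bytes: a cache-line-sized transfer unit

def traffic(addresses, unit):
    """Bytes moved off-chip if every first touch of a `unit`-sized region causes
    exactly one transfer of that region (no evictions modelled)."""
    fetched = set()
    moved = 0
    for addr in addresses:
        region = addr // unit
        if region not in fetched:
            fetched.add(region)
            moved += unit
    return moved

# A sparse access pattern: one word touched in each of 256 different pages.
accesses = [page * PAGE_SIZE + 128 for page in range(256)]
print("full-page paging:", traffic(accesses, PAGE_SIZE), "bytes")
print("partial paging  :", traffic(accesses, BLOCK_SIZE), "bytes")
```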

    Parallel sorting by regular sampling

    A new parallel sorting algorithm suitable for MIMD multiprocessors is presented. The algorithm reduces the memory and bus contention from which many parallel sorting algorithms suffer by using a regular sampling of the data to ensure good pivot selection. For n data elements to be sorted and p processors, the algorithm is shown to be asymptotically optimal when n ≥ p³. In theory, the algorithm is within a factor of two of achieving ideal load balancing. In practice, there is almost perfect partitioning of work. On a variety of shared and distributed memory machines, the algorithm achieves better than half-linear speedups.
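
    The phases of regular sampling can be sketched as a serial simulation: each "processor" sorts its block, contributes p evenly spaced samples, the sorted samples yield p-1 pivots, and every block is split at the pivots so that partition r can be merged by processor r. The code follows the usual published description of the technique but omits all real communication.

```python
from bisect import bisect_right
from heapq import merge
import random

def psrs(data, p):
    n = len(data)
    # Phase 1: each processor sorts its local block.
    blocks = [sorted(data[r * n // p:(r + 1) * n // p]) for r in range(p)]
    # Phase 2: p regular samples per block; every p-th sorted sample becomes a pivot.
    samples = sorted(b[j * len(b) // p] for b in blocks for j in range(p))
    pivots = [samples[r * p] for r in range(1, p)]
    # Phase 3: split every block at the pivots; piece r is destined for processor r.
    cuts = [[0] + [bisect_right(b, piv) for piv in pivots] + [len(b)] for b in blocks]
    # Phase 4: processor r merges the r-th piece of every block.
    result = []
    for r in range(p):
        result.extend(merge(*[b[c[r]:c[r + 1]] for b, c in zip(blocks, cuts)]))
    return result

data = [random.randrange(1000) for _ in range(40)]
assert psrs(data, 4) == sorted(data)
print("sorted correctly")
```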

    Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors

    Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN that provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.
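
    A minimal sketch of the bookkeeping such a language extension hides from the programmer: mapping a global array index to the processor that owns it and to the index inside that processor's local piece, under two common distributions. The names and the BLOCK/CYCLIC choices are generic illustrations of data-distribution facilities, not Vienna FORTRAN syntax.

```python
def block_owner(i, n, p):
    """BLOCK distribution: contiguous chunks of ceil(n/p) elements per processor."""
    size = -(-n // p)                 # ceil(n / p)
    return i // size, i % size        # (owning processor, local index)

def cyclic_owner(i, n, p):
    """CYCLIC distribution: elements dealt out round-robin across processors."""
    return i % p, i // p

n, p = 10, 4
for i in range(n):
    print(f"global {i}: BLOCK -> {block_owner(i, n, p)}, CYCLIC -> {cyclic_owner(i, n, p)}")
```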

    Optimal performance of distributed simulation programs

    This paper describes a technique to analyze the potential speedup of distributed simulation programs. A distributed simulation strategy is proposed which minimizes execution time through the use of an oracle to control the simulation. Because the strategy relies on an oracle, it cannot be used for practical simulations. However, the strategy facilitates performance evaluation of distributed simulation strategies by providing a useful point of comparison, and it can be used to determine the suitability of specific applications for implementation on a parallel computer. Based on the proposed strategy, a tool has been developed to determine the maximum performance that can be achieved from a distributed simulation program. In this paper we describe the technique and its use in evaluating the parallelism available in distributed simulators of parallel computer systems.
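
    A minimal sketch of the kind of bound such a tool computes: given event service times and the dependences between events, no synchronization strategy can finish faster than the longest (critical) path through the event graph, so the ratio of total work to critical-path length bounds the achievable speedup. The event graph below is a made-up example, not data from the paper.

```python
# Toy event graph: service time per event and the events it must wait for.
service = {"a": 2.0, "b": 1.0, "c": 3.0, "d": 2.0, "e": 1.5}
depends_on = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}

def finish_time(event):
    """Earliest time the event can complete with unlimited processors (an oracle)."""
    return service[event] + max((finish_time(d) for d in depends_on[event]), default=0.0)

critical_path = max(finish_time(e) for e in service)   # ideal parallel execution time
sequential = sum(service.values())                     # single-processor execution time
print(f"upper bound on speedup: {sequential / critical_path:.2f}")
```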

    The virtual time machine

    Existing multiprocessors and multicomputers require the programmer or compiler to perform data dependence analysis at compile time. We propose a parallel computer that performs this task at runtime. In particular, the Virtual Time Machine (VTM) detects violations of data dependence constraints as they occur and automatically recovers from them. A sophisticated memory system that is addressed using both a spatial and a temporal coordinate is used to efficiently implement this mechanism. Initially targeted for discrete event simulation applications, many of the ideas used in the machine architecture have direct application in the more general realm of parallel computation. The long-term goal of this work is to develop a general-purpose parallel computer that will support a wide range of parallel programming paradigms. This paper outlines the motivations behind the VTM architecture, the underlying computation model, a proposed implementation, and initial performance results. A recurring theme that pervades the entire paper is our contention that existing shared memory and message-based machines do not pay adequate attention to the dimension of time. We argue that this architectural deficiency is the underlying reason behind many difficult problems in parallel computation today.
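
    A deliberately simplified sketch (our illustration, not the VTM design itself) of a memory cell addressed by both a spatial location and a virtual-time coordinate: reads record the virtual time at which they were served, and a write that arrives "in the past" of an already-served read exposes a violated data dependence and returns the set of reads that must be rolled back.

```python
import bisect

class SpaceTimeCell:
    def __init__(self, initial):
        self.versions = [(0, initial)]   # (write_time, value), kept sorted by time
        self.read_times = []             # virtual times of reads served so far

    def read(self, t):
        """Return the latest version written at or before virtual time t."""
        self.read_times.append(t)
        idx = bisect.bisect_right(self.versions, (t, float("inf"))) - 1
        return self.versions[idx][1]

    def write(self, t, value):
        """Insert a version at time t; report reads that used a now-stale value."""
        late_readers = [r for r in self.read_times if r >= t]
        bisect.insort(self.versions, (t, value))
        return late_readers              # these reads must be rolled back and redone

cell = SpaceTimeCell(initial=0)
print(cell.read(t=10))                  # -> 0
print(cell.write(t=5, value=42))        # -> [10]: the read at time 10 violated a dependence
```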