36 research outputs found

    A note on implementing combining networks

    In shared-memory multiprocessors, combining networks serve to eliminate hot spots caused by concurrent access to the same memory location. Examples are the NYU Ultracomputer, the IBM RP3 and the Fluent Machine. We present a problem that occurs when one tries to implement the Fluent Machine's network nodes with network chips that do not know their position within the network. We formulate the problem mathematically and present two solutions. The first solution requires some additional hardware around the nodes, which can be placed outside the network chips. The second solution requires a minor modification of the routing algorithm, but one can prove that there is no performance loss.

    Fast parallel permutation algorithms

    We investigate the problem of permuting n data items on an EREW PRAM with p processors using little additional storage. We present a simple algorithm with run time O((n/p) log n) and an improved algorithm with run time O(n/p + log n log log(n/p)). Both algorithms require n additional global bits and O(1) local storage per processor. If prefix summation is supported at the instruction level, the run time of the improved algorithm is O(n/p). The algorithms can be used to rehash the address space of a PRAM emulation.
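
    To make the storage bound in the abstract above concrete, here is a minimal sequential sketch of permuting in place with one mark bit per item. It is not the paper's EREW PRAM algorithm: it only illustrates how n extra bits and O(1) working storage suffice in the sequential case. The function name `permute_in_place` and the convention that `target[i]` is the destination of item i are assumptions made for this illustration.

```python
def permute_in_place(items, target):
    """Move items[i] to position target[i] by following permutation cycles.

    Hypothetical helper for illustration only. `target` must be a
    permutation of range(len(items)). Extra storage: one mark per item
    (standing in for the n additional bits) plus one carried value.
    """
    n = len(items)
    done = bytearray(n)  # one mark per position; a bit vector would also do
    for start in range(n):
        if done[start]:
            continue  # this position was already filled by an earlier cycle
        carried = items[start]
        pos = start
        while True:
            dest = target[pos]
            # Drop the carried item at its destination and pick up
            # whatever was stored there.
            carried, items[dest] = items[dest], carried
            done[dest] = 1
            pos = dest
            if pos == start:
                break  # cycle closed
    return items
```

    Each position is written exactly once, so the sequential cost is linear in n; the parallel algorithms in the paper address how to divide this work among p processors without concurrent reads or writes.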

    Performance of MP3D on the SB-PRAM Prototype

    Expanded delta networks for very large parallel computers

    In this paper we analyze a generalization of the traditional delta network, introduced by Patel [21], and dubbed Expanded Delta Network (EDN). These networks in general provide multiple paths that can be exploited to reduce contention in the network, resulting in increased performance. The crossbar and traditional delta networks are limiting cases of this class of networks. However, the delta network does not provide the multiple paths that the more general expanded delta networks provide, and crossbars are too costly to use for large networks. The EDNs are analyzed with respect to their routing capabilities in the MIMD and SIMD models of computation. The concepts of capacity and clustering are also addressed. In massively parallel SIMD computers, the trend is to put a larger number of processors on a chip, but due to I/O constraints only a subset of the total number of processors may have access to the network. This is introduced as a Restricted Access Expanded Delta Network, of which the MasPar MP-1 router network is an example.

    Parallel software caches

    We investigate the construction and application of parallel software caches in shared memory multiprocessors. In contrast to maintaining a private cache for each thread, a parallel cache allows the re-use of results of lengthy computations by other threads. This is especially important in irregular applications where the re-use of intermediate results by scheduling is not possible. Example applications are the computation of intersections between a scanline and a polygon in computational geometry, and the computation of intersections between rays and objects in ray tracing. A parallel software cache is based on a readers/writers lock, i.e. as long as no thread alters the cache data structure, multiple threads may read simultaneously. If a thread wants to alter the cache because of a cache miss, it waits until all other threads have left the data structure; only then can it update the contents of the cache. Other threads can access the cache only after the writer has finished its work. To increase utilization, the cache has a number of slots that can be locked separately. We investigate the tradeoff between slot size, search time in the cache, and the time to re-compute a cache entry. Another major difference between sequential and parallel software caches is the replacement strategy. We adapt classic replacement strategies such as LRU and random replacement for parallel caches. As execution platform, we use the SB-PRAM, but the concepts might be portable to machines such as the NYU Ultracomputer, Tera MTA, and Stanford DASH.
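
    The locking scheme described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's SB-PRAM implementation: the class names `ReadersWriterLock` and `SlottedCache`, the hash-based slot choice, and the decision to recompute outside the lock are all hypothetical, and no replacement strategy (LRU or random) is modelled.

```python
import threading


class ReadersWriterLock:
    """Many concurrent readers or one exclusive writer (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            # The writer waits until all other threads have left the slot.
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()


class SlottedCache:
    """Cache split into slots, each guarded by its own readers/writer lock."""

    def __init__(self, num_slots=8):
        self._slots = [({}, ReadersWriterLock()) for _ in range(num_slots)]

    def get_or_compute(self, key, compute):
        table, lock = self._slots[hash(key) % len(self._slots)]
        lock.acquire_read()
        try:
            if key in table:
                return table[key]  # hit: shared read access
        finally:
            lock.release_read()
        value = compute(key)  # miss: recompute without holding the lock
        lock.acquire_write()
        try:
            # Another thread may have inserted the key meanwhile; keep its value.
            return table.setdefault(key, value)
        finally:
            lock.release_write()
```

    Locking each slot separately means a writer only blocks readers of its own slot, which is the utilization benefit the abstract mentions. This simple lock can starve writers under a steady stream of readers; a production design would need a fairness policy.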

    PROTOTYPING THE SIMULATION OF A GATE LEVEL LOGIC APPLICATION PROGRAM INTERFACE (API) ON AN EXPLICIT-MULTI-THREADED (XMT) COMPUTER

    Explicit-multi-threading (XMT) is a parallel programming approach for exploiting on-chip parallelism. Its fine-grained SPMD programming model is suitable for many compute-intensive applications. In this paper, we present a parallel gate-level logic simulation algorithm and study its implementation on an XMT processor. The test results show that speedups of several hundredfold can be achieved.

    Simulações de arquiteturas de memória cache para o multiprocessador Multiplus

    This paper analyses some alternatives for the MULTIPLUS cache memory system architecture. MULTIPLUS is a high performance multiprocessor system under development at NCE/UFRJ. The analysis is carried out using a simulator which supports different cache memory architecture configurations. The simulation experiments were done under 3 different situations: a system without cache, and the use of write back and write through cache control policies. The graphical simulation results show the system behaviour in relation to the average ratio of bus occupation and the average processor cycle length.

    Safe self-scheduling: A parallel loop scheduling scheme for shared-memory multiprocessors

    The article of record as published may be found at https://doi.org/10.1007/BF02577870
    In this paper we present Safe Self-Scheduling (SSS), a new scheduling scheme that schedules parallel loops with variable length iteration execution times not known at compile time. The scheme assumes a shared memory space. SSS combines static scheduling with dynamic scheduling and draws on the advantages of each. First, it reduces the dynamic scheduling overhead by statically scheduling a major portion of loop iterations. Second, the workload is balanced with a simple and efficient self-scheduling scheme by applying a new measure, the smallest critical chore size. Experimental results comparing SSS with other scheduling schemes indicate that SSS surpasses other scheduling schemes. In the experiment on Gauss-Jordan, an application that is suitable for static scheduling schemes, SSS is the only self-scheduling scheme that outperforms the static scheduling scheme. This indicates that SSS achieves a balanced workload with a very small amount of overhead.
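
    The general idea of mixing static and dynamic loop scheduling, as described in the abstract above, might be sketched like this. It is not the SSS scheme itself: the abstract does not define the smallest critical chore size, so the fixed `static_fraction` and `chunk` parameters below are placeholder assumptions, as is the function name `hybrid_schedule`.

```python
import threading


def hybrid_schedule(n_iters, n_workers, body, static_fraction=0.75, chunk=4):
    """Run body(i) for i in range(n_iters) on n_workers threads.

    Illustrative sketch only. A fixed share of the iterations is assigned
    statically (no run-time coordination); the remainder is handed out in
    small chunks from a shared counter so that faster workers take more.
    """
    per_worker = int(n_iters * static_fraction) // n_workers
    next_iter = per_worker * n_workers  # first dynamically scheduled iteration
    lock = threading.Lock()

    def grab():
        # Self-scheduling step: atomically claim the next chunk of iterations.
        nonlocal next_iter
        with lock:
            start = next_iter
            next_iter = min(n_iters, start + chunk)
            return start, next_iter

    def worker(w):
        # Static phase: a contiguous block fixed before the loop starts.
        for i in range(w * per_worker, (w + 1) * per_worker):
            body(i)
        # Dynamic phase: keep claiming chunks until none are left.
        while True:
            start, end = grab()
            if start >= end:
                return
            for i in range(start, end):
                body(i)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

    The static phase costs no synchronization, while the dynamic tail absorbs imbalance from iterations of varying length. How large the static portion should be is exactly the question the paper's measure is meant to answer.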