The M-Machine Multicomputer

Author: Carter Nicholas P.
Chang Andrew
Dally William J.
Fillo Marco
Gurevich Yevgeny
Keckler Stephen W.
Lee Whay S.
Publication date: 01/01/1995

The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are connected with a 3-D mesh network; each node is a multithreaded processor incorporating 12 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user-accessible message-passing system yields fast communication and synchronization between nodes. Rapid access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms maximize both single-thread performance and overall system throughput.

CiteSeerX

DSpace@MIT

Arithmetic on a Distributed-Memory Quantum Multicomputer

Author: Kae Nemoto
Kohei M. Itoh
Rodney Van Meter
Szymanski T.
Thaker D. D.
W. J. Munro
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 06/03/2007

We evaluate the performance of quantum arithmetic algorithms run on a distributed quantum computer (a quantum multicomputer). We vary the node capacity and I/O capabilities, and the network topology. The tradeoff of choosing between gates executed remotely, through "teleported gates" on entangled pairs of qubits (telegate), versus exchanging the relevant qubits via quantum teleportation, then executing the algorithm using local gates (teledata), is examined. We show that the teledata approach performs better, and that carry-ripple adders perform well when the teleportation block is decomposed so that the key quantum operations can be parallelized. A node size of only a few logical qubits performs adequately provided that the nodes have two transceiver qubits. A linear network topology performs acceptably for a broad range of system sizes and performance parameters. We therefore recommend pursuing small, high-I/O bandwidth nodes and a simple network. Such a machine will run Shor's algorithm for factoring large numbers efficiently.

Comment: 24 pages, 10 figures, ACM transactions format. Extended version of Int. Symp. on Comp. Architecture (ISCA) paper; v2, correct one circuit error, numerous small changes for clarity, add reference.

arXiv.org e-Print Archive

Crossref

Method of up-front load balancing for local memory parallel processors

Author: Baffes Paul Thomas
Publication date: 24/04/1990

In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of sixty to seventy-five percent.

NASA Technical Reports Server

A study of the selection of microcomputer architectures to automate planetary spacecraft power systems

Author: Nauda A.

Performance and reliability models of alternate microcomputer architectures were examined as a methodology for optimizing system design. A methodology for selecting an optimum microcomputer architecture for autonomous operation of planetary spacecraft power systems was developed. Various microcomputer system architectures are analyzed to determine their application to spacecraft power systems. It is suggested that no standardization formula or common set of guidelines exists that provides an optimum configuration for a given set of specifications.

NASA Technical Reports Server

A Jacobi-based algorithm for computing symmetric eigenvalues and eigenvectors in a two-dimensional mesh

Author: González Colás Antonio María
Royo Vallés María Dolores
Valero García Miguel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998

The paper proposes an algorithm for computing symmetric eigenvalues and eigenvectors that uses a one-sided Jacobi approach and is targeted to a multicomputer in which nodes can be arranged as a two-dimensional mesh with an arbitrary number of rows and columns. The algorithm is analysed through simple analytical models of execution time, which show that an adequate choice of the mesh configuration (number of rows and columns) can improve performance significantly with respect to a one-dimensional configuration, which is the most frequently considered scenario in current proposals. This improvement is especially noticeable in large systems.

UPCommons. Portal del coneixement obert de la UPC
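
For readers unfamiliar with the one-sided Jacobi approach named in the entry above, the following is a minimal sequential sketch in Python. It is not the paper's mesh algorithm: the function name `one_sided_jacobi`, the tolerance, the sweep limit, and the row-by-row pair ordering are all assumptions made for illustration. The sketch orthogonalizes the columns of A·V by plane rotations and reads each eigenvalue off as a Rayleigh quotient; it assumes a real symmetric matrix with no pair of eigenvalues of equal magnitude and opposite sign.

```python
import math

def one_sided_jacobi(a, tol=1e-12, max_sweeps=50):
    """Eigenvalues and eigenvectors of a real symmetric matrix `a` (list of rows).

    Hypothetical helper for illustration only; returns (eigenvalues, columns of V).
    """
    n = len(a)
    # w[j] is column j of A*V and v[j] is column j of V; V starts as the identity.
    w = [[a[i][j] for i in range(n)] for j in range(n)]
    v = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in w[p])
                beta = sum(x * x for x in w[q])
                gamma = sum(x * y for x, y in zip(w[p], w[q]))
                # Skip pairs of columns that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that makes the two columns of A*V orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same rotation to A*V and to V.
                for cols in (w, v):
                    cp, cq = cols[p], cols[q]
                    cols[p] = [c * x - s * y for x, y in zip(cp, cq)]
                    cols[q] = [s * x + c * y for x, y in zip(cp, cq)]
        if not rotated:
            break
    # Each eigenvalue is the Rayleigh quotient v_j . (A v_j), with ||v_j|| = 1.
    eigenvalues = [sum(x * y for x, y in zip(v[j], w[j])) for j in range(n)]
    return eigenvalues, v
```

Each rotation touches only two columns, so disjoint column pairs can in principle be processed independently; how pairs are assigned to the nodes of a mesh is the subject of the paper and is not modelled here.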

Parallel discrete event simulation: A shared memory approach

Author: Malony Allen D.
Mccredie Bradley D.
Reed Daniel A.

With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.

NASA Technical Reports Server

The effect of real workloads and stochastic workloads on the performance of allocation and scheduling algorithms in 2D mesh multicomputers

Author: Ababneh I.
Bani-Mohammad S.
Ferguson J.D.
Mackenzie L.M.
Ould-Khaoua M.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/04/2008

The performance of existing non-contiguous processor allocation strategies has traditionally been evaluated by simulation, using a stochastic workload model to generate a stream of incoming jobs. To validate these evaluations, the algorithms' performance also needs to be assessed on a real workload trace. In this paper, we evaluate the performance of several well-known processor allocation and job scheduling strategies based on a real workload trace and compare the results against those obtained from using a stochastic workload. Our results reveal that the conclusions reached on the relative performance merits of the allocation strategies when a real workload trace is used are in general compatible with those obtained when a stochastic workload is used.

Crossref

Enlighten

A network flow model for load balancing in circuit-switched multicomputers

Author: Bokhari Shahid H.

In multicomputers that utilize circuit switching or wormhole routing, communication overhead depends largely on link contention; the variation due to distance between nodes is negligible. This has a major impact on the load balancing problem. In this case, there are some nodes with excess load (sources) and others with deficit load (sinks), and it is required to find a matching of sources to sinks that avoids contention. The problem is made complex by the hardwired routing on currently available machines: the user can control only which nodes communicate but not how the messages are routed. Network flow models of message flow in the mesh and the hypercube were developed to solve this problem. The crucial property of these models is the correspondence between minimum cost flows and correctly routed messages. To solve a given load balancing problem, a minimum cost flow algorithm is applied to the network. This permits one to determine efficiently a maximum contention-free matching of sources to sinks which, in turn, tells one how much of the given imbalance can be eliminated without contention.

NASA Technical Reports Server
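
The source-to-sink matching described in the entry above can be illustrated with a small minimum cost flow sketch in Python. This is a simplified stand-in, not the paper's model: it assumes a 2-D mesh whose links each carry at most one message (unit capacity, cost of one hop), and it ignores the hardwired routing constraints that the paper's network models are built to capture. The names `MinCostFlow` and `contention_free_matching` are hypothetical.

```python
from collections import deque

class MinCostFlow:
    """Successive shortest paths on a residual graph (illustrative helper)."""

    def __init__(self, n):
        self.n = n
        # Each edge is [target, residual capacity, cost, index of reverse edge].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            # Queue-based Bellman-Ford: cheapest residual path from s to t.
            dist = [float("inf")] * self.n
            prev = [None] * self.n
            in_queue = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                in_queue[u] = False
                for i, (v, cap, c, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = (u, i)
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if prev[t] is None:
                return flow, cost  # no augmenting path remains
            # Find the bottleneck capacity along the path, then push flow.
            push, v = float("inf"), t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            v = t
            while v != s:
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]

def contention_free_matching(rows, cols, sources, sinks):
    """Return (number of matched source/sink pairs, total hops used).

    Assumption: every mesh link carries at most one message in each direction.
    """
    n = rows * cols
    s, t = n, n + 1  # super source and super sink
    net = MinCostFlow(n + 2)
    for r in range(rows):
        for c in range(cols):
            u = r * cols + c
            if c + 1 < cols:  # horizontal link, both directions
                net.add_edge(u, u + 1, 1, 1)
                net.add_edge(u + 1, u, 1, 1)
            if r + 1 < rows:  # vertical link, both directions
                net.add_edge(u, u + cols, 1, 1)
                net.add_edge(u + cols, u, 1, 1)
    for u in sources:
        net.add_edge(s, u, 1, 0)
    for u in sinks:
        net.add_edge(u, t, 1, 0)
    return net.solve(s, t)
```

On a 1 x 4 line with sources at nodes 0 and 1 and sinks at nodes 2 and 3, `contention_free_matching(1, 4, [0, 1], [2, 3])` returns `(1, 1)`: only one message can cross the single link between nodes 1 and 2, so only part of the imbalance can be removed without contention.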