76 research outputs found
Submicron Systems Architecture Project : Semiannual Technical Report
The Mosaic C is an experimental fine-grain multicomputer
based on single-chip nodes. The Mosaic C chip includes 64KB of fast dynamic RAM,
processor, packet interface, ROM for bootstrap and self-test, and a two-dimensional selftimed
router. The chip architecture provides low-overhead and low-latency handling of
message packets, and high memory and network bandwidth. Sixty-four Mosaic chips are
packaged by tape-automated bonding (TAB) in an 8 x 8 array on circuit boards that can, in
turn, be arrayed in two dimensions to build arbitrarily large machines. These 8 x 8 boards are
now in prototype production under a subcontract with Hewlett-Packard. We are planning
to construct a 16K-node Mosaic C system from 256 of these boards. The suite of Mosaic
C hardware also includes host-interface boards and high-speed communication cables. The
hardware developments and activities of the past eight months are described in section 2.1.
The programming system that we are developing for the Mosaic C is based on the
same message-passing, reactive-process, computational model that we have used with earlier
multicomputers, but the model is implemented for the Mosaic in a way that supports finegrain
concurrency. A process executes only in response to receiving a message, and may in
execution send messages, create new processes, and modify its persistent variables before
it either exits or becomes dormant in preparation for receiving another message. These
computations are expressed in an object-oriented programming notation, a derivative of
C++ called C+-. The computational model and the C+- programming notation are
described in section 2.2. The Mosaic C runtime system, which is written in C+-, provides
automatic process placement and highly distributed management of system resources. The
Mosaic C runtime system is described in section 2.3
The Use of Parallel Processing in VLSI Computer-Aided Design Application
Coordinated Science Laboratory was formerly known as Control Systems LaboratorySemiconductor Research Corporation / 87-DP-10
An empirical evaluation of techniques for parallel simulation of message passing networks
209 p.[EN]In the field of computer design, simulation is an essential tool to validate and evaluate architectural proposals. Conventional simulation techniques, designed for their use in sequential computers, are too slow if the system to simulate is large or complex. The aim of this work is to search for techniques to accelerate simulations exploiting the parallelism available in current, commercial multicomputers, and to use these techniques to study a model of a message router. This router has been designed to constitute the communication infrastructure of a (hypothetical) massively parallel computer.
Three parallel simulation techniques have been considered: synchronous, asynchronous-conservative and asynchronous-optimistic. These algorithms have been implemented in three multicomputers: a transputer-based Supernode, an Intel Paragon and a network of workstations. The influence that factors such as the characteristics of the simulated models, the organization of the simulators and the characteristics of the target multicomputers have in the performance of the simulations has been measured and characterized.
It is concluded that optimistic parallel simulation techniques are not suitable for the considered kind of models, although they may provide good performance in other environments. A network of workstations is not the right platform for our experiments, because the communication demands of the parallel simulators surpass the abilities of local area networks—the granularity is too fine. Synchronous and conservative parallel simulation techniques perform very well in the Supernode and in the Paragon, specially if the model to simulate is complex or large—precisely the worst case for traditional, sequential simulators. This way, studies previously considered as unrealizable, due to their exceedingly high computational cost, can be performed in reasonable times. Additionally, the spectrum of possibilities of using multicomputers can be broadened to execute more than numeric applications.[ES]En el ámbito del diseño de computadores, la simulación es una herramienta imprescindible para la validación y evaluación de cualquier propuesta arquitectónica. Las ténicas convencionales de simulación, diseñadas para su utilización en computadores secuenciales, son demasiado lentas si el sistema a simular es grande o complejo. El objetivo de esta tesis es buscar técnicas para acelerar estas simulaciones, aprovechando el paralelismo disponible en multicomputadores comerciales, y usar esas técnicas para el estudio de un modelo de encaminador de mensajes. Este encaminador está diseñado para formar infraestructura de comunicaciones de un hipotético computador masivamente paralelo.
En este trabajo se consideran tres técnicas de simulación paralela: sÃncrona, asÃncrona-conservadora y asÃncrona-optimista. Estos algoritmos se han implementado en tres multicomputadores: un Supernode basado en Transputers, un Intel Paragon y una red de estaciones de trabajo. Se caracteriza la influencia que tienen en las prestaciones de los simuladores aspectos tales como los parámetros del modelo simulado, la organización del simulador y las caracterÃsticas del multicomputador utilizado.
Se concluye que las técnicas de simulación paralela optimista no resultan adecuadas para trabajar con el modelo considerado, aunque pueden ofrecer un buen rendimiento en otros entornos. La red de estaciones de trabajo no resulta una plataforma apropiada para estas simulaciones, ya que una red local no reúne condiciones para la ejecución de aplicaciones paralelas de grano fino. Las técnicas de simulación paralela sÃncrona y conservadora dan muy buenos resultados en el Supernode y en el Paragon, especialmente si el modelo a simular es complejo o grande—precisamente el peor caso para los algoritmos secuenciales. De esta forma, estudios previamente considerados inviables, por ser demasiado costosos computacionalmente, pueden realizarse en tiempos razonables. Además, se amplÃa el espectro de posibilidades de los multicomputadores, utilizándolos para algo más que aplicaciones numéricas.Este trabajo ha sido parcialmente subvencionado por la Comisión Interministerial de Ciencia y TecnologÃa, bajo contrato TIC95-037
Recommended from our members
Cherub: A hardware distributed single shared address space memory architecture
Increased computer throughput can be achieved through the use of parallel processing. The granularity of a parallel program is the average number of instructions performed by the tasks constituting it. Coarse-grained programs typically execute huge numbers of instructions per task (w 105). The tasks in fine-grained programs are typically short (æ 103). In general, the finer the program grain, the greater the potential for exploiting parallelism. Amdahl’s Law shows that in the absence of overheads, the more potential parallelism that is realised in an algorithm, the faster it will be. The economical granularity of tasks is determined by the intertask communications overhead. Break-even occurs when processing is approximately equally divided between useful work and overhead.
The two common parallel programming paradigms are shared variable and message passing. Shared variable is, in general, the more natural of the two as it allows implicit communication between tasks. This encourages the programmer to make use of fine-grained tasks. The message passing paradigm requires explicit communication between tasks. This encourages the programmer to use coarser-grained tasks.
Two kinds of parallel architecture have become established. The first is the multiprocessor, which is built around a shared bus giving broadcast communications and a shared memory. This is characterised by low communications overhead, but limited scalability. The second is the multicomputer, which is based on point-to-point communications with larger communications overhead, but good scalability. Quantitatively, the low overhead of the multiprocessor is well matched to fine-grain tasks and, hence, to supporting the shared variable paradigm, while the high overhead of the multicomputer matches it to coarse-grain parallelism and, hence, to the message passing paradigm.
Currently, there appears to be no middle ground in parallel computing; an architecture which can support both several hundred medium-grained (« 104 instructions) parallel tasks and the shared variable programming paradigm would be advantageous in many applications.
This thesis asserts that it is possible to implement a new computer architecture, Cherub, which has at least 200 processors and is able to support shared variable programming with an optimal task granularity of around 104 instructions. This can be achieved through the combination of a hardware-based distributed shared single address space and a wafer-scale communications network.
To support the thesis, the dissertation first specifies a programmer’s interface to Cherub which is simple enough to implement in hardware. It then designs algorithms which provide this interface, allowing the requirements of the underlying network to be estimated. Finally, a wafer scale communications network is outlined, and simulations are used to demonstrate that it can provide the performance required to successfully implement Cherub
Experimental analysis of computer system dependability
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance
ProperCAD II: A Run-Time Library for Portable, Parallel, Object-Oriented Programming with Applications to VLSI CAD
Coordinated Science Laboratory was formerly known as Control Systems LaboratorySemiconductor Research Corporation / grant 93-DP-10
C-MOS array design techniques: SUMC multiprocessor system study
The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four 1/0 processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through 1/0 processors, and multiport access to multiple and shared memory units
- …