
    Design of testbed and emulation tools

    The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software-based set of hierarchical tools was chosen to provide maximum flexibility, to ease migration to new computers as technology improves, and to take advantage of the inherent reliability and availability of commercially available computing systems.

    Parallel Architectures and Parallel Algorithms for Integrated Vision Systems

    Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform a high-level application task (e.g., object recognition). An IVS normally involves algorithms from low-level, intermediate-level, and high-level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.

    An empirical evaluation of techniques for parallel simulation of message passing networks

    In the field of computer design, simulation is an essential tool for validating and evaluating architectural proposals. Conventional simulation techniques, designed for use on sequential computers, are too slow if the system to be simulated is large or complex. The aim of this work is to find techniques that accelerate simulations by exploiting the parallelism available in current commercial multicomputers, and to use these techniques to study a model of a message router. This router has been designed to form the communication infrastructure of a (hypothetical) massively parallel computer. Three parallel simulation techniques have been considered: synchronous, asynchronous-conservative, and asynchronous-optimistic. These algorithms have been implemented on three multicomputers: a transputer-based Supernode, an Intel Paragon, and a network of workstations. The influence that factors such as the characteristics of the simulated models, the organization of the simulators, and the characteristics of the target multicomputers have on the performance of the simulations has been measured and characterized. It is concluded that optimistic parallel simulation techniques are not suitable for the kind of models considered, although they may provide good performance in other environments. A network of workstations is not the right platform for these experiments, because the communication demands of the parallel simulators surpass the abilities of local area networks; the granularity is too fine. Synchronous and conservative parallel simulation techniques perform very well on the Supernode and on the Paragon, especially if the model to be simulated is complex or large, precisely the worst case for traditional sequential simulators. In this way, studies previously considered unrealizable due to their exceedingly high computational cost can be performed in reasonable times. Additionally, the range of uses of multicomputers can be broadened beyond purely numeric applications. This work was partially funded by the Comisión Interministerial de Ciencia y Tecnología under contract TIC95-037.
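
    To make the synchronous technique concrete, the sketch below shows a minimal time-stepped parallel simulation kernel: all logical processes (LPs) agree on the globally earliest next-event time, then each safely executes its events up to that instant. It is a generic illustration assuming MPI rather than the transputer or Paragon message layers actually used, and the rank-dependent delay is a toy stand-in for the router model's event queue.

        /* Minimal synchronous (time-stepped) parallel simulation kernel.
         * Each MPI rank owns one logical process (LP); the delay formula
         * is a toy stand-in for the router model's real event queue. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            double now = 0.0, end_time = 100.0;
            int events = 0;
            while (now < end_time) {
                /* Toy model: next local event after a varying delay. */
                double local_next = now + 1.0 + 0.5 * ((rank + events) % 4);
                double global_next;
                /* All LPs agree on the minimum next-event time; every
                 * event up to that instant can run safely in parallel. */
                MPI_Allreduce(&local_next, &global_next, 1, MPI_DOUBLE,
                              MPI_MIN, MPI_COMM_WORLD);
                if (local_next == global_next)
                    events++;               /* "process" the local event */
                now = global_next;
            }
            printf("LP %d of %d processed %d events\n", rank, size, events);
            MPI_Finalize();
            return 0;
        }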

    Visualization of program performance on concurrent computers

    A distributed memory concurrent computer (such as a hypercube computer) is inherently a complex system involving the collective and simultaneous interaction of many entities engaged in computation and communication activities. Program performance evaluation in concurrent computer systems requires methods and tools for observing, analyzing, and displaying system performance. This dissertation describes a methodology for collecting and displaying, via a unique graphical approach, performance measurement information from (possibly large) concurrent computer systems. Performance data are generated and collected via instrumentation. The data are then reduced via conventional cluster analysis techniques and converted into a pictorial form to highlight important aspects of program states during execution. Local and summary statistics are calculated. Included in the suite of defined metrics are measures for quantifying and comparing amounts of computation and communication. A novel kind of data plot is introduced to visually display both temporal and spatial information describing system activity. Phenomena such as hot spots of activity are easily observed, and in some cases, patterns inherent in the application algorithms being studied are highly visible. The approach also provides a framework for a visual solution to the problem of mapping a given parallel algorithm to an underlying parallel machine. A prototype implementation applied to several case studies is presented to demonstrate the feasibility and power of the approach.
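
    As a minimal sketch of the kind of computation/communication metric such a tool might derive from trace data, the code below computes the fraction of time each node spent computing. The trace record layout and the ratio metric are illustrative assumptions, not the dissertation's actual instrumentation format.

        /* Illustrative computation/communication metric over per-node
         * trace records; values near 0 flag communication-bound
         * "hot spots" worth visualizing. The record layout is an
         * assumption for demonstration only. */
        #include <stdio.h>

        typedef enum { COMPUTE, COMMUNICATE } Kind;

        typedef struct {
            int    node;      /* processor id in the hypercube  */
            Kind   kind;      /* what the node was doing        */
            double duration;  /* seconds spent in this activity */
        } TraceRecord;

        void compute_ratios(const TraceRecord *t, int n, int nodes,
                            double *ratio)
        {
            double comp[64] = {0}, comm[64] = {0};  /* assume nodes <= 64 */
            for (int i = 0; i < n; i++) {
                if (t[i].kind == COMPUTE) comp[t[i].node] += t[i].duration;
                else                      comm[t[i].node] += t[i].duration;
            }
            for (int p = 0; p < nodes; p++) {
                double total = comp[p] + comm[p];
                ratio[p] = total > 0.0 ? comp[p] / total : 0.0;
            }
        }

        int main(void)
        {
            TraceRecord trace[] = {
                {0, COMPUTE, 4.0}, {0, COMMUNICATE, 1.0},
                {1, COMPUTE, 1.0}, {1, COMMUNICATE, 4.0},
            };
            double ratio[64];
            compute_ratios(trace, 4, 2, ratio);
            for (int p = 0; p < 2; p++)
                printf("node %d: %.0f%% computation\n", p, 100.0 * ratio[p]);
            return 0;
        }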

    Proposal and development of a highly modular and scalable self-adaptive hardware architecture with parallel processing capability

    This dissertation describes a novel, unconventional self-adaptive hardware architecture with capacity for parallel processing. For scalability reasons, this bioinspired architecture is based on a regular array of homogeneous cells. The proposed programmable architecture implements self-adaptive capabilities in a distributed way, including self-placement and self-routing, which, due to its intrinsic design, enable the development of systems with runtime reconfiguration, self-repair, and/or fault tolerance capabilities. The physical implementation of this architecture is composed of two layers: interconnected cells in the first layer, and interconnected switch and pin matrices in the second layer. The cell is the basic element of the proposed self-adaptive architecture. Any application scheduled onto the system has to be organized into components, where each component is composed of one or more interconnected cells. The interconnection of cells inside a component is made at the cell level (first layer), while the physical interconnections between components are made in the second layer. Additionally, two layers are defined as a conceptual organization for the implementation of general-purpose applications: the SANE and the SANE assembly. The Self-Adaptive Networked Entity (SANE) is composed of a group of components. It is the basic self-adaptive computing system, with the ability to monitor its local environment and its internal computation process. The SANE Assembly (SANE-ASM) is composed of a group of interconnected SANEs. The processing capabilities of the cell are included in its Functional Unit (FU), which can be described as a four-core configurable multicomputer. The FU includes twelve programmable configuration modes; each cell permits selecting from one to four processors working in parallel, with different sizes of program and data memories. The self-adaptive capabilities of the cell are executed mainly by the Cell Configuration Unit (CCU). The self-placement algorithm is responsible for finding the most suitable position in the cell array at which to insert a new cell of a component. The self-routing algorithm interconnects the FU ports of two cells through the cell ports. The self-placement and self-routing processes allow complex functionality changes to be performed in real time. These processes endow the system with enhanced functionality, enabling it to change itself, which allows run-time self-configuration without the need for any configuration manager. The proposed architecture includes two fault tolerance mechanisms. One is the Dynamic Fault Tolerance Scaling Technique, which can create and eliminate redundant copies of the functional section of a specific application. The other is a dedicated, static Fault Tolerance System, which provides redundant processing capabilities that work continuously. When a failure in the execution of a program is detected, the processors of the cell are stopped, and the self-elimination and self-replication processes start for the cell (or cells) involved in the failure. An FPGA-based prototype and a software tool have been built for demonstration purposes. The prototype includes all the self-adaptive capabilities described in this dissertation. To provide a complete development system, the software tool SANE Project Developer (SPD) has been implemented. The SPD is an Integrated Development Environment (IDE) that allows generating the memory initialization data for the control microprocessor inside the prototype.
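
    To give a flavour of what a self-placement search involves, the sketch below finds the free cell nearest to a component's last placed cell by breadth-first search over the cell array. This is a generic stand-in under simplifying assumptions (a small fixed grid, nearest-free-cell policy), not the dissertation's actual CCU algorithm.

        /* Illustrative self-placement: BFS from a component's last
         * placed cell to the nearest free cell in the array. Generic
         * sketch, not the dissertation's actual algorithm. */
        #include <stdio.h>
        #include <string.h>

        #define W 8
        #define H 8

        static int grid[H][W];   /* 0 = free, 1 = occupied */

        int self_place(int sr, int sc, int *out_r, int *out_c)
        {
            int qr[W * H], qc[W * H], head = 0, tail = 0;
            int seen[H][W];
            memset(seen, 0, sizeof seen);
            qr[tail] = sr; qc[tail] = sc; tail++;
            seen[sr][sc] = 1;
            static const int dr[] = {-1, 1, 0, 0}, dc[] = {0, 0, -1, 1};
            while (head < tail) {
                int r = qr[head], c = qc[head]; head++;
                if (!grid[r][c]) {      /* first free cell found is nearest */
                    *out_r = r; *out_c = c;
                    return 1;
                }
                for (int k = 0; k < 4; k++) {
                    int nr = r + dr[k], nc = c + dc[k];
                    if (nr >= 0 && nr < H && nc >= 0 && nc < W
                        && !seen[nr][nc]) {
                        seen[nr][nc] = 1;
                        qr[tail] = nr; qc[tail] = nc; tail++;
                    }
                }
            }
            return 0;                   /* the array is full */
        }

        int main(void)
        {
            grid[3][3] = grid[3][4] = 1;  /* component occupies two cells */
            int r, c;
            if (self_place(3, 4, &r, &c))
                printf("next cell of the component goes at (%d,%d)\n", r, c);
            return 0;
        }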

    Automatic visual recognition using parallel machines

    Invariant features and quick matching algorithms are two major concerns in the area of automatic visual recognition. The former reduces the size of the model database to be established, and the latter shortens the computation time. This dissertation discusses both line invariants under perspective projection and a parallel implementation of a dynamic programming technique for shape recognition. The feasibility of using parallel machines is demonstrated through the dramatically reduced time complexity. The algorithms are implemented on the AP1000 MIMD parallel machine. For processing an object with n features, the time complexity of the proposed parallel algorithm is O(n), while that of a uniprocessor is O(n^2). Two applications, one for shape matching and the other for chain-code extraction, are used to demonstrate the usefulness of the methods. Invariants of four general lines under perspective projection are also discussed. In contrast to approaches based on epipolar geometry, the invariants are investigated under isotropy subgroups. Theoretically, two independent invariants can be found for four general lines in 3D space. In practice, it is shown how to obtain these two invariants from the projective images of four general lines without the need for camera calibration. A projective-invariant recognition system based on a hypothesis-generation-testing scheme is run on a hypercube parallel architecture. Object recognition is achieved by matching scene projective invariants to model projective invariants, a step called transfer. A hypothesis-generation-testing scheme is then implemented on the hypercube parallel architecture.
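
    The O(n^2) to O(n) reduction follows directly from partitioning the n x n feature-comparison work across n processors, so each performs only n comparisons. The sketch below simulates that partitioning sequentially; the score() similarity function and the round-robin assignment are illustrative assumptions, not the dissertation's matching code.

        /* Row-wise partition of the n x n feature comparisons: processor
         * p scores scene features p, p+P, p+2P, ... against all n model
         * features. With P = n processors each does O(n) work, versus
         * O(n^2) on a uniprocessor. score() is a toy similarity measure. */
        #include <stdio.h>
        #include <stdlib.h>

        #define N 8   /* features per object */
        #define P 4   /* processors          */

        static double score(int scene_f, int model_f)
        {
            return -abs(scene_f - model_f);   /* toy similarity */
        }

        int main(void)
        {
            int best[N];
            for (int p = 0; p < P; p++) {        /* one iteration = one CPU */
                for (int i = p; i < N; i += P) { /* this CPU's scene features */
                    int arg = 0;
                    for (int j = 1; j < N; j++)
                        if (score(i, j) > score(i, arg)) arg = j;
                    best[i] = arg;
                }
            }
            for (int i = 0; i < N; i++)
                printf("scene feature %d -> model feature %d\n", i, best[i]);
            return 0;
        }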

    Network control for a multi-user transputer-based system.

    A dissertation submitted to the Faculty of Engineering, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science in Engineering. The MC2/64 system is a configurable multi-user transputer-based system which was designed using a modular approach. The MC2/64 consists of MC2 Clusters which are connected using a modified Clos network. The MC2 Clusters were designed and realised as completely configurable modules using and extending an algorithm based on Eulerian cycles through a requested graph. This dissertation discusses the configuration algorithm and the extensions made to it for the MC2 Clusters. The total MC2/64 system is not completely configurable, as an MC2 Cluster releases only a limited number of links for inter-cluster connections. This dissertation analyses the configurability of MC2/64, and also presents algorithms which enhance the usability of the system from the user's point of view. The design and implementation of the network control software are also covered. The network control software must allow multiple users to use the system without them influencing each other's transputer domains. This dissertation therefore gives an overview of network control problems and the solutions implemented in current MC2/64 systems. The results of this research will hopefully aid in the design of future MC2 systems which will provide South Africa with much needed, low cost, high performance computing power.
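
    For readers unfamiliar with the Eulerian-cycle approach to link configuration, the sketch below is a minimal iterative Hierholzer's algorithm over an undirected multigraph of requested connections; walking the cycle and alternating edges is the classic way to split link requests evenly over switch planes. It is a generic illustration of the underlying technique, not the extended MC2 algorithm itself.

        /* Minimal Hierholzer's algorithm: Eulerian circuit of an
         * undirected multigraph given as a matrix of edge counts
         * (all vertex degrees must be even, graph connected). */
        #include <stdio.h>

        #define V 4

        static int adj[V][V] = {     /* requested-link multiplicities */
            {0, 2, 1, 1},
            {2, 0, 1, 1},
            {1, 1, 0, 2},
            {1, 1, 2, 0},
        };

        int main(void)
        {
            int stack[64], top = 0;
            int tour[64], len = 0;
            stack[top++] = 0;                 /* start anywhere on the cycle */
            while (top > 0) {
                int v = stack[top - 1], u;
                for (u = 0; u < V; u++)
                    if (adj[v][u] > 0) break;
                if (u < V) {                  /* follow an unused edge */
                    adj[v][u]--; adj[u][v]--;
                    stack[top++] = u;
                } else {                      /* dead end: emit vertex */
                    tour[len++] = v;
                    top--;
                }
            }
            printf("Eulerian circuit:");
            for (int i = len - 1; i >= 0; i--) printf(" %d", tour[i]);
            printf("\n");
            return 0;
        }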

    Performance Evaluation of Specialized Hardware for Fast Global Operations on Distributed Memory Multicomputers

    Workstation cluster multicomputers are increasingly being applied to solving scientific problems that require massive computing power. Parallel Virtual Machine (PVM) is a popular message-passing model used to program these clusters. One of the major performance-limiting factors for cluster multicomputers is their inefficiency in performing parallel program operations involving collective communication. These operations include synchronization, global reduction, broadcast/multicast operations, and orderly access to shared global variables. Hall has demonstrated that a secondary network with a wide tree topology and centralized coordination processors (COPs) could improve the performance of global operations on a variety of distributed architectures [Hall94a]. My hypothesis was that the efficiency of many PVM applications on workstation clusters could be significantly improved by utilizing a COP system for collective communication operations. To test this hypothesis, I interfaced the COP system with PVM. The interface software includes a virtual memory-mapped secondary network interface driver and a function library that allows the COP system to be used in place of PVM function calls in application programs. This implementation makes it possible to easily port any existing PVM application to perform fast global operations using the COP system. To evaluate the performance improvement, I measured the cost of various PVM global functions, derived the cost of the equivalent COP library global functions, and compared the results. To analyze the impact of global operations on the overall execution time of applications, I instrumented a complex molecular dynamics PVM application and performed measurements. The measurements were performed for a sample cluster size of 5 and for message sizes up to 16 kilobytes. The comparison of PVM and COP system global operation performance clearly demonstrates that the COP system can speed up a variety of global operations involving small-to-medium sized messages by factors of 5-25. Analysis of the example application for a sample cluster size of 5 shows that the speedup provided by my global function libraries and the COP system reduces the overall execution time of this and similar applications by more than a factor of 1.5. Additionally, the performance improvement seen by applications increases as the cluster size grows, thus providing a scalable solution for performing global operations.
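
    The cost gap being measured here comes from software collectives needing ceil(log2 P) point-to-point message rounds over the primary network, each paying full message latency, where a COP-style secondary network performs the operation in a single hardware transaction. The sketch below simulates a binary-tree global sum to make the round count visible; it is a generic illustration, not PVM's actual API or the COP library.

        /* Binary-tree global sum over P ranks: ceil(log2 P) message
         * rounds on the primary network, versus one hardware transaction
         * on a COP secondary network. The loop simulates all ranks
         * sequentially; local[] stands in for per-rank values. */
        #include <stdio.h>

        #define P 8

        int main(void)
        {
            double local[P];
            for (int r = 0; r < P; r++) local[r] = r + 1.0;  /* values 1..P */

            int rounds = 0;
            for (int stride = 1; stride < P; stride *= 2, rounds++)
                for (int r = 0; r < P; r += 2 * stride)
                    if (r + stride < P)
                        local[r] += local[r + stride];  /* recv from partner */

            printf("global sum = %.1f after %d message rounds (log2 %d)\n",
                   local[0], rounds, P);   /* prints 36.0 after 3 rounds */
            return 0;
        }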

    The exploitation of parallelism on shared memory multiprocessors

    PhD Thesis. With the arrival of many general-purpose shared memory multiple processor (multiprocessor) computers in the commercial arena during the mid-1980s, a rift opened between the raw processing power offered by the emerging hardware and the relative inability of its operating software to deliver this power effectively to potential users. This rift stems from the fact that, currently, no computational model with the capability to elegantly express parallel activity is mature enough to be universally accepted and used as the basis for programming languages that exploit the parallelism multiprocessors offer. In addition, there is a lack of software tools to assist programmers in designing and debugging parallel programs. Although much research has been done in the field of programming languages, no undisputed candidate for the most appropriate language for programming shared memory multiprocessors has yet been found. This thesis examines why this state of affairs has arisen and proposes programming language constructs, together with a programming methodology and environment, to close the ever-widening hardware-to-software gap. The novel programming constructs described in this thesis are intended for use in imperative languages, even though they make use of the synchronisation inherent in the dataflow model by applying the semantics of single assignment when operating on shared data, giving rise to the term shared values. As there are several distinct parallel programming paradigms, matching flavours of shared value are developed to permit the concise expression of these paradigms. This research was supported by the Science and Engineering Research Council.
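
    A minimal sketch of the write-once, dataflow-style synchronisation that single-assignment shared data provides: readers block until the value has been assigned exactly once. The names (shared_value, sv_write, sv_read) and the POSIX-threads realisation are illustrative assumptions, not the thesis's actual constructs.

        /* Illustrative write-once "shared value" using POSIX threads:
         * readers block until the single assignment has happened. */
        #include <pthread.h>
        #include <stdio.h>

        typedef struct {
            pthread_mutex_t lock;
            pthread_cond_t  written;
            int             has_value;
            int             value;
        } shared_value;

        void sv_init(shared_value *sv)
        {
            pthread_mutex_init(&sv->lock, NULL);
            pthread_cond_init(&sv->written, NULL);
            sv->has_value = 0;
        }

        void sv_write(shared_value *sv, int v)   /* must happen exactly once */
        {
            pthread_mutex_lock(&sv->lock);
            sv->value = v;
            sv->has_value = 1;
            pthread_cond_broadcast(&sv->written); /* release waiting readers */
            pthread_mutex_unlock(&sv->lock);
        }

        int sv_read(shared_value *sv)             /* blocks until assigned */
        {
            pthread_mutex_lock(&sv->lock);
            while (!sv->has_value)
                pthread_cond_wait(&sv->written, &sv->lock);
            int v = sv->value;
            pthread_mutex_unlock(&sv->lock);
            return v;
        }

        static shared_value sv;

        static void *producer(void *arg)
        {
            sv_write(&sv, 42);                    /* the single assignment */
            return NULL;
        }

        int main(void)
        {
            sv_init(&sv);
            pthread_t t;
            pthread_create(&t, NULL, producer, NULL);
            printf("shared value = %d\n", sv_read(&sv)); /* waits for write */
            pthread_join(t, NULL);
            return 0;
        }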