    An Initial Evaluation of the Tera Multithreaded Architecture and Programming System Using the C3I Parallel Benchmark Suite

    The Tera Multithreaded Architecture (MTA) is a radical new architecture intended to revolutionize high-performance computing in both the scientific and commercial marketplaces. Each processor supports 128 threads in hardware. Extremely fast thread switching is used to mask latency in a uniform-access memory system without caching. It is claimed that these hardware characteristics allow compilers to easily transform sequential programs into efficient multithreaded programs for the Tera MTA. In this paper, we attempt to provide an objective initial evaluation of the performance of the Tera multithreaded architecture and programming system for general-purpose applications. The basis of our investigation is two programs from the C3I Parallel Benchmark Suite (C3IPBS). Both of these programs have previously been shown to have the potential for large-scale parallelization. We compare the performance of these programs on (i) a fast uniprocessor, (ii) two conventional shared-memory multiprocessors, and (iii) the first installed Tera MTA (at the San Diego Supercomputer Center). On these platforms, we compare the effectiveness of both automatic and manual parallelization.

    Análisis de rendimiento de aplicaciones paralelas de memoria compartida: problema N-body [Performance analysis of shared-memory parallel applications: the N-body problem]

    This work analyzes the performance of four shared-memory multiprocessor compute nodes in solving the N-body problem. The serial algorithm is parallelized and coded in C extended with OpenMP. The result is two variants that follow two different optimization criteria: minimizing memory requirements and minimizing the amount of computation. The program's performance on the compute nodes is then analyzed. The performance of the sequential and parallel variants of the application, and of the compute nodes, is modeled; the programs are instrumented and executed to obtain results in the form of several metrics; finally, the results are presented and interpreted, providing insights that explain performance inefficiencies and bottlenecks and suggest possible lines of improvement. The experience of this particular study has made it possible to outline an incipient methodology for analyzing performance, identifying problems, and tuning algorithms to shared-memory multiprocessor compute nodes.