Visualization of program performance on concurrent computers
A distributed memory concurrent computer (such as a hypercube computer) is inherently a complex system involving the collective and simultaneous interaction of many entities engaged in computation and communication activities. Program performance evaluation in concurrent computer systems requires methods and tools for observing, analyzing, and displaying system performance. This dissertation describes a methodology for collecting and displaying, via a unique graphical approach, performance measurement information from (possibly large) concurrent computer systems. Performance data are generated and collected via instrumentation. The data are then reduced via conventional cluster analysis techniques and converted into a pictorial form to highlight important aspects of program states during execution. Local and summary statistics are calculated. Included in the suite of defined metrics are measures for quantifying and comparing amounts of computation and communication. A novel kind of data plot is introduced to visually display both temporal and spatial information describing system activity. Phenomena such as hot spots of activity are easily observed, and in some cases, patterns inherent in the application algorithms being studied are highly visible. The approach also provides a framework for a visual solution to the problem of mapping a given parallel algorithm to an underlying parallel machine. A prototype implementation applied to several case studies is presented to demonstrate the feasibility and power of the approach.
Towards a high performance cellular automata programming skeleton
Cellular automata provide an abstract model of parallel computation that can be effectively used for modeling and simulation of complex phenomena and systems. In this paper, we start from a skeleton designed to facilitate faster D-dimensional cellular automata application development. The key to the use of the skeleton is achieving an efficient implementation, irrespective of application-specific details. In the parallel implementation on a cluster, it was important to consider issues such as task and data decomposition. With multicore clusters, new problems have emerged. The increasing number of cores per node, together with the caches and shared memory inside the nodes, has led to a new hierarchy of access to processors. In this paper, we describe some optimizations that restructure the prototype code and expose an abstracted view of the multicore cluster to the high performance CA application developer. The implementation of lattice division functions establishes a partnership relation among parallel processes. We propose that this relation can be mapped efficiently onto the communication topology of the multicore cluster. We introduce a new mapping strategy that improves performance by adapting its communication pattern to the hardware affinities among processes allocated to different cores. We apply our approach to a two-dimensional application, achieving an appreciable reduction in execution time. Presented at the X Workshop Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI)
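The lattice division and partnership relation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the skeleton's actual code: the function names `block_bounds`, `decompose`, and `partners` are hypothetical, and a plain block decomposition on a non-periodic lattice is assumed. Each process receives one sub-lattice, and its partners are the processes whose blocks share a face with it.

```python
from itertools import product


def block_bounds(size, parts, index):
    """Half-open [start, stop) range of block `index` when `size` cells
    are split into `parts` nearly equal blocks (hypothetical helper)."""
    base, extra = divmod(size, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def decompose(lattice_shape, grid_shape):
    """Map each process coordinate to its sub-lattice, given as one
    (start, stop) pair per dimension. Works for any number of dimensions."""
    blocks = {}
    for coord in product(*(range(p) for p in grid_shape)):
        blocks[coord] = tuple(
            block_bounds(n, p, c)
            for n, p, c in zip(lattice_shape, grid_shape, coord)
        )
    return blocks


def partners(coord, grid_shape):
    """Process coordinates whose blocks share a face with `coord`.
    Assumes a non-periodic lattice, so border processes have fewer partners."""
    result = []
    for dim, extent in enumerate(grid_shape):
        for step in (-1, 1):
            pos = coord[dim] + step
            if 0 <= pos < extent:
                result.append(coord[:dim] + (pos,) + coord[dim + 1:])
    return result


if __name__ == "__main__":
    # A 10x8 lattice divided among a 3x2 grid of processes.
    blocks = decompose((10, 8), (3, 2))
    for coord, bounds in sorted(blocks.items()):
        print(coord, bounds, partners(coord, (3, 2)))
```

A mapping strategy of the kind the abstract describes would then decide which cores host which process coordinates, so that partner pairs land on cores with shared caches or memory where possible; that placement step is outside this sketch.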
On the Effect of Quantum Interaction Distance on Quantum Addition Circuits
We investigate the theoretical limits of the effect of the quantum
interaction distance on the speed of exact quantum addition circuits. For this
study, we exploit graph embedding for quantum circuit analysis. We study a
logical mapping of qubits and gates of any $\Omega(\log n)$-depth quantum adder
circuit for two $n$-qubit registers onto a practical architecture, which limits
interaction distance to the nearest neighbors only and supports only one- and
two-qubit logical gates. Unfortunately, on the chosen $k$-dimensional practical
architecture, we prove that the depth lower bound of any exact quantum addition
circuit is no longer $\Omega(\log n)$, but $\Omega(\sqrt[k]{n})$. This
result, the first application of graph embedding to quantum circuits and
devices, provides a new tool for compiler development, emphasizes the impact of
quantum computer architecture on performance, and acts as a cautionary note
when evaluating the time performance of quantum algorithms. Comment: accepted
for ACM Journal on Emerging Technologies in Computing Systems
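One informal way to see why nearest-neighbor interaction matters is to look at how far apart qubits must sit in a small grid. The Python sketch below is only an illustration of that geometric intuition, not the graph-embedding proof the abstract refers to; the helper names `min_side` and `grid_diameter` are hypothetical. It computes the Manhattan diameter of the smallest k-dimensional cubic grid that can hold n qubits, a quantity that grows like the k-th root of n.

```python
def min_side(n, k):
    """Smallest integer side length s such that a k-dimensional cubic
    grid with s sites per side holds at least n qubits (s**k >= n)."""
    s = 1
    while s ** k < n:
        s += 1
    return s


def grid_diameter(n, k):
    """Manhattan distance between opposite corners of that grid, i.e. the
    number of nearest-neighbor hops separating the two farthest sites."""
    return k * (min_side(n, k) - 1)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, min_side(64, k), grid_diameter(64, k))
```

For 64 qubits the farthest pair is 63 hops apart on a line, 14 on a square grid, and 9 on a cubic grid, which shows how the architecture's dimension changes the distances a circuit has to bridge.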
Exploiting distributed/shared memory hierarchies with Hitmap
Computer clusters used for high-performance computing are currently built by
interconnecting shared-memory machines. The message-passing paradigm can serve
as a common programming model for such clusters, launching as many processes as
there are cores available across all the machines of the cluster. However, this
way of programming is not efficient. Exploiting these hierarchical systems
efficiently requires a combination of different programming models and tools,
each suited to a different level of the execution platform.
This work presents a method that eases programming for environments that combine
distributed and shared memory. Coordination at the distributed-memory level is
handled with the Hitmap library. We show how to integrate Hitmap with
shared-memory programming models and with automatic tools that parallelize and
optimize sequential code. This new combination makes it possible to exploit the
most appropriate techniques for each level of the system, and eases the
generation of multilevel parallel programs that automatically adapt their
communication and synchronization structure to the machine on which they run.
The experimental results show that the proposal improves on the best results
obtained with reference programs manually optimized using MPI or OpenMP.
Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Máster en Investigación en Tecnologías de la Información y las Comunicaciones