22,373 research outputs found

    Experimental Benchmarks and Initial Evaluation of the Performance of the PASM System Prototype

    Get PDF
    The work reported here represents experiences with the PASM parallel processing system prototype during its first operational year. Most of the experiments were performed by students in the Fall semester of 1987. The first programming, and the first timing measurements, were made during the summer of 1987 by Sam Fineberg. The goal of the collection of experiments presented here was to undertake an Application-driven Architecture Study of the PASM system as a paradigm for parallel architecture evaluation in general. PASM was an excellent vehicle for experimenting with this evaluation technique due to its unique architectural features. Among these are: 1. A reconfigurable, partitionable multistage circuit-switched network. 2. Support for both SIMD and MIMD programs. 3. Ability to execute hybrid SIMD/MIMD programs. 4. An instruction queue which allows overlap of control-flow and data manipulation between micro-control (MC) units and processing elements (PE). It had been hypothesized that superlinear speed-up over the number of PEs could be attained with this feature, and experimental results verified this. 5. Support for barrier synchronization of MIMD tasks. This feature was exploited in some non-standard ways to show the ability to decouple variant length SIMD instructions into multiple MIMD streams for an overall performance benefit. This type of study is expected to continue in the future on PASM and other parallel machines at Purdue. This report should serve as a guide for this future work as well

    On-board B-ISDN fast packet switching architectures. Phase 2: Development. Proof-of-concept architecture definition report

    Get PDF
    For the next-generation packet switched communications satellite system with onboard processing and spot-beam operation, a reliable onboard fast packet switch is essential to route packets from different uplink beams to different downlink beams. The rapid emergence of point-to-point services such as video distribution, and the large demand for video conference, distributed data processing, and network management makes the multicast function essential to a fast packet switch (FPS). The satellite's inherent broadcast features gives the satellite network an advantage over the terrestrial network in providing multicast services. This report evaluates alternate multicast FPS architectures for onboard baseband switching applications and selects a candidate for subsequent breadboard development. Architecture evaluation and selection will be based on the study performed in phase 1, 'Onboard B-ISDN Fast Packet Switching Architectures', and other switch architectures which have become commercially available as large scale integration (LSI) devices

    VLSI Design

    Get PDF
    This book provides some recent advances in design nanometer VLSI chips. The selected topics try to present some open problems and challenges with important topics ranging from design tools, new post-silicon devices, GPU-based parallel computing, emerging 3D integration, and antenna design. The book consists of two parts, with chapters such as: VLSI design for multi-sensor smart systems on a chip, Three-dimensional integrated circuits design for thousand-core processors, Parallel symbolic analysis of large analog circuits on GPU platforms, Algorithms for CAD tools VLSI design, A multilevel memetic algorithm for large SAT-encoded problems, etc

    General Purpose Computation on Graphics Processing Units Using OpenCL

    Get PDF
    Computational Science has emerged as a third pillar of science along with theory and experiment, where the parallelization for scientific computing is promised by different shared and distributed memory architectures such as, super-computer systems, grid and cluster based systems, multi-core and multiprocessor systems etc. In the recent years the use of GPUs (Graphic Processing Units) for General purpose computing commonly known as GPGPU made it an exciting addition to high performance computing systems (HPC) with respect to price and performance ratio. Current GPUs consist of several hundred computing cores arranged in streaming multi-processors so the degree of parallelism is promising. Moreover with the development of new and easy to use interfacing tools and programming languages such as OpenCL and CUDA made the GPUs suitable for different computation demanding applications such as micromagnetic simulations. In micromagnetic simulations, the study of magnetic behavior at very small time and space scale demands a huge computation time, where the calculation of magnetostatic field with complexity of O(Nlog(N)) using FFT algorithm for discrete convolution is the main contribution towards the whole simulation time, and it is computed many times at each time step interval. This study and observation of magnetization behavior at sub-nanosecond time-scales is crucial to a number of areas such as magnetic sensors, non volatile storage devices and magnetic nanowires etc. Since micromagnetic codes in general are suitable for parallel programming as it can be easily divided into independent parts which can run in parallel, therefore current trend for micromagnetic code concerns shifting the computationally intensive parts to GPUs. My PhD work mainly focuses on the development of highly parallel magnetostatic field solver for micromagnetic simulators on GPUs. I am using OpenCL for GPU implementation, with consideration that it is an open standard for parallel programming of heterogeneous systems for cross platform. The magnetostatic field calculation is dominated by the multidimensional FFTs (Fast Fourier Transform) computation. Therefore i have developed the specialized OpenCL based 3D-FFT library for magnetostatic field calculation which made it possible to fully exploit the zero padded input data with out transposition and symmetries inherent in the field calculation. Moreover it also provides a common interface for different vendors' GPUs. In order to fully utilize the GPUs parallel architecture the code needs to handle many hardware specific technicalities such as coalesced memory access, data transfer overhead between GPU and CPU, GPU global memory utilization, arithmetic computation, batch execution etc. In the second step to further increase the level of parallelism and performance, I have developed a parallel magnetostatic field solver on multiple GPUs. Utilizing multiple GPUs avoids dealing with many of the limitations of GPUs (e.g., on-chip memory resources) by exploiting the combined resources of multiple on board GPUs. The GPU implementation have shown an impressive speedup against equivalent OpenMp based parallel implementation on CPU, which means the micromagnetic simulations which require weeks of computation on CPU now can be performed very fast in hours or even in minutes on GPUs. In parallel I also worked on ordered queue management on GPUs. Ordered queue management is used in many applications including real-time systems, operating systems, and discrete event simulations. In most cases, the efficiency of an application itself depends on usage of a sorting algorithm for priority queues. Lately, the usage of graphic cards for general purpose computing has again revisited sorting algorithms. In this work i have presented the analysis of different sorting algorithms with respect to sorting time, sorting rate and speedup on different GPU and CPU architectures and provided a new sorting technique on GPU

    Parallel prefix operations on heterogeneous platforms

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo] As tarxetas gráficas, coñecidas como GPUs, aportan grandes vantaxes no rendemento computacional e na eficiencia enerxética, sendo un piar clave para a computación de altas prestacións (HPC). Sen embargo, esta tecnoloxía tamén é custosa de programar, e ten certos problemas asociados á portabilidade entre as diferentes tarxetas. Por autra banda, os algoritmos de prefixo paralelo son un conxunto de algoritmos paralelos regulares e moi empregados nas ciencias compuacionais, cuxa eficiencia é esencial en moita."3 aplicacións. Neste eiclo, aínda que as GPUs poden acelerar a computación destes algoritmos, tamén poden ser unha limitación cando non explotan axeitadamente o paralelismo da arquitectura CPU. Esta Tese presenta dúas perspectivas. Dunha parte, deséñanse novos algoritmos de prefixo paralelo para calquera paradigma de programación paralela. Pola outra banda, tamén se propón unha metodoloxÍa xeral que implementa eficientemente algoritmos de prefixo paralelos, de xeito doado e portable, sobre arquitecturas GPU CUDA, mais que se centrar nun algoritmo particular ou nun modelo concreto de tarxeta. Para isto, a metodoloxía identifica os paramétros da GPU que inflúen no rendemento e, despois, seguindo unha serie de premisas teóricas, obtéñense os valores óptimos destes parámetros dependendo do algoritmo, do tamaño do problema e da arquitectura GPU empregada. Ademais, esta Tese tamén prové unha serie de fUllciólls GPU compostas de bloques de código CUDA modulares e reutilizables, o que permite a implementación de calquera algoritmo de xeito sinxelo. Segundo o tamaño do problema, propóñense tres aproximacións. As dúas primeiras resolven problemas pequenos, medios e grandes nunha única GPU) mentras que a terceira trata con tamaños extremad8.1nente grandes, usando varias GPUs. As nosas propostas proporcionan uns resultados moi competitivos a nivel de rendemento, mellorando as propostas existentes na bibliografía para as operacións probadas: a primitiva sean, ordenación e a resolución de sistemas tridiagonais.[Resumen] Las tarjetas gráficas (GPUs) han demostrado gmndes ventajas en el rendimiento computacional y en la eficiencia energética, siendo una tecnología clave para la computación de altas prestaciones (HPC). Sin embargo, esta tecnología también es costosa de progTamar, y tiene ciertos problemas asociados a la portabilidad de sus códigos entre diferentes generaciones de tarjetas. Por otra parte, los algoritmos de prefijo paralelo son un conjunto de algoritmos regulares y muy utilizados en las ciencias computacionales, cuya eficiencia es crucial en muchas aplicaciones. Aunque las GPUs puedan acelerar la computación de estos algoritmos, también pueden ser una limitación si no explotan correctamente el paralelismo de la arquitectura CPU. Esta Tesis presenta dos perspectivas. De un lado, se han diseñado nuevos algoritmos de prefijo paralelo que pueden ser implementados en cualquier paradigma de programación paralela. Por otra parte, se propone una metodología general que implementa eficientemente algoritmos de prefijo paralelo, de forma sencilla y portable, sobre cualquier arquitectura GPU CUDA, sin centrarse en un algoritmo particular o en un modelo de tarjeta. Para ello, la metodología identifica los parámetros GPU que influyen en el rendimiento y, siguiendo un conjunto de premisas teóricas, obtiene los valores óptimos para cada algoritmo, tamaño de problema y arquitectura. Además, las funciones GPU proporcionadas están compuestas de bloques de código CUDA reutilizable y modular, lo que permite la implementación de cualquier algoritmo de prefijo paralelo sencillamente. Dependiendo del tamaño del problema, se proponen tres aproximaciones. Las dos primeras resuelven tamaños pequeños, medios y grandes, utilizando para ello una única GPU i mientras que la tercera aproximación trata con tamaños extremadamente grandes, usando varias GPUs. Nuestras propuestas proporcionan resultados muy competitivos, mejorando el rendimiento de las propuestas existentes en la bibliografía para las operaciones probadas: la primitiva sean, ordenación y la resolución de sistemas tridiagonales.[Abstract] Craphics Processing Units (CPUs) have shown remarkable advantages in computing performance and energy efficiency, representing oue of the most promising trends fúr the near-fnture of high perfonnance computing. However, these devices also bring sorne programming complexities, and many efforts are required tú provide portability between different generations. Additionally, parallel prefix algorithms are a 8et of regular and highly-used parallel algorithms, whose efficiency is crutial in roany computer sCience applications. Although GPUs can accelerate the computation of such algorithms, they can also be a limitation when they do not match correctly to the CPU architecture or do not exploit the CPU parallelism properly. This dissertation presents two different perspectives. Gn the Oile hand, new parallel prefix algorithms have been algorithmicany designed for any paranel progrannning paradigm. On the other hand, a general tuning CPU methodology is proposed to provide an easy and portable mechanism tú efficiently implement paranel prefix algorithms on any CUDA CPU architecture, rather than focusing on a particular algorithm or a CPU mode!. To accomplish this goal, the methodology identifies the GPU parameters which influence on the performance and, following a set oí performance premises, obtains the cOllvillient values oí these parameters depending on the algorithm, the problem size and the CPU architecture. Additionally, the provided CPU functions are composed of modular and reusable CUDA blocks of code, which allow the easy implementation of any paranel prefix algorithm. Depending on the size of the dataset, three different approaches are proposed. The first two approaches solve small and medium-large datasets on a single GPU; whereas the third approach deals with extremely large datasets on a Multiple-CPU environment. OUT proposals provide very competitive performance, outperforming the stateof- the-art for many parallel prefix operatiOllS, such as the sean primitive, sorting and solving tridiagonal systems

    Parallelization of Finite Element Analysis Codes Using Heterogeneous Distributed Computing

    Get PDF
    Performance gains in computer design are quickly consumed as users seek to analyze larger problems to a higher degree of accuracy. Innovative computational methods, such as parallel and distributed computing, seek to multiply the power of existing hardware technology to satisfy the computational demands of large applications. In the early stages of this project, experiments were performed using two large, coarse-grained applications, CSTEM and METCAN. These applications were parallelized on an Intel iPSC/860 hypercube. It was found that the overall speedup was very low, due to large, inherently sequential code segments present in the applications. The overall execution time T(sub par), of the application is dependent on these sequential segments. If these segments make up a significant fraction of the overall code, the application will have a poor speedup measure
    corecore