49 research outputs found

    Continuous-Flow Matrix Transposition Using Memories

    In this paper, we analyze how to calculate the matrix transposition in continuous flow by using a memory or group of memories. The proposed approach studies this problem for specific conditions such as square and non-square matrices, use of limited-access memories, and use of several memories in parallel. Contrary to previous approaches, which are based on specific cases or examples, the proposed approach derives the fundamental theory involved in the problem of matrix transposition in a continuous flow. This allows for obtaining the exact equations for the read and write addresses of the memories and other control signals in the circuits. Furthermore, the cases that involve non-square matrices, which have not been studied in detail in the literature, are analyzed in depth in this paper. Experimental results show that the proposed approach is capable of transposing matrices of 8192 × 8192 32-bit data received in series at a rate of 200 megasamples per second, which doubles the throughput of previous approaches. © 2004-2012 IEEE
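    The core trick behind single-memory continuous-flow transposition can be illustrated in software. The toy Python model below is an illustration of the idea only, not the paper's hardware design or its general equations: each incoming sample is written into the address that the outgoing sample just vacated, so for square matrices the read-address sequence simply alternates between transposed and sequential order and a single memory of N^2 words suffices.

```python
import numpy as np

def stream_transpose(matrices, n):
    """Toy model: transpose a stream of n x n matrices through a
    single memory of n*n words. Each incoming sample is written to
    the address the outgoing sample just freed; the read order for
    pass k is the transpose permutation composed k times (for square
    matrices it alternates with the identity)."""
    size = n * n
    transpose_perm = np.arange(size).reshape(n, n).T.ravel()
    perm = np.arange(size)                    # identity before pass 1
    mem = matrices[0].ravel().copy()          # preload first matrix
    outputs = []
    for k in range(1, len(matrices) + 1):
        perm = perm[transpose_perm]           # read order for this pass
        incoming = matrices[k].ravel() if k < len(matrices) else None
        out = np.empty(size, dtype=mem.dtype)
        for i in range(size):
            addr = perm[i]
            out[i] = mem[addr]                # stream one sample out
            if incoming is not None:
                mem[addr] = incoming[i]       # write into the freed slot
        outputs.append(out.reshape(n, n))
    return outputs

ms = [np.arange(16).reshape(4, 4), np.arange(16, 32).reshape(4, 4)]
assert all(np.array_equal(o, m.T) for o, m in zip(stream_transpose(ms, 4), ms))
```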

    FPGA ARCHITECTURE FOR 2D DISCRETE FOURIER TRANSFORM BASED ON 2D DECOMPOSITION FOR LARGE-SIZED DATA

    Applications based on the Discrete Fourier Transform (DFT) are extensively used in various areas of signal and digital image processing. Of particular interest is the two-dimensional (2D) DFT, which is more computation- and bandwidth-intensive than the one-dimensional (1D) DFT. Traditionally, a 2D DFT is computed using Row-Column (RC) decomposition, where 1D DFTs are computed along the rows followed by 1D DFTs along the columns. Both application-specific and reconfigurable hardware have been used for high-performance implementations of the 2D DFT. However, architectures based on RC decomposition are not efficient for large input sizes due to memory bandwidth constraints. In this paper, we propose an efficient architecture to implement the 2D DFT for large-sized input data based on a novel 2D decomposition algorithm. This architecture achieves very high throughput by exploiting the inherent parallelism due to the algorithm decomposition and by utilizing the row-wise burst access pattern of the external memory. A high-throughput memory interface has been designed to enable maximum utilization of the memory bandwidth. In addition, an automatic system generator is provided for mapping this architecture onto a reconfigurable platform of Xilinx Virtex-5 devices. For a 2K x 2K input size, the proposed architecture is 1.96x faster than an RC decomposition based implementation under the same memory constraints, and also outperforms other existing implementations.
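    For reference, the baseline Row-Column decomposition that the paper improves upon is straightforward to state in software; the numpy sketch below shows that baseline only, not the proposed 2D decomposition. The column-wise second pass is what generates the strided memory accesses that defeat row-burst DRAM for large inputs.

```python
import numpy as np

def dft_2d_rc(x):
    """Row-Column decomposition of the 2D DFT: 1D FFTs along every
    row, then 1D FFTs along every column of the intermediate result."""
    rows = np.fft.fft(x, axis=1)      # 1D DFTs along rows (burst-friendly)
    return np.fft.fft(rows, axis=0)   # 1D DFTs along columns (strided)

x = np.random.rand(8, 8)
assert np.allclose(dft_2d_rc(x), np.fft.fft2(x))
```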

    Architecture-Aware Optimization on a 1600-core Graphics Processor

    The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, comparatively little exists for OpenCL on the AMD GPU. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU optimizations by applying each optimization in isolation as well as in concert to a large-scale molecular modeling application called GEM. With these AMD-specific GPU optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the well-known NVIDIA-specific optimizations.
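    One of the listed optimizations, branch removal, can be illustrated outside OpenCL. The numpy sketch below is only an analogue (the paper's kernels are OpenCL, not Python): a data-dependent branch is replaced by computing both paths and selecting between them, which is what predication does on a GPU to keep SIMD lanes converged.

```python
import numpy as np

r = np.random.rand(1024)  # per-element inputs, standing in for work-items

def branchy(r):
    # Divergent formulation: each element takes a different path.
    out = np.empty_like(r)
    for i, v in enumerate(r):
        out[i] = v * 2.0 if v > 0.5 else v + 1.0
    return out

def branch_free(r):
    # Predicated formulation: evaluate both paths, then select.
    return np.where(r > 0.5, r * 2.0, r + 1.0)

assert np.allclose(branchy(r), branch_free(r))
```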

    Memory system support for image processing

    Processor speeds are increasing rapidly, but memory speeds are not keeping pace. Image processing is an important application domain that is particularly impacted by this growing performance gap. Image processing algorithms tend to have poor memory locality because they access their data in a non-sequential fashion and reuse that data infrequently. As a result, they often exhibit poor cache and TLB hit rates on conventional memory systems, which limits overall performance. Most current approaches to addressing the memory bottleneck focus on modifying cache organizations or introducing processor-based prefetching. The Impulse memory system takes a different approach: allowing application software to control how, when, and where data are loaded into a conventional processor cache. Impulse does this by letting software configure how the memory controller interprets the physical addresses exported by a processor, introducing an extra level of address translation in the memory controller. Data that is sparse in memory can thus be accessed densely, which improves both cache and TLB utilization, and Impulse hides memory latency by prefetching data within the memory controller. We describe how Impulse improves the performance of three image processing algorithms: an Impulse memory system yields speedups of 40% to 226% over an otherwise identical machine with a conventional memory system.
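    The kind of remapping Impulse performs can be modeled in software. In the hedged sketch below, the strided diagonal of a matrix is gathered through a remapped index vector, standing in for the memory controller presenting a dense "shadow" of sparse data; in the real system this happens inside the memory controller via configured physical-address translation, not in application code.

```python
import numpy as np

n = 4
A = np.arange(n * n, dtype=np.float32).reshape(n, n)

# Conventional access: the diagonal has stride n+1 in memory, so each
# element can land in its own cache line.
diag_strided = A.ravel()[::n + 1]

# Impulse-style view: a remapping (here an explicit index vector)
# gathers those addresses so the processor sees one dense array.
remap = np.arange(n) * (n + 1)
diag_dense = A.ravel()[remap]

assert np.array_equal(diag_strided, diag_dense)
```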

    Castell: a heterogeneous cmp architecture scalable to hundreds of processors

    Technology improvements and power constraints have led multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator-based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but at the cost of high programming effort. We propose Castell, a scalable chip multiprocessor architecture that can be programmed like a uniprocessor and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to providing programmability and the design of the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation that applications suffer under large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and ClustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers and architecture support for task-based programming models.
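    Double buffering itself is a general pattern and can be sketched independently of Castell; in the minimal Python model below, fetch and compute are hypothetical stand-ins for a DMA transfer and a task body. While tile k is being processed, the transfer for tile k+1 is already in flight, so transfer latency is hidden behind computation.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fetch(tile_id):
    """Stand-in for a DMA transfer bringing one tile into local store."""
    return np.full(1024, float(tile_id))

def compute(tile):
    """Stand-in for the task body that consumes a tile."""
    return tile.sum()

def process(num_tiles):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, 0)              # prime the first buffer
        for k in range(num_tiles):
            tile = pending.result()                 # wait for tile k
            if k + 1 < num_tiles:
                pending = dma.submit(fetch, k + 1)  # start next transfer
            results.append(compute(tile))           # overlaps the transfer
    return results

print(process(4))
```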

    The Potential for a GPU-Like Overlay Architecture for FPGAs

    We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) and well-matched to the strengths of FPGAs, namely highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath.
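    The arithmetic behind hiding a 64-stage pipeline with 256 thread contexts can be checked with a toy scheduler. The simulation below assumes the worst case, where each thread's next operation depends on its previous one; the parameters are illustrative rather than taken from the paper. Utilization approaches 100% once the number of ready threads reaches the pipeline depth.

```python
from collections import deque

PIPELINE_DEPTH = 64  # stages in the floating-point pipeline

def utilization(num_threads, ops_per_thread):
    """Round-robin issue from many thread contexts; a thread re-enters
    the ready queue only when its operation retires 64 cycles later."""
    ready = deque(range(num_threads))
    in_flight = deque()                            # (retire_cycle, thread)
    remaining = [ops_per_thread] * num_threads
    cycle = issued = 0
    while any(remaining) or in_flight:
        while in_flight and in_flight[0][0] <= cycle:
            ready.append(in_flight.popleft()[1])   # op retired
        while ready and remaining[ready[0]] == 0:
            ready.popleft()                        # thread finished
        if ready:
            t = ready.popleft()
            remaining[t] -= 1
            issued += 1
            in_flight.append((cycle + PIPELINE_DEPTH, t))
        cycle += 1
    return issued / cycle

print(utilization(256, 100))  # ~1.0: pipeline latency fully hidden
print(utilization(8, 100))    # ~0.125: only 8 of 64 stages kept busy
```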

    Hardware/Software Co-design for Multicore Architectures

    Transferred from Doria.

    Implementing Real-Time Video Deblocking in FPGA Hardware

    Video compression techniques are commonly used to meet the increasing demands for the storage and transmission of digital video content. Popular video compression techniques such as MPEG video encoding make use of block-transform coding algorithms which are susceptible to blocking artifacts. These artifacts can be reduced using a deblocking process, of which there are many. However, those deblocking algorithms which provide noticeable improvements in visual quality also tend to be computationally expensive and unsuitable for real-time video use. This dissertation selects and examines an appropriate algorithm for real-time video deblocking applications, and describes its hardware implementation on an Altera Cyclone II FPGA. The chosen algorithm is based on the concept of shifted thresholding; it reduces computational complexity by several means, such as by using only integer arithmetic and by replacing division operations with bit shifting. The implementation leverages the reduced hardware complexity of the chosen algorithm to cost-effectively implement real-time video deblocking.
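    The complexity reductions mentioned above (integer-only arithmetic, shifts in place of division) can be sketched generically. The function below is a hypothetical illustration in that spirit rather than the dissertation's selected shifted-thresholding algorithm: a boundary discontinuity below a threshold is treated as a blocking artifact and smoothed, with the division by 4 replaced by an arithmetic shift.

```python
def smooth_block_edge(left, right, threshold=16):
    """Hypothetical integer-only deblocking step for the two pixel
    columns adjacent to a block boundary. Small discontinuities are
    smoothed; large ones are kept, since they are likely real edges.
    Note that >> on negative ints rounds toward minus infinity."""
    out_l, out_r = list(left), list(right)
    for i, (a, b) in enumerate(zip(left, right)):
        delta = b - a
        if abs(delta) < threshold:   # artifact, not a real edge
            step = delta >> 2        # delta / 4 without a divider
            out_l[i] = a + step
            out_r[i] = b - step
    return out_l, out_r

# The 8-level step is smoothed; the 32-level step (a real edge) is not.
print(smooth_block_edge([100, 102, 98], [108, 103, 130]))
```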

    On the programmability of multi-GPU computing systems

    Multi-GPU systems are widely used in High Performance Computing environments to accelerate scientific computations. This trend is expected to continue as integrated GPUs are introduced into processors used in multi-socket servers and servers pack a higher number of GPUs per node. GPUs are currently connected to the system through the PCI Express interconnect, which provides limited bandwidth (compared to the bandwidth of GPU memory) and often becomes a bottleneck for performance scalability. Current programming models present GPUs as isolated devices with their own memory, even if they share the host memory with the CPU. Programmers explicitly manage allocations in all GPU memories and use primitives to communicate data between GPUs. Furthermore, programmers are required to use mechanisms such as command queues and inter-GPU synchronization. This explicit model harms the maintainability of the code and introduces new sources of potential errors. The first proposal of this thesis is the HPE model. HPE builds a simple, consistent programming interface based on three major features. (1) All device address spaces are combined with the host address space to form a Unified Virtual Address Space. (2) Programs are provided with an Asymmetric Distributed Shared Memory system for all the GPUs in the system, which allows memory objects to be allocated that can be accessed by any GPU or CPU. (3) Every CPU thread can request a data exchange between any two GPUs through simple memory copy calls. Such a simple interface allows HPE to always provide the optimal implementation, eliminating the need for application code to handle different system topologies. Experimental results show improvements on real applications that range from 5% in compute-bound benchmarks to 2.6x in communication-bound benchmarks. HPE transparently implements sophisticated communication schemes that can deliver up to a 2.9x speedup in I/O device transfers. The second proposal of this thesis is a shared memory programming model that exploits the new GPU capabilities for remote memory accesses to remove the need for explicit communication between GPUs. This model turns a multi-GPU system into a shared memory system with NUMA characteristics. In order to validate the viability of the model, we also perform an exhaustive performance analysis of remote memory accesses over PCIe. We show that the unique characteristics of the GPU execution model and memory hierarchy help to hide the costs of remote memory accesses. Results show that PCI Express 3.0 is able to hide the costs of up to 10% of remote memory accesses depending on the access pattern, while caching of remote memory accesses can have a large impact on kernel performance. Finally, we introduce AMGE, a programming interface, compiler support, and runtime system that automatically executes computations programmed for a single GPU across all the GPUs in the system. The programming interface provides a data type for multidimensional arrays that allows for robust, transparent distribution of arrays across all GPU memories. The compiler extracts the dimensionality information from the type of each array and is able to determine the access pattern in each dimension of the array. The runtime system uses the compiler-provided information to automatically choose the best computation and data distribution configuration to minimize inter-GPU communication and memory footprint.
    This model effectively frees programmers from the task of decomposing and distributing computation and data to exploit several GPUs. AMGE achieves almost linear speedups for a wide range of dense computation benchmarks on a real 4-GPU system with an interconnect of moderate bandwidth. We show that irregular computations can benefit from AMGE as well.
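    The first HPE feature, a unified virtual address space that lets a single copy call hide the system topology, can be modeled abstractly. The Python toy below is purely illustrative (the class and method names are invented, not HPE's actual API): because every allocation owns a unique address range, copy() can recover both owning devices and pick the transfer path without involving the caller.

```python
class UnifiedAddressSpace:
    """Toy model: one flat address space over several device memories."""
    def __init__(self):
        self.next_addr = 0
        self.ranges = []   # (start, end, device, backing buffer)

    def alloc(self, device, nbytes):
        start = self.next_addr
        self.next_addr += nbytes
        self.ranges.append((start, start + nbytes, device, bytearray(nbytes)))
        return start       # a "pointer" valid regardless of device

    def _find(self, addr):
        for start, end, dev, buf in self.ranges:
            if start <= addr < end:
                return dev, buf, addr - start
        raise ValueError("unmapped address")

    def copy(self, dst, src, nbytes):
        ddev, dbuf, doff = self._find(dst)
        sdev, sbuf, soff = self._find(src)
        # A real runtime would choose peer DMA, host staging, etc.,
        # based on the (sdev, ddev) topology; callers never see it.
        dbuf[doff:doff + nbytes] = sbuf[soff:soff + nbytes]

uas = UnifiedAddressSpace()
a = uas.alloc("gpu0", 64)
b = uas.alloc("gpu1", 64)
uas.copy(b, a, 64)   # one call, wherever a and b live
```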

    The Case for Polymorphic Registers in Dataflow Computing

    Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data-parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve throughput by up to 56.17x and show that the PRF-augmented system outperforms the GPU for 9×9 or larger mask sizes, even in bandwidth-constrained systems.
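    Separable 2D convolution, the case study used here, factors a k x k mask into two 1D passes; the numpy/scipy sketch below shows the equivalence. The factored form cuts the arithmetic from k^2 to 2k multiplies per pixel, but each pass still streams the whole image, which is why memory latency and bandwidth, the PRF's targets, dominate for large masks.

```python
import numpy as np
from scipy.signal import convolve2d

def separable_conv2d(image, col_kernel, row_kernel):
    """Two 1D passes instead of one 2D pass with a k x k mask."""
    tmp = convolve2d(image, col_kernel.reshape(-1, 1), mode="same")
    return convolve2d(tmp, row_kernel.reshape(1, -1), mode="same")

img = np.random.rand(64, 64)
g = np.array([1.0, 2.0, 1.0]) / 4.0   # 1D factor of a 3x3 blur
full = np.outer(g, g)                 # the equivalent 3x3 mask
assert np.allclose(separable_conv2d(img, g, g),
                   convolve2d(img, full, mode="same"))
```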