3 research outputs found

    Automatic Loop Tuning and Memory Management for Stencil Computations

    The Texas Instruments C66x Digital Signal Processor (DSP) is an embedded processor technology targeted at real-time signal processing. It also has strong potential to become a new generation of coprocessor technology for high-performance embedded computing. Of particular interest is its performance for stencil computations, such as those found in signal processing and computer vision tasks. A stencil is a loop in which the output value at each position of an array is updated by taking a weighted function of its neighbors. Efficiently mapping stencil-based kernels to the C66x device presents two challenges. The first is how to optimize loops so that they can use Single Instruction Multiple Data (SIMD) instructions. On this architecture, as on most others, SIMD instructions are not directly generated by the compiler. The second is how to manage on-chip memory in a way that minimizes off-chip memory access. Although this could theoretically be achieved by using a highly associative cache, the high rate of data reuse in stencil loops causes a high conflict miss rate. One way to solve this problem is to configure the on-chip memory as a program-controlled scratchpad, which allows the user to buffer a 2D block of data and minimize off-chip data access. For this dissertation, we have accomplished two goals: (1) develop a methodology for optimizing arbitrary 2D stencils that fully utilizes SIMD instructions through microarchitecture-aware loop unrolling; (2) deliver an easy-to-use scratchpad buffer management system and use it to improve the memory efficiency of 2D stencils. We show in the results and analysis section that our stencil compiler achieves up to a 2x speedup compared with the code generated by the industry-standard compiler developed by Texas Instruments, and that our memory management system achieves up to a 10x speedup compared with cache
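
    The stencil definition in the abstract above can be illustrated with a minimal sketch. This is not the dissertation's compiler output or its C66x code. It is a plain, unoptimized Python loop, and the function name `apply_stencil`, the square weight matrix, and the choice to copy border cells unchanged are all assumptions made for illustration.

```python
def apply_stencil(grid, weights):
    """Apply a square, odd-sized weighted stencil to the interior of a 2D grid.

    Hypothetical helper for illustration only: border cells that lack a
    full neighborhood are copied through unchanged (an assumption).
    """
    rows, cols = len(grid), len(grid[0])
    k = len(weights) // 2  # stencil radius, e.g. 1 for a 3x3 stencil
    out = [row[:] for row in grid]  # start from a copy so borders are kept
    for i in range(k, rows - k):
        for j in range(k, cols - k):
            acc = 0.0
            # Weighted sum over the neighborhood centered at (i, j).
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    acc += weights[di + k][dj + k] * grid[i + di][j + dj]
            out[i][j] = acc
    return out


if __name__ == "__main__":
    grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(apply_stencil(grid, box))
```

    Each output cell rereads most of the inputs used by its neighbors, which is the data reuse the abstract refers to when it discusses SIMD-friendly unrolling and scratchpad buffering.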

    Accelerated computation using runtime partial reconfiguration

    Runtime reconfigurable architectures, which integrate a hard processor core and a reconfigurable fabric on a single device, allow computations to be accelerated by hardware accelerators implemented in the reconfigurable fabric. Runtime partial reconfiguration provides the flexibility to change these hardware accelerators dynamically and so adapt the computing capacity of the system. This thesis evaluates design paradigms that exploit partial reconfiguration to implement compute-intensive applications on such architectures. For this purpose, image processing applications are implemented on the Zynq-7000, a System on a Chip (SoC) from Xilinx Inc. that integrates an ARM Cortex-A9 with a reconfigurable fabric. The thesis studies different image processing applications to select suitable candidates that benefit from being implemented on this class of reconfigurable architectures using runtime partial reconfiguration. Different Intellectual Property (IP) cores for executing basic image operations are generated using high-level synthesis. A software-based scheduler, running under Linux on the ARM core, implements the image processing application by loading the appropriate IP cores into the reconfigurable fabric. The implementation is evaluated to measure the application speedup, resource savings, power savings, and the delay caused by partial reconfiguration. The results suggest that using partial reconfiguration to implement an application provides FPGA resource savings. The extent of the resource savings depends on the granularity of the operations into which the application is decomposed. The thesis also establishes that runtime partial reconfiguration can be used to accelerate computations in reconfigurable architectures with a processor core, such as the Zynq-7000 platform. The achieved computational speedup depends on factors such as the number of hardware accelerators used for the computation and the reconfiguration schedule used. The thesis also highlights the power savings that may be achieved by executing computations in the reconfigurable fabric instead of the processor core.

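    Diseño e implementación de un conversor analógico digital escalable y parametrizable en una FPGA (Design and implementation of a scalable and parametrizable analog-to-digital converter on an FPGA)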

    The flexibility provided by FPGAs (Field Programmable Gate Arrays) permits the implementation of one or more analog-to-digital converters (ADCs), each configured with the resolution and sampling frequency required by the target application. This doctoral thesis presents two designs for implementing a scalable and parametrizable N-bit ADC on an FPGA. The first is based on the one-shot circuit and the second on a SAR (Successive Approximation Register). The first design is an N-bit ADC based on the one-shot circuit, which allows the ADC to be implemented from an RC circuit and logic gates. A systematic methodology is presented for designing the N-bit ADC from the desired resolution, sampling frequency, and input voltage range. The one-shot logic is synthesizable and parametrizable, uses few FPGA resources, and can be ported to other FPGA families, including low-cost ones. The second design is an N-bit SAR-based ADC built from several implementation modules: a pulse-width modulator (PWM), an analog low-pass filter (LPF), and an analog comparator. A systematic methodology is presented for choosing the LPF parameters for an ADC with specific characteristics (resolution and sampling frequency).
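
    The successive-approximation principle behind the SAR-based design can be sketched as a bit-by-bit binary search. This is not the thesis's implementation. It assumes an ideal comparator and an ideal reference whose voltage is exactly `v_ref * code / 2**n_bits` (standing in for a settled PWM plus low-pass filter output), and the name `sar_convert` is hypothetical.

```python
def sar_convert(v_in, v_ref, n_bits):
    """Return the N-bit code found by successive approximation.

    Assumptions (not from the thesis): the comparator is ideal and the
    reference voltage for a trial code is exactly v_ref * code / 2**n_bits.
    """
    code = 0
    # Try each bit from most significant to least significant.
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        v_dac = v_ref * trial / (1 << n_bits)  # idealized reference voltage
        if v_in >= v_dac:  # comparator decision: keep the bit if input is higher
            code = trial
    return code


if __name__ == "__main__":
    print(sar_convert(2.0, 4.0, 4))
```

    The loop runs once per bit, so an N-bit conversion needs N comparisons. In a PWM-based design each comparison would also have to wait for the filter output to settle, which is presumably why the choice of LPF parameters is tied to resolution and sampling frequency.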