75 research outputs found

    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, which are conventionally programmed through RTL design. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, programmers still face the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs to high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template for accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels and, more importantly, drastically reduce the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to fully automate accelerator generation. AutoAccel accepts a software program as input and performs a series of code transformations, based on the result of the analytical-model-based design space exploration, to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
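
    To make the exploration step concrete, here is a minimal Python sketch of analytical-model-driven design space exploration in the spirit of AutoAccel. The model formulas, parameter names and resource budget are illustrative assumptions, not the paper's actual CPP model.

```python
# Minimal sketch of analytical-model-based design space exploration.
# All model forms and constants are illustrative assumptions.
from itertools import product

WORK = 1_000_000      # hypothetical total kernel iterations
LUT_BUDGET = 200_000  # hypothetical FPGA resource budget

def perf_model(unroll, ii, tile):
    # Estimated cycles: work split across parallel PEs, throttled by
    # the pipeline initiation interval (II).
    return WORK / (unroll * tile) * ii

def resource_model(unroll, ii, tile):
    # Estimated LUTs: grow with compute parallelism and tile buffering.
    return 5_000 * unroll + 200 * tile

def explore():
    best = None
    for unroll, ii, tile in product((1, 2, 4, 8, 16), (1, 2, 4), (8, 16, 32, 64)):
        if resource_model(unroll, ii, tile) > LUT_BUDGET:
            continue  # prune configurations that do not fit on the device
        cycles = perf_model(unroll, ii, tile)
        if best is None or cycles < best[0]:
            best = (cycles, {"unroll": unroll, "ii": ii, "tile": tile})
    return best

print(explore())
```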

    High Performance Post-Quantum Key Exchange on FPGAs

    Lattice-based cryptography is a highly promising candidate for protecting against the threat of quantum attack. At Usenix Security 2016, Alkim, Ducas, Pöppelmann, and Schwabe proposed a post-quantum key exchange scheme called NewHope, based on a variant of the lattice problem known as ring learning with errors (RLWE). In this work, we propose a high performance hardware architecture for NewHope. Our implementation requires 6,680 slices, 9,412 FFs, 18,756 LUTs, 8 DSPs and 14 BRAMs on a Xilinx Zynq-7000 equipped with a 28nm Artix-7 7020 FPGA. In our hardware design of NewHope key exchange, the three phases of the key exchange cost 51.9, 78.6 and 21.1 microseconds, respectively. It achieves a more than 4.8x better area-time product compared to the previous hardware implementation of NewHope-Simple by Oder and Güneysu at Latincrypt 2017.
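
    The RLWE key-agreement skeleton underlying NewHope can be illustrated in a few lines of Python. The toy below uses deliberately tiny parameters and omits the NTT-based polynomial arithmetic, binomial sampling, encoding and reconciliation that the actual scheme (and this hardware design) implements.

```python
import random

N, Q = 8, 257  # toy parameters; NewHope uses n = 1024, q = 12289

def poly_mul(a, b):
    # Schoolbook multiplication in Z_q[x]/(x^N + 1) (negacyclic wrap).
    res = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                res[k] = (res[k] + a[i] * b[j]) % Q
            else:
                res[k - N] = (res[k - N] - a[i] * b[j]) % Q
    return res

def add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def small_poly():
    # Small secret/error coefficients; NewHope samples a centered binomial.
    return [random.randint(-2, 2) % Q for _ in range(N)]

a = [random.randrange(Q) for _ in range(N)]   # shared public parameter

s_alice, e_alice = small_poly(), small_poly()
b_alice = add(poly_mul(a, s_alice), e_alice)  # Alice's public value

s_bob, e_bob = small_poly(), small_poly()
b_bob = add(poly_mul(a, s_bob), e_bob)        # Bob's public value

# Both sides now hold approximately a*s_alice*s_bob; the results differ
# only in small error terms, which the real protocol reconciles away.
k_alice = poly_mul(b_bob, s_alice)
k_bob = poly_mul(b_alice, s_bob)
```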

    The Non Uniform Adaptive Angular Spectrum Method and its Application to Neural Network Assisted Coherent Beam Combining

    Increasing the number of laser beams that can be coherently combined requires accurate and fast algorithms for compensating phase and alignment errors. This paper proposes to use a Fully Connected Artificial Neural Network (FCANN) to correct beam positioning perturbations by evaluating the beam shifts and tilts from two images taken at slightly different locations. Since it is practically impossible to gather an experimental dataset large enough to train the neural network, this approach required developing an accurate and fast simulation method to evaluate beam propagation in arbitrary directions, overcoming the limitations that arise when the computation must be repeated a large number of times. The numerical approach is a variant of the Angular Spectrum (AS) method, called the Non Uniform ADaptive Angular Spectrum (NUADAS) method, which relies on the combination of non-uniform and adaptive Fourier transform algorithms to compute an arbitrary field distribution in a plane that is shifted and tilted with respect to the source. The parallel implementation of the NUADAS method is discussed and the numerical and experimental validations are presented. An FCANN is then trained using the synthetic dataset generated with the NUADAS method and the results are discussed, demonstrating the viability of the proposed approach not only for coherent beam combining, but also in other beam alignment applications.
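
    For context, the baseline uniform angular spectrum method that NUADAS generalizes can be sketched as below in Python. This version only propagates to a parallel plane; the paper's non-uniform and adaptive transforms for shifted and tilted target planes are not shown.

```python
import numpy as np

def angular_spectrum_propagate(u0, wavelength, dx, z):
    # Standard AS method: FFT to the spectrum, multiply by the free-space
    # transfer function, inverse FFT back to the spatial domain.
    n = u0.shape[0]
    k = 2 * np.pi / wavelength
    fx = np.fft.fftfreq(n, d=dx)
    fxx, fyy = np.meshgrid(fx, fx)
    kz_sq = k**2 - (2 * np.pi * fxx)**2 - (2 * np.pi * fyy)**2
    kz = np.sqrt(np.maximum(kz_sq, 0.0))
    h = np.exp(1j * kz * z)
    h[kz_sq < 0] = 0.0          # drop evanescent components
    return np.fft.ifft2(np.fft.fft2(u0) * h)

# Example: propagate a Gaussian beam 1 mm at 1064 nm on a 10 um grid.
n = 256
x = (np.arange(n) - n / 2) * 10e-6
xx, yy = np.meshgrid(x, x)
beam = np.exp(-(xx**2 + yy**2) / (200e-6)**2)
out = angular_spectrum_propagate(beam, 1064e-9, 10e-6, 1e-3)
```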

    High Performance Non-uniform FFT on Modern x86-based Multi-core Systems

    The Non-Uniform Fast Fourier Transform (NUFFT) is a generalization of the FFT to non-equidistant samples. It has many applications, ranging from medical imaging to radio astronomy to the numerical solution of partial differential equations. Despite recent advances in speeding up the NUFFT on various platforms, its practical applications are still limited by its high computational cost, which is dominated by the convolution of a signal between non-uniform and uniform grids. The computational cost of the NUFFT is particularly detrimental in cases which require fast reconstruction times, such as iterative 3D non-Cartesian MRI reconstruction. We propose a novel and highly scalable parallel algorithm for performing the NUFFT on x86-based multi-core CPUs. The high performance of our algorithm relies on good SIMD utilization and high parallel efficiency. On convolution, we demonstrate on average 90% SIMD efficiency using SSE, as well as up to linear scalability on a quad-socket 40-core system based on Intel Xeon E7-4870 processors. As a result, on a dual-socket Intel Xeon X5670 based server, our NUFFT implementation is more than 4x faster than the best available NUFFT3D implementation when run on the same hardware. On an Intel Xeon E5-2670 based server, our NUFFT implementation is 1.5x faster than any published NUFFT implementation today. Such speed improvement opens new uses for the NUFFT. For example, iterative multichannel reconstruction of a 240x240x240 image could execute in just over 3 minutes, which is on the same order as contemporary non-iterative (and thus less accurate) 3D NUFFT-based MRI reconstructions.
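
    The spreading (gridding) convolution that dominates NUFFT runtime can be sketched in a few lines of Python. The truncated Gaussian kernel and parameters below are illustrative stand-ins, not the tuned kernels of an optimized implementation.

```python
import numpy as np

def grid_convolve(x, c, n_grid, width=4):
    # Spread nonuniform samples c at coordinates x in [0, 1) onto a
    # uniform grid with a truncated (illustrative) Gaussian kernel.
    grid = np.zeros(n_grid, dtype=complex)
    half = width // 2
    for xi, ci in zip(x, c):
        center = int(xi * n_grid)
        for m in range(center - half, center + half + 1):
            d = xi * n_grid - m                       # distance in grid units
            grid[m % n_grid] += np.exp(-d * d) * ci   # periodic wrap
    return grid

# A standard FFT plus a deapodization (kernel-correction) step would
# complete a type-1 NUFFT after this convolution.
x = np.random.rand(1000)
c = np.random.randn(1000) + 1j * np.random.randn(1000)
spectrum = np.fft.fft(grid_convolve(x, c, 256))
```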

    Streaming Architectures for Medical Image Reconstruction

    Non-invasive imaging modalities have recently seen increased use in clinical diagnostic procedures. Unfortunately, emerging computational imaging techniques, such as those found in 3D ultrasound and iterative magnetic resonance imaging (MRI), are severely limited by the high computational requirements and poor algorithmic efficiency of current parallel hardware, often leading to significant delays before a doctor or technician can review the image, which can negatively impact patients in need of fast, highly accurate diagnosis. To make matters worse, the high raw data bandwidth found in 3D ultrasound requires on-chip volume reconstruction within a tight power dissipation budget: dissipation of more than 5 W may burn the skin of the patient. The tight power constraints and high volume rates required by emerging applications demand orders-of-magnitude improvement over state-of-the-art systems in terms of both reconstruction time and energy efficiency.

    The goal of the research outlined in this dissertation is to reduce the time and energy required to perform medical image reconstruction through software/hardware co-design. By analyzing algorithms with a hardware-centric focus, we develop novel algorithmic improvements which simultaneously reduce computational requirements and map more efficiently to traditional hardware architectures. We then design and implement hardware accelerators which push the new algorithms to their full potential.

    In the first part of this dissertation, we characterize the performance bottlenecks of high-volume-rate 3D ultrasound imaging. By analyzing the 3D plane-wave ultrasound algorithm, we reduce computational and storage requirements with Delay Compression. Delay Compression recognizes additional symmetry in the planar transmission scheme found in 2D, 3D, and 3D-Separable plane-wave ultrasound implementations, enabling on-chip storage of the reconstruction constants for the first time and eliminating the most power-intensive component of the reconstruction process. We then design and implement Tetris, a streaming hardware accelerator for 3D-Separable plane-wave ultrasound. Tetris is enabled by the Tetris Reservation Station, a novel 2D register file that buffers incomplete voxels and eliminates the need for a traditional load-and-store memory interface. Utilizing a fully pipelined architecture, Tetris reconstructs volumes at physics-limited rates (i.e., limited by the physical propagation speed of sound through tissue).

    Next, we review a core component of several computational imaging modalities, the Non-uniform Fast Fourier Transform (NuFFT), focusing on its use in MRI reconstruction. We find that the non-uniform interpolation step therein requires over 99% of the reconstruction time due to poor spatial and temporal memory locality. While prior work has made great strides in improving the performance of the NuFFT, the most common algorithmic optimization severely limits the available parallelism, causing it to map poorly to the massively parallel processing available in modern GPUs and FPGAs. To this end, we create Slice-and-Dice, a processing model which enables efficient mapping of the NuFFT's most computationally intensive component onto traditional parallel architectures. We then demonstrate the full acceleration potential of Slice-and-Dice with Jigsaw, a custom hardware accelerator which performs the non-uniform interpolations found in the NuFFT in time approximately linear in the number of non-uniform samples, irrespective of sampling pattern, uniform grid size, or interpolation kernel width.

    The algorithms and architectures herein enable faster, more efficient medical image reconstruction without sacrificing image quality. By decreasing the time and energy required for image reconstruction, our work opens the door for future exploration into higher-resolution imaging and emerging, computationally complex reconstruction algorithms which improve the speed and quality of patient diagnosis.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/167986/1/westbl_1.pd
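
    As a rough illustration of why the interpolation step maps poorly to parallel hardware, the gridding loop below scatters writes across the grid; one generic mitigation is to bin samples by the grid region they touch so that most of each region's writes become local. This is only a common related idea sketched in Python, not the dissertation's Slice-and-Dice model, and neighboring bins still overlap at region boundaries.

```python
import numpy as np
from collections import defaultdict

def binned_gridding(x, c, n_grid, width=4):
    # Group samples by the grid region they touch; each bin then updates
    # a mostly-local stretch of the grid. Boundary columns still overlap
    # between adjacent bins and would need merging or synchronization.
    half = width // 2
    bins = defaultdict(list)
    for xi, ci in zip(x, c):
        bins[int(xi * n_grid) // width].append((xi, ci))

    grid = np.zeros(n_grid, dtype=complex)
    for _, samples in sorted(bins.items()):
        for xi, ci in samples:
            center = int(xi * n_grid)
            for m in range(center - half, center + half + 1):
                d = xi * n_grid - m
                grid[m % n_grid] += np.exp(-d * d) * ci
    return grid
```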

    Contributions to the design of Fourier-optical modulation systems based on micro-opto-electro-mechanical tilt-mirror arrays

    Spatial Light Modulators (SLMs) based on Micro-Opto-Electro-Mechanical Systems (MOEMS) are increasingly being used in various fields of optics and enable novel functionalities. The technology offers frame rates from a few kHz to the MHz range as well as resolutions in the megapixel range. The field continues to make rapid progress, but technological advancement always comes at high expense. Against this background, this dissertation addresses the question: what contribution can optical system design make to the further development of MOEMS-SLM-based modulation? A lens is a simple example of an optical system. This dissertation deals with system design based on Fourier optics, in which the wave properties of light are exploited. On this basis, arrays of micromirrors can modulate light properties in a spatially resolved manner. For example, tilt-mirrors can control the intensity distribution in an image plane. This dissertation investigates variations of the aperture required for this. In addition to known absorbing apertures, phase filters in particular are investigated, which apply a spatially distributed delay to the light wave. This dissertation proposes the combination of MOEMS-SLMs with static, pixelated elements in the same system. These may be pixelated phase masks, also known as diffractive optical elements (DOEs); pixelated polarizer arrays and absorbing photomasks exist analogously. The combination of SLMs and static elements allows new degrees of freedom in system design. This thesis proposes new modulation systems based on MOEMS tilt-mirror SLMs. These systems use analog tilt-mirror arrays for the simultaneous modulation of intensity and phase, as well as intensity and polarization. The proposed systems thus open up new possibilities for MOEMS-based spatial light modulation. Their properties are validated and investigated by numerical simulations, and system properties and limitations are derived from these near- and far-field simulations. This dissertation shows that the modulation behavior of different MOEMS-SLM types can be fundamentally changed through system design. Piston-mirror arrays are classically used for phase modulation and tilt-mirror arrays for intensity modulation. This thesis proposes the use of subpixel phase structures, which approximately provide tilt-mirrors with the phase-modulating effect of piston-mirrors. To achieve this, a new optimization method is presented. Piston-mirror arrays are available only to a limited extent, whereas tilt-mirror arrays are well established. In combination with subpixel phase features, tilt-mirrors may therefore replace piston-mirrors in some applications. These and other challenges of MOEMS-SLM technology can be adequately addressed through system design.
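
    The Fourier-plane filtering at the heart of such systems follows the textbook 4f arrangement, sketched below in Python; the absorbing aperture and phase filter are generic illustrations, not the optimized filters of the dissertation.

```python
import numpy as np

def fourier_plane_filter(field, pupil):
    # 4f sketch: one lens maps the field to its spectrum, the pupil acts
    # in the Fourier plane, a second lens transforms back. 'pupil' may be
    # an absorbing aperture (real, 0..1) or a phase filter (unit modulus).
    spectrum = np.fft.fftshift(np.fft.fft2(field))
    return np.fft.ifft2(np.fft.ifftshift(spectrum * pupil))

n = 256
fy, fx = np.meshgrid(np.arange(n) - n / 2, np.arange(n) - n / 2, indexing="ij")
r = np.hypot(fx, fy)
absorbing_aperture = (r < 30).astype(float)            # blocks high orders
phase_filter = np.exp(1j * 2 * np.pi * (r / n) ** 2)   # delays, no absorption

field = np.random.randn(n, n)
low_passed = fourier_plane_filter(field, absorbing_aperture)
phase_shaped = fourier_plane_filter(field, phase_filter)
```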

    Multiplierless CSD techniques for high performance FPGA implementation of digital filters.

    Implementation of digital signal processing (DSP) algorithms in hardware, such as field programmable gate arrays (FPGAs), requires a large number of multipliers. Fast, low-area multiply-adds have become critical in modern commercial and military DSP applications. In many contemporary real-time DSP and multimedia applications, system performance is severely impacted by the limitations in speed, energy efficiency, and area of currently available onboard silicon multipliers.

    My research focus is on two key ideas for improving DSP performance: (1) develop new high performance, efficient shift-add ("multiplierless") techniques to implement multiply-add operations without the need for a traditional multiplier structure; and (2) leverage the growing trend toward design prototyping and even production in FPGAs, as opposed to dedicated DSP processors or ASICs, synergistically with the new multiplierless structures to improve performance. My work is based on a dramatic new technique for converting between 2's complement and CSD number systems, and results in high-performance structures that are particularly effective for implementing adaptive systems in reconfigurable logic.

    Adaptive system implementations require real-time conversion of coefficients to Canonical Signed Digit (CSD) or similar representations to benefit from multiplierless techniques for implementing filters. Multiplierless approaches are used to reduce the hardware and increase the throughput. This dissertation introduces the first non-iterative hardware algorithm to convert 2's complement numbers to their CSD representations (FastCSD) using a fixed number of shift and logic operations. As a result, the power consumption and area requirements for hardware implementation of DSP algorithms in which the coefficients are not known a priori can be greatly reduced. Because all CSD digits are produced simultaneously, the conversion speed and thus the throughput are improved when compared to overlap-and-scan techniques such as Booth's recoding.

    I leverage FastCSD to develop a new, high performance iterative multiplierless structure based on a novel real-time CSD recoding, so that more zero partial products are introduced: up to 66.7% zero partial products occur, compared to 50% in the traditional modified Booth's recoding. This structure also reduces the non-zero partial products to a minimum. As a result, the number of arithmetic operations in the carry-save structure is reduced, so an overall speed-up as well as low power consumption can be achieved. Furthermore, because the proposed structure involves real-time CSD recoding and does not require a fixed multiplier input to be known a priori, the proposed multiplier can be applied to implement digital filters with non-fixed filter coefficients, such as adaptive filters.

    I also introduce a new multi-input CSD multiplier unit, which requires fewer shift/add/subtract operations and reduced CSD number conversion overhead compared to existing techniques. This results in reduced power consumption and area requirements in the hardware implementation of DSP algorithms. Furthermore, because all the products are produced simultaneously, the multiplication speed and thus the throughput are improved. The multi-input multiplier unit is applied to implement digital filters with non-fixed filter coefficients, such as adaptive filters. The implementation cost of these digital filters can be further reduced by limiting the wordlength of the input signal with little or no sacrifice in filter performance, which is confirmed by my simulation results. The proposed multiplier unit can also be applied to other DSP algorithms, such as digital filter banks or matrix and vector multiplications.

    Finally, the tradeoff between filter order and coefficient length in the design and implementation of high-performance filters in FPGAs is discussed. Non-minimum order FIR filters are designed for implementation using CSD multiplierless techniques. By increasing the filter order, the length of the coefficients can be decreased without reducing filter performance; thus, an overall hardware savings can be achieved.
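
    For contrast with the non-iterative FastCSD, the classic iterative 2's-complement-to-CSD recoding and the resulting multiplierless shift-add multiply can be sketched as follows in Python; this shows the standard textbook conversion, not the dissertation's hardware algorithm.

```python
def to_csd(n):
    # Classic iterative conversion to CSD digits in {-1, 0, +1} with no
    # two adjacent nonzero digits; least-significant digit first.
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def csd_multiply(x, coeff):
    # Multiplierless multiply: shifts and adds/subtracts driven by the
    # CSD digits of the coefficient.
    acc = 0
    for shift, d in enumerate(to_csd(coeff)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc

assert to_csd(7) == [-1, 0, 0, 1]  # 7 = 8 - 1: one add, one subtract
assert csd_multiply(13, 7) == 91
```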

    Increasing the Efficiency of Doppler Processing and Backend Processing in Medical Ultrasound Systems

    Ultrasound imaging is one of the major medical imaging modalities. It is cheap, non-invasive and has low power consumption. Doppler processing is an important part of many ultrasound imaging systems; it provides blood velocity information and is built on top of B-mode systems. We investigate the performance of two velocity estimation schemes used in Doppler processing systems, namely directional velocity estimation (DVE) and conventional velocity estimation (CVE). We find that DVE provides better estimation performance and is the only functioning method when the beam-to-flow angle is large. Unfortunately, DVE is computationally expensive and also requires divisions and square root operations that are hard to implement. We propose two approximation techniques to replace these computations. Simulation results on cyst images show that the proposed approximations do not affect the estimation performance. We also study backend processing, which includes envelope detection, log compression and scan conversion. Three different envelope detection methods are compared. Among them, the FIR-based Hilbert transform is considered the best choice when phase information is not needed, while quadrature demodulation is a better choice if phase information is necessary. Bilinear and Gaussian interpolation are considered for scan conversion. Through simulations of a cyst image, we show that bilinear interpolation provides contrast-to-noise ratio (CNR) performance comparable to Gaussian interpolation and has lower computational complexity. Thus, bilinear interpolation is chosen for our system.
    Dissertation/Thesis. M.S. Electrical Engineering 201
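
    The bilinear interpolation chosen for scan conversion can be sketched as below; the mapping from display pixels back to (range, angle) beam coordinates is omitted, and the function is a generic illustration rather than the thesis implementation.

```python
import numpy as np

def bilinear_sample(img, r, c):
    # Interpolate img at fractional coordinates (r, c) from its four
    # nearest neighbors: the per-pixel core of scan conversion.
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    r1, c1 = min(r0 + 1, img.shape[0] - 1), min(c0 + 1, img.shape[1] - 1)
    dr, dc = r - r0, c - c0
    top = (1 - dc) * img[r0, c0] + dc * img[r0, c1]
    bot = (1 - dc) * img[r1, c0] + dc * img[r1, c1]
    return (1 - dr) * top + dr * bot

img = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.25))  # 8.25, between rows 1-2, cols 2-3
```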

    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3-GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
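
    The Nussinov recurrence that these accelerators parallelize is compact; the Python below is the standard O(n^3) software formulation, shown only to make the dependence structure of the dynamic programming kernel explicit.

```python
def nussinov(seq, min_loop=3):
    # N[i][j] = maximum number of base pairs in seq[i..j], with hairpin
    # loops of at least min_loop unpaired bases.
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):       # fill by increasing span
        for i in range(n - span):
            j = i + span
            best = N[i][j - 1]                # case: j unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = N[i][k - 1] if k > i else 0
                    best = max(best, left + N[k + 1][j - 1] + 1)
            N[i][j] = best
    return N[0][n - 1]

print(nussinov("GGGAAAUCC"))
```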

    Architectural explorations for streaming accelerators with customized memory layouts

    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip has helped CPUs run parts of a program in parallel. However, the huge parallelism available in many high performance applications, and in the corresponding data, is hard to exploit with these general purpose multi-cores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance, throughput-oriented devices enable fast processing of data through efficient parallel computation and streaming-based communication. Streaming accelerators, like other processors, consist of numerous micro-architectural components, including memory structures, compute units, control units, I/O channels and I/O controls. However, the throughput requirements add some special features and impose other restrictions for performance reasons. These devices normally offer a large number of compute resources, but require applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. The arrangement of data into independent sets of parallel streams is not a simple task. It may require changing the structure of an algorithm as a whole, or even rewriting the algorithm from scratch for the target application. Even then, all these efforts to rearrange an application's data access patterns may not suffice to achieve optimal performance, because of possible micro-architectural constraints of the target platform: the hardware prefetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints that a general purpose streaming platform places on data prefetching, storage and the maneuvering needed to arrange and maintain data as parallel, independent streams can be removed by micro-architectural design approaches, including the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such accelerators: i) design of application-specific accelerators with customized memory layouts, ii) template-based design support for customized memory accelerators, and iii) design space explorations for throughput-oriented devices with standard and customized memories. This thesis concludes with a conceptual proposal for a Blacksmith Streaming Architecture (BSArc).

    The Blacksmith computing model allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data-level parallelism of an application while providing a powerful, throughput-oriented back-end. We consider that the design of these specialized memory layouts for the front-end of the device is provided by application domain experts in the form of templates. These templates are adjustable to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time; however, a simulation framework supports architectural exploration, giving insight into the proposal and predicting its potential performance benefits.