174 research outputs found

    High-level power optimisation for Digital Signal Processing in Reconfigurable Logic

    This thesis is concerned with optimising Digital Signal Processing (DSP) algorithm implementations on reconfigurable hardware by selecting appropriate word-lengths for the signals in these algorithms, in order to minimise system power consumption. Whilst existing word-length optimisation work has concentrated on minimising the area of algorithm implementations, this work introduces the first set of power consumption models that can be evaluated quickly enough to be used within the search of the enormous design space of multiple word-length optimisation problems. These models achieve their speed by estimating both the power consumed within the arithmetic components of an algorithm and the power in the routing wires that connect these components, using only a high-level description of the algorithm itself. Trading off a small reduction in power model accuracy for a large increase in speed is one of the major contributions of this thesis. In addition to the work on power consumption modelling, this thesis also develops a new technique for selecting the appropriate word-lengths for an algorithm implementation in order to minimise its cost in terms of power (or some other metric for which models are available). The method is able to provide tight lower and upper bounds on the optimal cost that can be obtained for a particular word-length optimisation problem and can, as a result, find provably near-optimal solutions without resorting to an NP-hard search of the design space. Finally, the costs of systems optimised via the proposed technique are compared with those obtainable by word-length optimisation for other metrics (such as logic area), providing greater insight into the nature of word-length optimisation problems and the extent of the improvements obtainable by them.

    Hybrid DDS-PLL based reconfigurable oscillators with high spectral purity for cognitive radio

    Analytical, design and simulation studies on the performance optimization of the reconfigurable architecture of a Hybrid DDS-PLL are presented in this thesis. The original contributions concern the DDS, the dithering (spur suppression) scheme and the PLL. A new design of a Taylor series-based DDS that reduces the dynamic power and the number of multipliers is a significant contribution of this thesis. The thesis compares the dynamic power and SFDR achieved by several DDS designs: Quartic, Cubic, Linear and LHSC. It proposes two novel schemes, namely "Hartley Image Suppression" and "Adaptive Sinusoidal Interference Cancellation", overcoming the low noise floor of traditional dithering schemes. Simulation studies on a Taylor series-based DDS reveal an improvement in SFDR from 74 dB to 114 dB when a Least Mean Squares-Sinusoidal Interference Canceller (LM-SIC) is used, with the noise floor maintained at -200 dB. Analytical formulations have been developed for a second order PLL to relate the phase noise to settling time and Phase Margin (PM), as well as to relate jitter variance and PM. New expressions relating phase noise to PM and lock time to PM are derived. For a third order PLL, the thesis derives the analytical relationship between the roots of the characteristic equation and performance metrics such as PM, Gardner's stability factor, jitter variance, spur gain and the ratio of noise power to carrier power. It also presents an analysis relating spur gain to the capacitance ratio, and an analytical relationship between the lock time and the roots of the characteristic equation. Through Vieta's circle and Vieta's angle, the performance metrics of a third order PLL are related to the real roots of its characteristic equation.
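    The Taylor series idea named in this abstract can be illustrated with a small sketch. It is not the design from the thesis: the accumulator width, table size, tuning word and all function names below are assumptions chosen for illustration. A coarse sine/cosine table is indexed by the top bits of a phase accumulator, and the remaining low bits drive a truncated Taylor expansion around the table entry.

```python
import math

# All parameters below are illustrative assumptions, not values from the thesis.
ACC_BITS = 16                       # phase accumulator width
TABLE_BITS = 6                      # address width of the coarse lookup table
FRAC_BITS = ACC_BITS - TABLE_BITS   # low bits left over as residual phase
TABLE_SIZE = 1 << TABLE_BITS
STEP = 2 * math.pi / TABLE_SIZE     # phase spacing between table entries (radians)

SIN_TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE)]
COS_TABLE = [math.cos(i * STEP) for i in range(TABLE_SIZE)]


def dds_sample(phase_word, order):
    """Approximate sin(phase) from a coarse table plus Taylor terms up to `order` (0, 1 or 2)."""
    index = phase_word >> FRAC_BITS
    frac = phase_word & ((1 << FRAC_BITS) - 1)
    delta = frac / (1 << FRAC_BITS) * STEP   # residual phase in radians
    s, c = SIN_TABLE[index], COS_TABLE[index]
    y = s                                    # order 0: table value only
    if order >= 1:
        y += c * delta                       # first-order term: cos(a) * delta
    if order >= 2:
        y -= s * delta * delta / 2           # second-order term: -sin(a) * delta^2 / 2
    return y


def max_error(order, tuning_word=1237, n=4096):
    """Worst absolute error against math.sin over n accumulator steps."""
    phase, worst = 0, 0.0
    for _ in range(n):
        exact = math.sin(2 * math.pi * phase / (1 << ACC_BITS))
        worst = max(worst, abs(dds_sample(phase, order) - exact))
        phase = (phase + tuning_word) & ((1 << ACC_BITS) - 1)
    return worst


if __name__ == "__main__":
    for order in (0, 1, 2):
        print(order, max_error(order))
```

    Each extra Taylor term shrinks the worst-case error bound by roughly a factor of the table step, which is why a small table with a few correction terms can stand in for a much larger table. The sketch uses floating point throughout and says nothing about power or multiplier count.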

    Integrated Microwave Photonic Processors using Waveguide Mesh Cores

    Integrated microwave photonics changes the scaling laws of information and communication systems, offering architectural choices that combine photonics with electronics to optimize performance, power, footprint and cost. Application Specific Photonic Integrated Circuits, where particular circuits/chips are designed to optimally perform particular functionalities, require a considerable number of design and fabrication iterations, leading to long development times and costly implementations. A different approach, inspired by electronic Field Programmable Gate Arrays, is the programmable Microwave Photonic processor, where common hardware implemented by combining microwave, photonic and electronic subsystems realizes different functionalities through programming. Here, we propose the first-ever generic-purpose Microwave Photonic processor concept and architecture. This versatile processor requires a powerful end-to-end field-based analytical model to optimally configure all its subsystems as well as to evaluate their performance in terms of radiofrequency gain, noise and dynamic range. Therefore, we develop a generic model for integrated Microwave Photonics systems. The key element of the processor is the reconfigurable optical core. It requires high flexibility and versatility to enable reconfigurable interconnections between subsystems as well as the synthesis of photonic integrated circuits. For this element, we focus on a 2-dimensional photonic waveguide mesh based on the interconnection of tunable couplers. Within the framework of this thesis, we have proposed two novel interconnection schemes, aiming for a mesh design with a high level of versatility. Focusing on the hexagonal waveguide mesh, we explore the synthesis of a wide variety of photonic integrated circuits and particular Microwave Photonics applications that can potentially be performed on a single piece of hardware. In addition, we report the first-ever demonstration of such a reconfigurable waveguide mesh in silicon. We demonstrate a world-record number of functionalities on a single photonic integrated circuit, enabling over 30 different functionalities from the 100 that could potentially be obtained with a simple structure of seven hexagonal cells. The resulting device can be applied to different fields including communications, chemical and biomedical sensing, signal processing, multiprocessor networks and quantum information systems. Our work is an important step towards this paradigm and sets the basis for a new era of generic-purpose photonic integrated systems. Pérez López, D. (2017). Integrated Microwave Photonic Processors using Waveguide Mesh Cores [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/91232

    Survey of FPGA applications in the period 2000 – 2015 (Technical Report)

    Romoth J, Porrmann M, Rückert U. Survey of FPGA applications in the period 2000 – 2015 (Technical Report); 2017. Since their introduction, FPGAs have appeared in more and more fields of application. Their key advantage is the combination of software-like flexibility with performance otherwise typical of hardware. Nevertheless, every application field imposes special requirements on the computational architecture used. This paper provides an overview of the topics FPGAs have been used for in the last 15 years of research and why they have been chosen over other processing units such as CPUs.

    Discrete Wavelet Transforms

    Discrete wavelet transform (DWT) algorithms have a firm position in signal processing in several areas of research and industry. As the DWT provides both octave-scale frequency and spatial timing of the analyzed signal, it is constantly used to solve and treat more and more advanced problems. The present book, Discrete Wavelet Transforms: Algorithms and Applications, reviews recent progress in discrete wavelet transform algorithms and applications. The book covers a wide range of methods (e.g. lifting, shift invariance, multi-scale analysis) for constructing DWTs. The chapters are organized into four major parts. Part I describes progress in hardware implementations of DWT algorithms. Applications include multitone modulation for ADSL and equalization techniques, a scalable architecture for FPGA implementation, a lifting-based algorithm for VLSI implementation, a comparison between DWT-based and FFT-based OFDM, and a modified SPIHT codec. Part II addresses image processing algorithms such as a multiresolution approach to edge detection, low bit rate image compression, low complexity implementation of CQF wavelets and compression of multi-component images. Part III focuses on watermarking DWT algorithms. Finally, Part IV describes shift-invariant DWTs, the DC lossless property, DWT-based analysis and estimation of colored noise, and an application of the wavelet Galerkin method. The chapters consist of both tutorial and highly advanced material. The book is therefore intended as a reference text for graduate students and researchers who want state-of-the-art knowledge of specific applications.
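    As a small illustration of the lifting method mentioned in this abstract, the following is a minimal sketch of one level of the integer Haar transform written as a predict step followed by an update step. It is a textbook construction and is not taken from the book; the function names are hypothetical.

```python
def haar_lifting_forward(x):
    """One level of the integer Haar transform via lifting (predict, then update)."""
    if len(x) % 2:
        raise ValueError("expected an even number of samples")
    even, odd = x[0::2], x[1::2]
    # Predict: each odd sample is predicted by its even neighbour; keep the residual.
    detail = [o - e for e, o in zip(even, odd)]
    # Update: add floor(detail / 2) so approx equals the floored pairwise mean.
    approx = [e + (d >> 1) for e, d in zip(even, detail)]
    return approx, detail


def haar_lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order with the signs flipped."""
    even = [a - (d >> 1) for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    x = [0] * (2 * len(even))
    x[0::2] = even
    x[1::2] = odd
    return x


if __name__ == "__main__":
    samples = [5, 7, 3, 1, 8, 8, 0, 9]
    a, d = haar_lifting_forward(samples)
    print(a, d)
    print(haar_lifting_inverse(a, d) == samples)
```

    Because every lifting step is inverted exactly by subtracting what was added, the integer transform reconstructs its input without rounding loss, which is one reason lifting is attractive for hardware implementations.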

    Architectural explorations for streaming accelerators with customized memory layouts

    The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip has helped CPUs run parts of a program in parallel. However, the huge parallelism available in many high performance applications and their data is hard to exploit with these general purpose multi-cores alone. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the ever increasing need to process huge data sets. These throughput oriented devices achieve high performance data processing through efficient parallel computation and streaming based communication. Throughput oriented streaming accelerators, like other processors, consist of numerous types of micro-architectural components, including memory structures, compute units, control units, I/O channels and I/O controls. However, the throughput requirements add some special features and impose other restrictions that affect performance. These devices normally offer a large number of compute resources but require applications to arrange their data into parallel and maximally independent sets that feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task. It may require changing the structure of an algorithm as a whole, or even writing a new algorithm from scratch for the target application. Even so, all these efforts to rearrange application data access patterns may still not be enough to achieve optimal performance. This is because of possible micro-architectural constraints of the target platform on the hardware pre-fetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints of a general purpose streaming platform on pre-fetching, storing and maneuvering data to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the use of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. In general, the thesis covers three main aspects of such accelerators: i) Design of Application Specific Accelerators with Customized Memory Layout, ii) Template Based Design Support for Customized Memory Accelerators, and iii) Design Space Explorations for Throughput Oriented Devices with Standard and Customized Memories. The thesis concludes with a conceptual proposal for a Blacksmith Streaming Architecture (BSArc). Blacksmith Computing allows the hardware-level adoption of an application specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data level parallelism of an application while providing a powerful throughput oriented back-end. We consider that the designs of these specialized memory layouts for the front-end of the device are provided by application domain experts in the form of templates. These templates are adjustable according to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. However, a simulation framework helps in architectural explorations, giving insight into the proposal and predicting the potential performance benefits of such an architecture. Postprint (published version)

    Static and reconfigurable devices for near-field and far-field terahertz applications

    Terahertz frequency electromagnetic radiation has attracted growing interest from the scientific and technological communities in the last 30 years, due to its capability to penetrate common materials, such as paper, fabrics, or some plastics, and to offer information on a length scale between 100 µm and 1 mm. Moreover, terahertz radiation can be employed for wireless communications, because it is able to sustain terabit-per-second wireless links, opening up the possibility of a new generation of data networks. However, the terahertz band is a challenging range of the electromagnetic spectrum in terms of technological development, and it falls between microwave and optical techniques. Even though this so-called "terahertz gap" is progressively narrowing, the demand for efficient terahertz sources and detectors, as well as passive components for the management of terahertz radiation, is still high. In fact, novel strategies are currently under investigation, aiming at improving the performance of terahertz devices and, at the same time, at reducing their structural complexity and fabrication costs. In this PhD work, two classes of devices are studied, one for near-field focusing and one for far-field radiation with high directivity. Some solutions for their practical implementation are presented. The first class encompasses several configurations of diffractive lenses for focusing terahertz radiation. A configuration for a terahertz diffractive lens is proposed, numerically optimized, and experimentally evaluated. It shows a better resolution than a standard configuration. Moreover, this lens is investigated with regard to the possibility of developing terahertz diffractive lenses with a tunable focal length by means of electro-optical control. Preliminary numerical data show a dual-focus capability at terahertz frequencies. The second class encompasses advanced radiating systems for controlling the far-field radiating features at terahertz frequencies. These are designed by means of the formalism of leaky-wave theory. Specifically, the use of an electro-optical material is considered for the design of a leaky-wave antenna operating in the terahertz range, achieving very promising results in terms of reconfigurability, efficiency, and radiating capabilities. Furthermore, different metasurface topologies are studied. Their analytical and numerical investigation reveals a high directivity in radiating performance. Directions for the fabrication and experimental testing of the proposed radiating structures at terahertz frequencies are addressed.
