11 research outputs found

    Deliverable D3.4: WP3 overall public deliverable

    Get PDF

    Architectural explorations for streaming accelerators with customized memory layouts

    Get PDF
    The basic concept behind the architecture of a general-purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip has allowed CPUs to run parts of a program in parallel. However, the enormous parallelism available in many high-performance applications, and in the corresponding data, is hard to exploit with general-purpose multicores alone. Streaming accelerators and the corresponding programming models improve on this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures responds to the ever-increasing need to process huge data sets. These throughput-oriented devices enable high-performance data processing through efficient parallel computation and streaming-based communication. Like other processors, streaming accelerators consist of numerous microarchitectural components, including memory structures, compute units, control units, and I/O channels and controllers. However, the throughput requirements add special features and impose restrictions for performance reasons. These devices normally offer a large number of compute resources but require applications to arrange their data into parallel, maximally independent sets that feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task: it may require restructuring an algorithm as a whole, or even rewriting the algorithm from scratch for the target application. Even so, all this effort to rearrange an application's data access patterns may still not be enough to achieve optimal performance. 
This is because of possible microarchitectural constraints of the target platform: the hardware prefetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. These constraints of a general-purpose streaming platform on prefetching, storing, and maneuvering data to arrange and maintain it as parallel, independent streams could be removed by employing microarchitecture-level design approaches. This includes the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations of streaming accelerators with customized memory layouts. In general, the thesis covers three main aspects of such streaming accelerators: i) design of application-specific accelerators with customized memory layouts, ii) template-based design support for customized memory accelerators, and iii) design space explorations for throughput-oriented devices with standard and customized memories. The thesis concludes with a conceptual proposal for a Blacksmith Streaming Architecture (BSArc). Blacksmith Computing allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data-level parallelism of an application while providing a powerful, throughput-oriented back-end. We consider that the specialized memory layouts for the front-end of the device are provided by application domain experts in the form of templates, adjustable to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. 
However, a simulation framework supports the architectural explorations, gives insight into the proposal, and predicts potential performance benefits for such an architecture.
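The abstract above stresses that streaming devices force applications to rearrange their data into parallel, maximally independent streams before the compute resources can be fed. As a minimal sketch of that idea (not taken from the thesis; the function name is hypothetical), the following shows the simplest form of such marshaling, deinterleaving one data set into independent round-robin streams:

```python
# Illustrative sketch only: deinterleave a data set into independent
# parallel streams, the kind of data marshaling a streaming front-end
# performs before feeding its compute lanes.

def marshal_streams(data, num_streams):
    """Split `data` round-robin into `num_streams` independent streams,
    so each compute lane can consume its stream with no cross-stream
    dependencies."""
    streams = [[] for _ in range(num_streams)]
    for i, x in enumerate(data):
        streams[i % num_streams].append(x)
    return streams

# Example: 8 samples split across 4 parallel streams.
streams = marshal_streams(list(range(8)), 4)
# streams == [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Real layouts are far more application-specific (which is the thesis's point about customized memories), but the round-robin split illustrates the independence requirement.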

    MIMO Systems

    Get PDF
    In recent years it has become clear that MIMO communication systems are all but inevitable in the accelerated evolution of high-data-rate applications, owing to their potential to dramatically increase spectral efficiency while simultaneously sending individual information to the corresponding users in wireless systems. This book intends to highlight current research topics in the field of MIMO systems and to offer a snapshot of the recent advances and major issues faced today by researchers in MIMO-related areas. The book is written by specialists working in universities and research centers all over the world, and covers the fundamental principles and main advanced topics of high-data-rate wireless communication systems over MIMO channels. Moreover, the book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.

    Design of large polyphase filters in the Quadratic Residue Number System

    Full text link

    Temperature aware power optimization for multicore floating-point units

    Full text link

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.

    Improving the Reliability of Microprocessors under BTI and TDDB Degradations

    Get PDF
    Reliability is a fundamental challenge for current and future microprocessors built with advanced nanoscale technologies. With smaller gates, thinner dielectrics, and higher temperatures, microprocessors are vulnerable to aging mechanisms such as Bias Temperature Instability (BTI) and Time-Dependent Dielectric Breakdown (TDDB). Under continuous stress, both parametric and functional errors occur, compromising microprocessor lifetime. In this thesis, based on a thorough study of the BTI and TDDB mechanisms, solutions are proposed to mitigate the aging processes in memory-based and random-logic structures of modern out-of-order microprocessors. A large area of the processor core is occupied by memory-based structures that are vulnerable to BTI-induced errors. The problem is exacerbated in high-k metal gate technology, where PBTI degradation in NMOS is as severe as NBTI in PMOS. Hence a novel design is proposed to recover 4 internal gates within an SRAM cell simultaneously, mitigating both NBTI and PBTI effects. This technique is applied in two different ways: to the L2 cache banks and to the busy functional units with storage cells in the out-of-order pipeline. For the L2 cache banks, a redundant cache bank is added exclusively for proactive recovery rotation. For the critical and busy functional units of out-of-order pipelines, idle cycles are exploited at the per-buffer-entry level. Unlike memory-based structures, combinational logic structures such as the functional units in the execution stage cannot use low-overhead redundancy to tolerate errors, owing to their irregular structure. A design framework that aims to improve the reliability of the vulnerable functional units of a processor core is designed and implemented. The approach is to design a generic functional unit (GFU) that can be reconfigured to replace a particular functional unit (FU) while that FU is being recovered for improved lifetime. Although flexible, the GFU is slower than the original target FUs. 
The GFU is therefore carefully designed to minimize the performance loss when it is in use, and further schemes avoid using the GFU on performance-critical paths of a program's execution.
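The "proactive recovery rotation" mentioned above (one redundant cache bank so that each bank can periodically be taken offline to recover) can be sketched as a simple round-robin schedule. This is an illustration of the scheduling idea only, not the thesis's implementation; the function name is hypothetical:

```python
# Illustrative sketch: proactive recovery rotation with one redundant
# cache bank. Each epoch, a different bank is taken offline to recover
# from BTI stress while the remaining banks serve all requests.

def rotation_schedule(num_banks, num_epochs):
    """Return, for each epoch, (recovering_bank, active_banks) for a
    system of num_banks logical banks plus 1 redundant physical bank."""
    total = num_banks + 1  # redundant bank makes rotation possible
    schedule = []
    for epoch in range(num_epochs):
        recovering = epoch % total
        active = [b for b in range(total) if b != recovering]
        schedule.append((recovering, active))
    return schedule

# With 3 banks plus 1 redundant: bank 0 recovers in epoch 0,
# bank 1 in epoch 1, and so on, so every bank gets regular recovery
# time while full capacity is always available.
```

The same rotation principle applies wherever a spare unit exists; the per-buffer-entry scheme for functional units exploits idle cycles instead, since no spare entry is added there.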

    Translational pipelines for closed-loop neuromodulation

    Get PDF
    Closed-loop neuromodulation systems have shown significant potential for addressing unmet needs in the treatment of disorders of the central nervous system, yet progress towards clinical adoption has been slow. Advanced technological developments often stall in the preclinical stage by failing to account for the constraints of implantable medical devices, and due to the lack of research platforms with a translational focus. This thesis presents the development of three clinically relevant research systems focusing on refinements of deep brain stimulation therapies. First, we introduce a system for synchronising implanted and external stimulation devices, allowing for research into multi-site stimulation paradigms, cross-region neural plasticity, and questions of phase coupling. The proposed design aims to sidestep the limited communication capabilities of existing commercial implant systems by providing a stimulation state readout without reliance on telemetry, creating a cross-platform research tool. Next, we present work on the Picostim-DyNeuMo adaptive neuromodulation platform, focusing on expanding device capabilities from activity and circadian adaptation to responsive stimulation based on bioelectric markers. Here, we introduce a computationally optimised implementation of a popular band-power estimation algorithm suitable for deployment in the DyNeuMo system. The new algorithmic capability was externally validated to establish neural state classification performance in two widely researched use cases: Parkinsonian beta bursts and seizures. For in vivo validation, a pilot experiment is presented demonstrating responsive neurostimulation to cortical alpha-band activity in a non-human primate model for the modulation of attention state. Finally, we turn our focus to the validation of a recently developed method that provides computationally efficient real-time phase estimation. 
Following theoretical analysis, the method is integrated into the commonly used Intan electrophysiological recording platform, creating a novel closed-loop optogenetics research platform. The performance of the research system is characterised through a pilot experiment targeting the modulation of cortical theta-band activity in a transgenic mouse model.
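The band-power estimation step described above (classifying neural state from power in a frequency band such as beta, 13-30 Hz) can be illustrated with a plain DFT-based estimator. This is a generic stand-in for exposition, not the optimised algorithm deployed on the DyNeuMo implant; the function name and band edges are illustrative:

```python
# Illustrative sketch: estimate signal power inside a frequency band
# via a direct DFT. Implant firmware would use a far cheaper
# formulation; this shows only what quantity is being computed.
import math

def band_power(samples, fs, f_lo, f_hi):
    """Sum normalised DFT power over bins whose frequency (Hz)
    falls in [f_lo, f_hi], for a real signal sampled at fs Hz."""
    n = len(samples)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(-samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            power += (re * re + im * im) / (n * n)
    return power

# A 20 Hz sine sampled at 256 Hz concentrates its power in the
# beta band (13-30 Hz) and contributes almost nothing to alpha (8-12 Hz).
fs = 256
sig = [math.sin(2 * math.pi * 20 * t / fs) for t in range(fs)]
beta = band_power(sig, fs, 13, 30)
alpha = band_power(sig, fs, 8, 12)
```

Thresholding such a band-power estimate against a baseline is the simplest form of the neural state classification the abstract refers to, e.g. flagging a beta burst when the estimate exceeds a calibrated threshold.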