3,297 research outputs found
Recommended from our members
Design and performance optimization of asynchronous networks-on-chip
As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems.
In the past two decades, networks-on-chip (NoC’s) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoC’s, there have been bottlenecks of system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combing the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoC’s, therefore, enable a modular and extensible system composition for an ‘object-orient’ design style.
The thesis aims to significantly advance the state-of-art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoC’s are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions.
First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors. Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called ‘monitoring network’, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories – a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains.
Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow.
Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoC’s for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel.
In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement
Recommended from our members
On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation
In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems in the range of 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever increasing complexity, the traditional design approaches are facing key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active, do not require complex clock distribution, are robust to different forms of variability, and provide ease of composability for heterogeneous platforms. Networks-on-chip (NoCs) is an interconnect paradigm that has been introduced to deal with the ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for today’s many-core systems. Moreover, NoCs are a natural match with asynchronous design techniques, as they separate communication infrastructure and timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems that interconnect multiple processing cores, operating at different clock speeds, using an asynchronous NoC, have gained significant interest. While asynchronous NoCs have several advantages, they also face a key challenge of supporting new types of traffic patterns. Once such pattern is multicast communication, where a source sends packets to arbitrary number of destinations. Multicast is not only common in parallel computing, such as for cache coherency, but also for emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs. This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques, and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for a more realistic implementation and analysis, as well as significantly advancing the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there has been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computeraided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs on FPGAs. Overall, this thesis makes the following three contributions. The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches, and has been used in high-performance computing platforms. A novel local speculation technique is introduced, where a subset of the network’s switches are speculative that always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches. The second contribution is a multicast solution for a 2D-mesh topology, which is more complex with higher-radix switches and also is more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each output’s own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination. Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that only handles unicast, and the other that also supports multicast. Both showed significant energy benefits with some performance gains over a state-of-the-art synchronous switch
A novel ultrasonic strain gauge for single-sided measurement of a local 3D strain field
A novel method is introduced for the measurement of a 3D strain field by exploiting the interaction between ultrasound waves and geometrical characteristics of the insonified specimen. First, the response of obliquely incident harmonic waves to a deterministic surface roughness is utilized. Analysis of backscattered amplitudes in Bragg diffraction geometry then yields a measure for the in-plane strain field by mapping any shift in angular dependency. Secondly, the analysis of the reflection characteristics of normal incident pulsed waves in frequency domain provides a measure of the out-of-plane normal strain field component, simply by tracking any change in the stimulation condition for a thickness resonance. As such, the developed ultrasonic strain gauge yields an absolute, contactless and single-sided mapping of a local 3D strain field, in which both sample preparation and alignment procedure are needless. Results are presented for cold-rolled DC06 steel samples onto which skin passing of the work rolls is applied. The samples have been mechanically loaded, introducing plastic strain levels ranging from 2% up to 35%. The ultrasonically measured strains have been validated with various other strain measurement techniques, including manual micrometer, longitudinal and transverse mechanical extensometer and optical mono- and stereovision digital image correlation. Good agreement has been obtained between the ultrasonically determined strain values and the results of the conventional methods. As the ultrasonic strain gauge provides all three normal strain field components, it has been employed for the extraction of Lankford ratios at different applied longitudinal plastic strain levels, revealing a strain dependent plastic anisotropy of the investigated DC06 steel sheet
Nuclear Power Plants
This book covers various topics, from thermal-hydraulic analysis to the safety analysis of nuclear power plant. It does not focus only on current power plant issues. Instead, it aims to address the challenging ideas that can be implemented in and used for the development of future nuclear power plants. This book will take the readers into the world of innovative research and development of future plants. Find your interests inside this book
Coprocessor integration for real-time event processing in particle physics detectors
Els experiments de fÃsica d’altes energies actuals disposen d’acceleradors amb més energÃa, sensors més precisos i formes més flexibles de recopilar les dades. Aquesta rà pida evolució requereix de més capacitat de cà lcul; els processadors massivament paral·lels, com ara les targes acceleradores grà fiques, ens posen a l’abast aquesta major capacitat de cà lcul a un cost sensiblement inferior a les CPUs tradicionals. L’ús d’aquest tipus de processadors requereix, però, de nous algoritmes i nous enfocaments de l’organització de les dades que són difÃcils d’integrar en els programaris actuals.
En aquest treball s’exploren els problemes derivats de l’ús d’algoritmes paral·lels en els entorns de programari existents, orientats a CPUs, i es proposa una solució, en forma de servei, que comunica amb els diversos pipelines que processen els esdeveniments procedents de les col·lisions de partÃcules, recull les dades en lots i els envia als algoritmes corrent sobre els processadors massivament paral·lels.
Aquest servei s’integra en Gaudà - l’entorn de software de dos dels quatre experiments principals del Gran Col·lisionador d’Hadrons. S’examina el sobrecost que el servei afegeix als algoritmes paral·lels. S’estudia un cas d´ùs del servei per fer una reconstrucció paral·lela de les traces detectades en el VELO Pixel, el subdetector encarregat de la detecció de vèrtex en l’upgrade de LHCb. Per aquest cas, s’observen les caracterÃstiques del rendiment en funció de la mida dels lots de dades. Finalment, les conclusions en posen en el context dels requeriments del sistema de trigger de LHCb.La fÃsica de altas energÃas dispone actualmente de aceleradores con energÃas mayores, sensores más precisos y métodos de recopilación de datos más flexibles que nunca. Su rápido progreso necesita aún más potencia de cálculo; el hardware masivamente paralelo, como las unidades de procesamiento gráfico, nos brinda esta potencia a un coste mucho más bajo que las CPUs tradicionales. Sin embargo, para usar eficientemente este hardware necesitamos algoritmos nuevos y nuevos enfoques de organización de datos difÃciles de integrarse con el software existente.
En este trabajo, se investiga cómo se pueden usar estos algoritmos paralelos en las infraestructuras de software ya existentes y que están orientadas a CPUs. Se propone una solución en forma de un servicio que comunica con los diversos pipelines que procesan los eventos de las correspondientes colisiones de particulas, reúne los datos en lotes y se los entrega a los algoritmos paralelos acelerados por hardware.
Este servicio se integra con Gaudà — la infraestructura del entorno de software que usan dos de los cuatro gran experimentos del Gran Colisionador de Hadrones. Se examinan los costes añadidos por el servicio en los algoritmos paralelos. Se estudia un caso de uso del servicio para ejecutar un algoritmo paralelo para el VELO Pixel (el subdetector encargado de la localización de vértices en el upgrade del experimento LHCb) y se estudian las caracterÃsticas de rendimiento de los distintos tamaños de lotes de datos. Finalmente, las conclusiones se contextualizan dentro la perspectiva de los requerimientos para el sistema de trigger de LHCb.High-energy physics experiments today have higher energies, more accurate sensors, and more flexible means of data collection than ever before. Their rapid progress requires ever more computational power; and massively parallel hardware, such as graphics cards, holds the promise to provide this power at a much lower cost than traditional CPUs. Yet, using this hardware requires new algorithms and new approaches to organizing data that can be difficult to integrate with existing software.
In this work, I explore the problem of using parallel algorithms within existing CPU-orientated frameworks and propose a compromise between the different trade-offs. The solution is a service that communicates with multiple event-processing pipelines, gathers data into batches, and submits them to hardware-accelerated parallel algorithms.
I integrate this service with Gaudi — a framework underlying the software environments of two of the four major experiments at the Large Hadron Collider. I examine the overhead the service adds to parallel algorithms. I perform a case study of using the service to run a parallel track reconstruction algorithm for the LHCb experiment's prospective VELO Pixel subdetector and look at the performance characteristics of using different data batch sizes. Finally, I put the findings into perspective within the context of the LHCb trigger's requirements
- …