
    Empowering parallel computing with field programmable gate arrays

    After more than 30 years, reconfigurable computing has grown from a concept into a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array (FPGA), a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminating instruction overhead and to optimizing execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools that enable software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing, and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis (HLS) support, as well as better integration with processor and memory systems. On the other hand, long compile times and complex design-space exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.
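    The direct mapping of algorithms onto hardware mentioned above is what HLS toolchains automate. As a minimal sketch of the idea (the pragmas follow AMD/Xilinx Vitis HLS conventions; the kernel itself is a hypothetical example, not taken from the paper), a plain C++ loop annotated with directives becomes a pipelined datapath with its own memory ports:

        // saxpy.cpp - hypothetical Vitis HLS kernel: out[i] = a*x[i] + y[i].
        // The INTERFACE pragmas expose the arrays as AXI master ports to
        // off-chip memory; PIPELINE asks the tool to start one loop
        // iteration per clock cycle, turning the loop body into a
        // pipelined multiply-add datapath.
        constexpr int N = 1024;

        void saxpy(float a, const float x[N], const float y[N], float out[N]) {
        #pragma HLS INTERFACE m_axi port=x offset=slave bundle=gmem0
        #pragma HLS INTERFACE m_axi port=y offset=slave bundle=gmem1
        #pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
            for (int i = 0; i < N; ++i) {
        #pragma HLS PIPELINE II=1
                out[i] = a * x[i] + y[i];
            }
        }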

    FPGA dynamic and partial reconfiguration: a survey of architectures, methods, and applications

    Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in the academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose and at the properties of modern commercial architectures. We then investigate design flows and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, and we propose new areas where this capability places FPGAs in a unique position for adoption.

    Venice: Exploring Server Architectures for Effective Resource Sharing

    Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business alike. These servers are still largely built and organized as they were when servers were distributed, individual entities. Given that many fields increasingly rely on analytics of huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enables user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing, we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.

    Customizing the Computation Capabilities of Microprocessors.

    Designers of microprocessor-based systems must constantly improve performance and increase computational efficiency in their designs to create value. To this end, it is increasingly common to see computation accelerators in general-purpose processor designs. Computation accelerators collapse portions of an application's dataflow graph, reducing the critical path of computations, easing the burden on processor resources, and reducing energy consumption in systems. There are many problems associated with adding accelerators to microprocessors, though. Design of accelerators, architectural integration, and software support all present major challenges. This dissertation tackles these challenges in the context of accelerators targeting acyclic and cyclic patterns of computation. First, a technique to identify critical computation subgraphs within an application set is presented. This technique is hardware-cognizant and effectively generates a set of instruction set extensions given a domain of target applications. Next, several general-purpose accelerator structures are quantitatively designed using critical subgraph analysis for a broad application set. The next challenge is architectural integration of accelerators. Traditionally, software invokes accelerators by statically encoding new instructions into the application binary. This is incredibly costly, though, requiring many portions of hardware and software to be redesigned. This dissertation develops strategies to utilize accelerators without changing the instruction set. In the proposed approach, the microarchitecture translates applications at run-time, replacing computation subgraphs with microcode to utilize accelerators. We explore the tradeoffs in performing difficult aspects of the translation at compile-time, while retaining run-time replacement. This culminates in a simple microarchitectural interface that supports a plug-and-play model for integrating accelerators into a pre-designed microprocessor. Software support is the last challenge in dealing with computation accelerators. The primary issue is difficulty in generating high-quality code utilizing accelerators. Hand-written assembly code is standard in industry, and if compiler support does exist, simple greedy algorithms are common. In this work, we investigate more thorough techniques for compiling for computation accelerators. Where greedy heuristics only explore one possible solution, the techniques in this dissertation explore the entire design space, when possible. Intelligent pruning methods ensure that compilation is both tractable and scalable.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/57633/2/ntclark_1.pd
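    The subgraph-identification step described above can be pictured as enumerating connected regions of an application's dataflow graph and discarding those no accelerator could implement. A minimal sketch of the enumeration (brute force over bitmasks; the feasibility filter and graph encoding are illustrative simplifications, not the dissertation's hardware-cognizant algorithm):

        // subgraphs.cpp - brute-force enumeration of connected subgraphs of
        // a small dataflow graph, the starting point for identifying
        // accelerator candidates.  Real identification flows prune with a
        // hardware cost model; exhaustive search is only viable for small
        // graphs (n <= ~20 nodes).  Requires C++20 for <bit>.
        #include <bit>
        #include <cstdint>
        #include <vector>

        // True if the node set encoded in `mask` induces a connected subgraph.
        bool connected(const std::vector<std::vector<int>>& adj, uint32_t mask) {
            uint32_t seen = 1u << std::countr_zero(mask);
            uint32_t frontier = seen;
            while (frontier != 0) {
                int v = std::countr_zero(frontier);
                frontier &= frontier - 1;  // pop one node from the frontier
                for (int u : adj[v])
                    if ((mask >> u & 1u) && !(seen >> u & 1u)) {
                        seen |= 1u << u;
                        frontier |= 1u << u;
                    }
            }
            return seen == mask;
        }

        // All connected subgraphs with at most max_ops nodes; a real flow
        // would further filter by what the accelerator datapath can realize
        // (operation mix, number of inputs and outputs, and so on).
        std::vector<uint32_t> candidates(const std::vector<std::vector<int>>& adj,
                                         int max_ops) {
            std::vector<uint32_t> out;
            const int n = static_cast<int>(adj.size());
            for (uint32_t mask = 1; mask < (1u << n); ++mask)
                if (std::popcount(mask) <= max_ops && connected(adj, mask))
                    out.push_back(mask);
            return out;
        }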

    Application of novel technologies for the development of next generation MR compatible PET inserts

    Multimodal imaging integrating Positron Emission Tomography and Magnetic Resonance Imaging (PET/MRI) has distinct advantages compared to other available combinations, allowing both functional and structural information to be acquired with very high precision and repeatability. However, it has yet to be adopted as the standard for experimental and clinical applications, for a variety of reasons mainly related to system cost and flexibility. A promising approach, MR-compatible PET inserts based on silicon photodetectors and comprising very thin PET devices that can be placed inside the MRI bore, has been pioneered, but it has not disrupted the market as expected. Technological solutions that can make this type of insert lighter, more cost-effective, and more adaptable to the application need to be researched further. In this context, we expand the study of sub-surface laser engraving (SSLE) for scintillators used in PET. After acquiring, measuring, and calibrating an SSLE setup, we study the effect of different engraving configurations on the detection characteristics of the scintillation light by the photosensors. We demonstrate that, apart from cost-effectiveness and ease of application, SSLE-treated scintillators have similar spatial resolution and superior sensitivity and packing fraction compared to standard pixelated arrays, allowing shorter crystals to be used. Flexibility of design is benchmarked, and the adoption of a honeycomb architecture is proposed for its geometrical advantages. Furthermore, a variety of depth-of-interaction (DoI) designs are engraved and studied, greatly enhancing applicability in small field-of-view tomographs, such as preclinical and dedicated (head or breast) systems and the intended inserts. To adapt to this need, a novel approach for multi-layer DoI characterization has been developed and is demonstrated. Apart from crystal treatment, considerations on signal transmission and processing are addressed. A double time-over-threshold (ToT) method is proposed, using the statistics of noise to enhance precision. This method is tested, and linearity results demonstrate its applicability to multiplexed readout designs. A study on analog optical wireless communication (aOWC) techniques, intended to transmit detector signals from the gantry to external processing electronics, is also performed, and proof-of-concept results are presented. Finally, a ToT readout firmware architecture, intended for low-cost FPGAs, has been developed and is described. By addressing the potential development, applicability, and merits of a range of transdisciplinary solutions, we demonstrate that with these techniques it is possible to construct lighter, smaller, lower-consumption, cost-effective MRI-compatible PET inserts. Such designs can make PET/MRI multimodality the dominant clinical and experimental imaging approach, enhancing researchers' and physicians' insight into the mysteries of life.
    Official Doctoral Program in Electrical, Electronic and Automation Engineering. Thesis committee: Andrés Santos Lleó (President), Luis Hernández Corporales (Secretary), Giancarlo Sportell (Vocal)
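    The double time-over-threshold idea admits a small numeric illustration: the time a pulse spends above each of two thresholds gives two width measurements whose combination tracks pulse energy more linearly than a single width. The exponential pulse model, thresholds, and combination below are illustrative assumptions, not the thesis's actual detector calibration:

        // tot_demo.cpp - numeric illustration of a double time-over-threshold
        // measurement on a hypothetical exponential detector pulse.
        #include <cmath>
        #include <iostream>
        #include <vector>

        // Number of samples during which the waveform exceeds `threshold`.
        double timeOverThreshold(const std::vector<double>& pulse, double threshold) {
            double width = 0.0;
            for (double s : pulse)
                if (s > threshold) width += 1.0;
            return width;
        }

        int main() {
            const double tau = 40.0;  // pulse decay constant, in samples
            const double amplitudes[] = {200.0, 400.0, 800.0, 1600.0};
            for (double A : amplitudes) {
                // Hypothetical pulse shape v(t) = A * exp(-t / tau).
                std::vector<double> pulse;
                for (int t = 0; t < 512; ++t)
                    pulse.push_back(A * std::exp(-t / tau));

                // For this shape each width grows like tau * ln(A / th), so
                // the two widths together give a quantity linear in ln(A);
                // a calibration fit would map it back to pulse energy.
                double w1 = timeOverThreshold(pulse, 50.0);
                double w2 = timeOverThreshold(pulse, 100.0);
                std::cout << "A=" << A << "  ToT@50=" << w1
                          << "  ToT@100=" << w2 << "  sum=" << (w1 + w2) << "\n";
            }
        }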

    Optimizing the Use of Behavioral Locking for High-Level Synthesis

    The globalization of the electronics supply chain requires effective methods to thwart reverse engineering and IP theft. Logic locking is a promising solution, but there are many open concerns. First, even when applied at a higher level of abstraction, locking may result in significant overhead without improving the security metric. Second, optimizing a security metric is application-dependent, and designers must evaluate and compare alternative solutions. We propose a meta-framework to optimize the use of behavioral locking during the high-level synthesis (HLS) of IP cores. Our method operates on the chip's specification (before HLS) and is compatible with all HLS tools, complementing industrial EDA flows. Our meta-framework supports different strategies to explore the design space and to select the points to be locked automatically. We evaluated our method on the optimization of differential entropy, achieving better results than random or topological locking: 1) we always identify a valid solution that optimizes the security metric, while topological and random locking can generate unfeasible solutions; 2) we reduce the number of bits used for locking by up to more than 90% (requiring smaller tamper-proof memories); 3) we make better use of hardware resources, since we obtain similar overheads but with a higher security metric.
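    The kind of exploration such a meta-framework automates can be pictured as a selection problem: choose locking points that maximize a security metric under a key-bit budget. The greedy strategy below is a minimal sketch of just one possible strategy; the structure, the static scores, and all names are hypothetical, and a real framework would re-evaluate the actual metric (e.g., differential entropy) per candidate set:

        // lock_select.cpp - hypothetical greedy design-space exploration for
        // behavioral locking under a tamper-proof key-bit budget.
        #include <algorithm>
        #include <iostream>
        #include <vector>

        // A candidate locking point in the behavioral description.
        struct LockPoint {
            const char* name;   // operation or branch to lock (hypothetical)
            int keyBits;        // key bits it consumes
            double metricGain;  // stand-in score for the metric improvement
        };

        // Greedily pick points with the best metric-per-key-bit ratio until
        // the budget is exhausted.
        std::vector<LockPoint> selectPoints(std::vector<LockPoint> cands, int budget) {
            std::sort(cands.begin(), cands.end(),
                      [](const LockPoint& a, const LockPoint& b) {
                          return a.metricGain / a.keyBits > b.metricGain / b.keyBits;
                      });
            std::vector<LockPoint> chosen;
            for (const auto& c : cands)
                if (c.keyBits <= budget) {
                    chosen.push_back(c);
                    budget -= c.keyBits;
                }
            return chosen;
        }

        int main() {
            std::vector<LockPoint> cands = {
                {"mul_loop", 16, 4.2}, {"branch_cond", 4, 1.9}, {"const_fold", 8, 2.0}};
            for (const auto& p : selectPoints(cands, 20))
                std::cout << "lock " << p.name << " (" << p.keyBits << " key bits)\n";
        }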

    Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

    Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory (HBM) technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming to fully exploit the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations.
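    To give a flavor of what such a flow generates: a stencil-style kernel whose memory interface is widened so that each beat from an HBM pseudo-channel feeds several parallel compute lanes. The hand-written sketch below is an illustrative assumption in HLS-style C++ (kernel, port widths, and pragmas are not the output of the paper's MLIR flow):

        // hbm_stencil.cpp - sketch of a wide-interface stencil kernel: one
        // 512-bit HBM beat (16 floats) in and out per clock cycle.
        struct Wide { float lane[16]; };

        constexpr int N = 1 << 20;  // elements per tensor
        constexpr int W = 16;       // lanes per wide word

        // out[i] = (in[i-1] + in[i] + in[i+1]) / 3, a 1D 3-point stencil.
        // PIPELINE keeps one wide word per clock in flight; UNROLL turns
        // the inner loop into W parallel adders.  Halo handling at the
        // array edges is simplified for brevity.
        void stencil(const Wide* in, Wide* out) {
        #pragma HLS INTERFACE m_axi port=in offset=slave bundle=hbm0
        #pragma HLS INTERFACE m_axi port=out offset=slave bundle=hbm1
            Wide prev = in[0];
            for (int i = 0; i < N / W; ++i) {
        #pragma HLS PIPELINE II=1
                Wide cur = in[i];
                Wide next = (i + 1 < N / W) ? in[i + 1] : cur;
                Wide res;
                for (int l = 0; l < W; ++l) {
        #pragma HLS UNROLL
                    float left = (l == 0) ? prev.lane[W - 1] : cur.lane[l - 1];
                    float right = (l == W - 1) ? next.lane[0] : cur.lane[l + 1];
                    res.lane[l] = (left + cur.lane[l] + right) / 3.0f;
                }
                out[i] = res;
                prev = cur;
            }
        }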

    A Reconfigurable and Extensible Exploration Platform for Future Heterogeneous Systems

    Accelerator-based (or heterogeneous) computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While some solutions use custom-made components, most of today's systems rely on commodity high-end CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered a primary metric. The many-core model and architectural customization can play a key role here, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature, and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and the network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of both a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master suits many-cores better than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes indicate the area overheads incurred by our solution and demonstrate the benefits of dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system that allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism, along with an extended coherence protocol, can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which, together with the above methodological results, is part of this dissertation's contribution. As a key benefit, the experimental platform enables users to integrate novel hardware/software solutions at full-system scale, whereas existing platforms do not always support comprehensive heterogeneous architecture exploration.
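    The configurable bank remapping mentioned above can be pictured as a small hash between an address and its bank index, so that strided access patterns stop hitting the same bank. The XOR-based remapping below is a common textbook choice used here purely for illustration, not necessarily the dissertation's actual remapping function:

        // bank_remap.cpp - compare a naive modulo bank mapping against an
        // XOR-based remapping for a worst-case power-of-two stride.
        #include <cstdint>
        #include <iostream>
        #include <vector>

        constexpr int kBanks = 8;  // scratchpad banks (power of two)

        // Naive mapping: bank = word address mod #banks.  A stride equal to
        // the bank count makes every access fall into bank 0.
        int naiveBank(uint32_t addr) { return addr % kBanks; }

        // XOR-based remapping: fold higher address bits into the bank
        // index, spreading power-of-two strides across all banks.
        int xorBank(uint32_t addr) { return (addr ^ (addr / kBanks)) % kBanks; }

        int main() {
            // Count how many of 64 accesses with stride 8 land in each bank.
            std::vector<int> naiveHits(kBanks, 0), xorHits(kBanks, 0);
            for (uint32_t i = 0; i < 64; ++i) {
                uint32_t addr = i * kBanks;  // worst-case stride
                ++naiveHits[naiveBank(addr)];
                ++xorHits[xorBank(addr)];
            }
            for (int b = 0; b < kBanks; ++b)
                std::cout << "bank " << b << ": naive=" << naiveHits[b]
                          << " xor=" << xorHits[b] << "\n";
        }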