Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconfigurable computing has grown from a concept into a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array (FPGA), a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminate instruction overhead and to optimize execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing, and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis (HLS) support as well as better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.
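To make the algorithm-to-hardware mapping concrete, here is a minimal sketch in the Vitis HLS C++ style of the kind of code such toolchains accept; the kernel name, ports, and bundle assignments are illustrative assumptions, not taken from the paper.

```cpp
// Illustrative only: a vector-add kernel written in the Vitis HLS C++ style.
// The pragmas follow AMD/Xilinx Vitis HLS conventions; kernel and port names
// are hypothetical.
extern "C" void vadd(const int *a, const int *b, int *out, int n) {
#pragma HLS INTERFACE m_axi port=a bundle=gmem0
#pragma HLS INTERFACE m_axi port=b bundle=gmem1
#pragma HLS INTERFACE m_axi port=out bundle=gmem0
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];  // one element per clock cycle once the pipeline fills
    }
}
```

The pipeline pragma is what turns the sequential loop into spatial hardware: the tool schedules a new iteration every cycle, which is the instruction-overhead elimination the paper refers to.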
FPGA dynamic and partial reconfiguration: a survey of architectures, methods, and applications
Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in the academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose and at the properties of modern commercial architectures. We then investigate design flows and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, and propose new areas where this capability places FPGAs in a unique position for adoption.
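As one concrete point of reference for deployed systems, mainline Linux exposes FPGA programming through the fpga_manager sysfs class; the host-side sketch below assumes a device at fpga0 and a hypothetical bitstream file already placed in /lib/firmware. Note that partial (as opposed to full) reconfiguration is typically coordinated through fpga_region or device-tree overlays so the kernel knows the image is partial.

```cpp
// A minimal host-side sketch: triggering FPGA (re)configuration through the
// Linux kernel's fpga_manager sysfs interface. The device index (fpga0) and
// the bitstream file name are assumptions for illustration.
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::string mgr = "/sys/class/fpga_manager/fpga0";

    // Writing an image name to 'firmware' asks the kernel to program the device.
    std::ofstream fw(mgr + "/firmware");
    if (!fw) {
        std::cerr << "no FPGA manager found at " << mgr << "\n";
        return 1;
    }
    fw << "partial_region.bit";  // hypothetical bitstream in /lib/firmware
    fw.close();

    // 'state' reports the manager's progress, e.g. "operating" on success.
    std::ifstream st(mgr + "/state");
    std::string state;
    std::getline(st, state);
    std::cout << "fpga_manager state: " << state << "\n";
    return 0;
}
```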
Venice: Exploring Server Architectures for Effective Resource Sharing
Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business alike. These servers are still largely built and organized as if they were distributed, individual entities. Given that many fields increasingly rely on analytics of huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enables user programs to efficiently leverage non-local resources.
To better understand the implications of design decisions about system support for resource sharing, we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.
Customizing the Computation Capabilities of Microprocessors
Designers of microprocessor-based systems must constantly improve performance and increase computational efficiency in their designs to create value. To this end, it is increasingly common to see computation accelerators in general-purpose processor designs. Computation accelerators collapse portions of an application's dataflow graph, reducing the critical path of computations, easing the burden on processor resources, and reducing energy consumption in systems. There are many problems associated with adding accelerators to microprocessors, though: design of accelerators, architectural integration, and software support all present major challenges.
This dissertation tackles these challenges in the context of accelerators targeting acyclic and cyclic patterns of computation. First, a technique to identify critical computation subgraphs within an application set is presented. This technique is hardware-cognizant and effectively generates a set of instruction set extensions given a domain of target applications. Next, several general-purpose accelerator structures are quantitatively designed using critical subgraph analysis for a broad application set.
The next challenge is architectural integration of accelerators. Traditionally, software invokes accelerators by statically encoding new instructions into the application binary. This is incredibly costly, though, requiring many portions of hardware and software to be redesigned. This dissertation develops strategies to utilize accelerators without changing the instruction set. In the proposed approach, the microarchitecture translates applications at run-time, replacing computation subgraphs with microcode to utilize accelerators. We explore the tradeoffs in performing difficult aspects of the translation at compile-time, while retaining run-time replacement. This culminates in a simple microarchitectural interface that supports a plug-and-play model for integrating accelerators into a pre-designed microprocessor.
Software support is the last challenge in dealing with computation accelerators. The primary issue is the difficulty of generating high-quality code that utilizes accelerators. Hand-written assembly code is standard in industry, and where compiler support does exist, simple greedy algorithms are common. In this work, we investigate more thorough techniques for compiling for computation accelerators. Where greedy heuristics explore only one possible solution, the techniques in this dissertation explore the entire design space, when possible. Intelligent pruning methods ensure that compilation is both tractable and scalable.
Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/57633/2/ntclark_1.pd
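As a rough illustration of the contrast between greedy and exhaustive selection, the C++ sketch below enumerates subsets of non-overlapping candidate subgraphs with branch-and-bound pruning; the Candidate structure and its benefit model are hypothetical stand-ins, not the dissertation's actual formulation.

```cpp
// Branch-and-bound selection of non-overlapping accelerator subgraphs.
// A greedy heuristic would commit to one choice per step; this search
// explores the whole space but prunes branches whose optimistic bound
// cannot beat the best solution found so far.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Candidate {
    uint64_t nodes;   // bitmask of dataflow-graph nodes this subgraph covers
    int benefit;      // estimated cycles saved if mapped to the accelerator
};

static int best = 0;

void search(const std::vector<Candidate>& cs, size_t i, uint64_t used,
            int gain, int remaining) {
    best = std::max(best, gain);
    if (i == cs.size() || gain + remaining <= best) return;  // prune
    int rest = remaining - cs[i].benefit;
    if ((used & cs[i].nodes) == 0)  // take candidate i if it overlaps nothing chosen
        search(cs, i + 1, used | cs[i].nodes, gain + cs[i].benefit, rest);
    search(cs, i + 1, used, gain, rest);  // or skip it
}

int bestCover(const std::vector<Candidate>& cs) {
    int total = 0;
    for (const auto& c : cs) total += c.benefit;
    best = 0;
    search(cs, 0, 0, 0, total);
    return best;  // maximum total benefit over all valid selections
}
```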
Application of novel technologies for the development of next generation MR compatible PET inserts
Multimodal imaging integrating Positron Emission Tomography and Magnetic Resonance Imaging (PET/MRI) has clear advantages compared to other available combinations, allowing both functional and structural information to be acquired with very high precision and repeatability. However, it has yet to be adopted as the standard for experimental and clinical applications, for a variety of reasons mainly related to system cost and flexibility. A promising existing approach, silicon photodetector-based MR-compatible PET inserts comprising very thin PET devices that can be placed in the MRI bore, has been pioneered, but has not disrupted the market as expected. Technological solutions that can make this type of insert lighter, more cost-effective, and more adaptable to the application need to be researched further.
In this context, we expand the study of sub-surface laser engraving (SSLE) for scintillators used in PET. After acquiring, measuring, and calibrating an SSLE setup, we study the effect of different engraving configurations on the detection characteristics of the scintillation light by the photosensors. We demonstrate that, apart from cost-effectiveness and ease of application, SSLE-treated scintillators have similar spatial resolution and superior sensitivity and packing fraction compared to standard pixelated arrays, allowing shorter crystals to be used. Flexibility of design is benchmarked, and the adoption of a honeycomb architecture is proposed for its geometrical advantages. Furthermore, a variety of depth-of-interaction (DoI) designs are engraved and studied, greatly enhancing applicability in small field-of-view tomographs, such as the intended inserts. To adapt to this need, a novel approach for multi-layer DoI characterization has been developed and is demonstrated.
Apart from crystal treatment, considerations on signal transmission and processing are addressed. A double time-over-threshold (ToT) method is proposed, using the statistics of noise to enhance precision. This method is tested, and linearity results demonstrate its applicability to multiplexed readout designs. A study of analog optical wireless communication (aOWC) techniques is also performed and proof-of-concept results are presented. Finally, a ToT readout firmware architecture, intended for low-cost FPGAs, has been developed and is described.
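A toy numeric sketch of the double-threshold idea follows: the width of a pulse above each of two thresholds is combined into one energy estimate. The pulse samples, thresholds, and weights are hypothetical, and the thesis's actual method additionally exploits noise statistics rather than a fixed linear combination.

```cpp
// Toy illustration of a double time-over-threshold (ToT) measurement.
#include <iostream>
#include <vector>

// Width (in samples) of the interval where the pulse exceeds 'thr'.
static int widthOver(const std::vector<double>& pulse, double thr) {
    int first = -1, last = -1;
    for (int i = 0; i < (int)pulse.size(); ++i)
        if (pulse[i] > thr) { if (first < 0) first = i; last = i; }
    return first < 0 ? 0 : last - first + 1;
}

int main() {
    // A hypothetical digitized scintillation pulse.
    std::vector<double> pulse = {0, 2, 7, 14, 18, 16, 11, 6, 3, 1, 0};
    double lowThr = 3.0, highThr = 10.0;

    int totLow  = widthOver(pulse, lowThr);
    int totHigh = widthOver(pulse, highThr);

    // Two widths constrain the pulse shape better than one, which is what
    // linearizes the ToT-vs-energy response; weights would come from calibration.
    double energyEstimate = 0.6 * totLow + 0.4 * totHigh;
    std::cout << "ToT(low)=" << totLow << " ToT(high)=" << totHigh
              << " energy~" << energyEstimate << "\n";
}
```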
By addressing the potential development, applicability, and merits of a range of transdisciplinary solutions, we demonstrate that with these techniques it is possible to construct lighter, smaller, lower-consumption, cost-effective MRI-compatible PET inserts. Such designs can help make PET/MRI multimodality the dominant clinical and experimental imaging approach, enhancing researchers' and physicians' insight into the mysteries of life.
Programa Oficial de Doctorado en Ingeniería Eléctrica, Electrónica y Automática. Committee: President: Andrés Santos Lleó; Secretary: Luis Hernández Corporales; Member: Giancarlo Sportell
Optimizing the Use of Behavioral Locking for High-Level Synthesis
The globalization of the electronics supply chain requires effective methods to thwart reverse engineering and IP theft. Logic locking is a promising solution, but there are many open concerns. First, even when applied at a higher level of abstraction, locking may result in significant overhead without improving the security metric. Second, optimizing a security metric is application-dependent, and designers must evaluate and compare alternative solutions. We propose a meta-framework to optimize the use of behavioral locking during the high-level synthesis (HLS) of IP cores. Our method operates on the chip's specification (before HLS) and is compatible with all HLS tools, complementing industrial EDA flows. Our meta-framework supports different strategies to explore the design space and to select the points to be locked automatically. We evaluated our method on the optimization of differential entropy, achieving better results than random or topological locking: 1) we always identify a valid solution that optimizes the security metric, while topological and random locking can generate infeasible solutions; 2) we reduce the number of bits used for locking by more than 90% in the best case (requiring smaller tamper-proof memories); 3) we make better use of hardware resources, obtaining similar overheads with a higher security metric.
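For intuition, one simple strategy such a meta-framework could host is budgeted selection of locking points; the C++ sketch below is a hypothetical stand-in with made-up Op fields and a greedy ratio heuristic, not the paper's actual search procedure or its differential-entropy computation.

```cpp
// Hypothetical locking-point selection: pick operations to lock under a
// key-bit budget, greedily maximizing security-metric gain per key bit.
#include <algorithm>
#include <vector>

struct Op {
    int keyBits;        // tamper-proof key bits this locking point consumes
    double metricGain;  // estimated increase in the security metric
};

std::vector<int> selectLockingPoints(const std::vector<Op>& ops, int keyBudget) {
    std::vector<int> idx(ops.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
    // Highest metric gain per key bit first.
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return ops[a].metricGain / ops[a].keyBits >
               ops[b].metricGain / ops[b].keyBits;
    });
    std::vector<int> chosen;
    for (int i : idx)
        if (ops[i].keyBits <= keyBudget) {
            keyBudget -= ops[i].keyBits;
            chosen.push_back(i);  // lock this operation before HLS
        }
    return chosen;
}
```

A meta-framework in the paper's sense would treat this heuristic as one interchangeable strategy among several (random, topological, exhaustive) and compare the resulting security metric after HLS.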
Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory (HBM) technologies, but when applications are memory-bound, designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon among domain experts.
In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example.
Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to produce systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming to fully exploit the available CPU-FPGA bandwidth.
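As an illustration of the kind of compute unit such a flow might emit, the HLS-style sketch below reads each stream from its own AXI bundle, which vendor link-time options can map to separate HBM pseudo-channels so that the streams are fetched in parallel; the CFD-flavored stencil, kernel shape, and names are assumptions, not output of the actual tool flow.

```cpp
// Illustrative HLS-style compute unit: a 3-point 1D stencil with input and
// output on separate AXI bundles (mappable to distinct HBM pseudo-channels).
extern "C" void stencil1d(const float *in, float *out, int n) {
#pragma HLS INTERFACE m_axi port=in bundle=hbm0
#pragma HLS INTERFACE m_axi port=out bundle=hbm1
    for (int i = 1; i < n - 1; ++i) {
#pragma HLS PIPELINE II=1
        // Weighted average of each element and its neighbors, one result per cycle.
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}
```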
We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy-efficient than expert-crafted Intel CPU implementations.
A Reconfigurable and Extensible Exploration Platform for Future Heterogeneous Systems
Accelerator-based, or heterogeneous, computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While some solutions use custom-made components, most of today's systems rely on commodity high-end CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware suffers from inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered a primary metric. The many-core model and architectural customization can play a key role here, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable.
This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and the network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts (sketched below). The experimental results show the benefits of using both a customizable hardware bank-remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master suits many-cores better than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes provide indications about the area overheads incurred by our solution and demonstrate the benefits of dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system that allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance.
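To give a flavor of what a configurable bank-remapping function can look like, the sketch below contrasts plain modulo banking with an XOR-based hash that spreads power-of-two strides across banks; the bank count and shift amounts are hypothetical parameters, not the dissertation's actual remapping function.

```cpp
// Two scratchpad bank-selection functions. With plain modulo banking,
// accesses whose stride is a multiple of the bank count all collide on one
// bank; the XOR hash folds higher index bits into the selector to avoid this.
#include <cstdint>

constexpr unsigned kBanks    = 16;  // power of two
constexpr unsigned kLineBits = 2;   // word offset within a bank line

// Plain modulo banking: stride-16 word accesses all map to the same bank.
inline unsigned bankModulo(uint32_t addr) {
    return (addr >> kLineBits) & (kBanks - 1);
}

// XOR remapping: power-of-two strides get spread across different banks.
inline unsigned bankXor(uint32_t addr) {
    uint32_t idx = addr >> kLineBits;
    return (idx ^ (idx >> 4) ^ (idx >> 8)) & (kBanks - 1);
}
```

Making the hash itself configurable (for example, letting software choose the shift amounts) lets the remapping be tuned to each algorithm's access pattern, which is the property evaluated above.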
The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which, together with the methodological results above, forms part of this dissertation's contribution. As a key benefit, the experimental platform enables users to integrate novel hardware/software solutions at full-system scale, whereas existing platforms do not always support comprehensive heterogeneous architecture exploration.