8 research outputs found

    Empowering parallel computing with field programmable gate arrays

    After more than 30 years, reconfigurable computing has grown from a concept into a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array (FPGA), a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminating the instruction overhead and to optimizing execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing, and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis (HLS) support as well as better integration with processor and memory systems. On the other hand, long compile times and complex design-space exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.
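
    To illustrate how HLS lets a software programmer map an algorithm directly onto hardware, below is a minimal sketch of an HLS C++ kernel, assuming a Vitis-HLS-style toolflow; the kernel name and pragma choices are illustrative and are not taken from the paper.

```cpp
// Minimal HLS-style kernel sketch: element-wise vector addition.
// The pragmas request a pipelined datapath and AXI interfaces; they are
// typical of HLS usage, not code from the surveyed paper.
extern "C" void vadd(const int *a, const int *b, int *out, int n) {
#pragma HLS INTERFACE m_axi port=a   bundle=gmem0
#pragma HLS INTERFACE m_axi port=b   bundle=gmem1
#pragma HLS INTERFACE m_axi port=out bundle=gmem0
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return

    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1   // one result per clock cycle once the pipeline fills
        out[i] = a[i] + b[i];
    }
}
```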

    Generación de un Módulo Optimizado de Inferencia en FPGAs con HLS

    FPGAs (field-programmable gate arrays) can be used for the inference of Neural Network models in embedded systems, since this type of device offers high energy efficiency and high performance. Moreover, the ability to design hardware using High-Level Synthesis (HLS) has reduced the effort required to develop code for FPGAs. On the other hand, there are scenarios in which the complete inference cannot be performed on the FPGA, requiring a CPU device to execute the unsupported parts. This work uses the HLSinf accelerator: an open-source HLS implementation of custom accelerators for Neural Network inference on FPGA devices. Furthermore, in this project, new modules have been developed within this accelerator, and the accelerator has been integrated with the EDDL (European Distributed Deep Learning) library, which allows the execution of models on various devices.
    Medina Chaveli, L. (2021). Generación de un Módulo Optimizado de Inferencia en FPGAs con HLS. Universitat Politècnica de València. http://hdl.handle.net/10251/178237 (TFG)
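
    The mixed FPGA/CPU execution described above, where layers without an FPGA kernel fall back to the CPU, can be sketched as follows. This is a hypothetical dispatcher, not the actual HLSinf/EDDL integration; the Layer type, the supported-layer set, and the run_on_* placeholders are invented for illustration.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical layer descriptor; HLSinf and EDDL use their own internal types.
struct Layer { std::string type; /* weights, shapes, ... */ };

// Layer types assumed to have an FPGA kernel (illustrative only).
static const std::unordered_set<std::string> kFpgaSupported = {
    "conv2d", "relu", "maxpool", "add"
};

// Run each layer on the FPGA when a kernel exists for it, otherwise fall
// back to the CPU implementation, mirroring the mixed execution described above.
void run_inference(const std::vector<Layer>& model) {
    for (const auto& layer : model) {
        if (kFpgaSupported.count(layer.type)) {
            // enqueue the corresponding FPGA kernel (e.g., via OpenCL/XRT)
            // run_on_fpga(layer);   // placeholder
        } else {
            // execute the reference CPU implementation
            // run_on_cpu(layer);    // placeholder
        }
    }
}
```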

    FBLAS: Streaming Linear Algebra on FPGA

    Spatial computing architectures pose an attractive alternative to mitigate the control and data-movement overheads typical of load-store architectures. In practice, these devices are rarely considered in the HPC community due to the steep learning curve, low productivity, and lack of available libraries for fundamental operations. High-level synthesis (HLS) tools are facilitating hardware programming, but optimizing for these architectures requires factoring in new transformations and resource/performance trade-offs. We present FBLAS, an open-source HLS implementation of BLAS for FPGAs, which enables reusability, portability, and easy integration with existing software and hardware codes. FBLAS's implementation allows hardware modules to be scaled to exploit on-chip resources, and module interfaces are designed to natively support streaming on-chip communication, allowing modules to be composed to reduce off-chip communication. With FBLAS, we set a precedent for FPGA library design, and contribute to the toolbox of customizable hardware components necessary for HPC codes to start productively targeting reconfigurable platforms.
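
    The streaming, composable module design can be illustrated with an HLS-style sketch in which one routine's output stream feeds the next routine directly on chip; the function names and interfaces below are illustrative and do not reflect FBLAS' actual module interfaces.

```cpp
#include <hls_stream.h>

// Streaming SCAL stage: y[i] = alpha * x[i], consumed element by element.
void scal_stream(float alpha, hls::stream<float>& x_in,
                 hls::stream<float>& y_out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        y_out.write(alpha * x_in.read());
    }
}

// Streaming DOT stage: accumulates the product of two input streams.
void dot_stream(hls::stream<float>& x_in, hls::stream<float>& y_in,
                float& result, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc += x_in.read() * y_in.read();
    }
    result = acc;
}

// Composition: the SCAL output feeds DOT through an on-chip stream, so the
// intermediate vector never touches off-chip memory.
void scal_then_dot(float alpha, hls::stream<float>& x_in,
                   hls::stream<float>& y_in, float& result, int n) {
#pragma HLS DATAFLOW
    hls::stream<float> t("t");
    scal_stream(alpha, x_in, t, n);
    dot_stream(t, y_in, result, n);
}
```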

    Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-precision Learning (Technical Report)

    Learning from the data stored in a database is an important function increasingly available in relational engines. Methods using lower-precision input data are of special interest given their overall higher efficiency, but in databases these methods have a hidden cost: the quantization of real values into smaller numbers is an expensive step. To address this issue, in this paper we present MLWeaving, a data structure and hardware acceleration technique intended to speed up the learning of generalized linear models in databases. MLWeaving provides a compact, in-memory representation enabling the retrieval of data at any level of precision. MLWeaving also takes advantage of the increasing availability of FPGA-based accelerators to provide a highly efficient implementation of stochastic gradient descent. The solution adopted in MLWeaving is more efficient than existing designs in terms of space (since it can process any resolution on the same design) and resources (via the use of bit-serial multipliers). MLWeaving also enables the runtime tuning of precision, instead of a fixed precision level during training. We illustrate this using a simple, dynamic precision schedule. Experimental results show that MLWeaving achieves up to a 16x performance improvement over low-precision CPU implementations of first-order methods.
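
    The core idea behind the representation, storing data bit-sliced so that reading the first k bit-planes yields a k-bit approximation of every value, can be sketched in software as follows; this is a simplified illustration, not MLWeaving's actual hardware-oriented memory layout.

```cpp
#include <cstdint>
#include <vector>

// Weave 8-bit quantized features into bit-planes: plane b holds bit b of every
// value, most significant bit first. Reading only the first k planes yields a
// k-bit approximation, the property exploited for any-precision learning.
std::vector<std::vector<uint8_t>> weave(const std::vector<uint8_t>& values) {
    std::vector<std::vector<uint8_t>> planes(8, std::vector<uint8_t>(values.size()));
    for (size_t i = 0; i < values.size(); ++i)
        for (int b = 0; b < 8; ++b)
            planes[b][i] = (values[i] >> (7 - b)) & 1;
    return planes;
}

// Reconstruct value i using only the first `bits` planes (reduced precision).
uint8_t read_at_precision(const std::vector<std::vector<uint8_t>>& planes,
                          size_t i, int bits) {
    uint8_t v = 0;
    for (int b = 0; b < bits; ++b)
        v |= planes[b][i] << (7 - b);
    return v;   // low-order bits that were not read stay zero
}
```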

    FPGA Acceleration of Domain-specific Kernels via High-Level Synthesis

    The abstract is provided in the attachment.

    ACiS: smart switches with application-level acceleration

    Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network to transport data) to take on additional functionality (e.g., also to operate on that data). More generally, integration enables compute-everywhere: not just in the CPU and accelerator, but also in the network and, more specifically, in the communication switches. In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline by introducing composable plugins that successively add ACiS capabilities. In the first work, we propose an inline accelerator in communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we address this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems. In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate; functions requiring loops and state are addressed in this method. The proposed in-switch accelerator is based on a RISC-V-compatible Coarse-Grained Reconfigurable Array (CGRA). To facilitate usability, we have developed a framework to compile user-provided C/C++ code into the back-end instructions that configure the accelerator. In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations, observing that there is an opportunity to fuse communication (collectives) with computation. Since the computation can vary across applications, the ACiS support in this method is programmable. In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information, monitor application behavior, and then use this information for control, decision-making, and coordination of nodes. We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, with a Graph Convolutional Network (GCN) application as a case study, an average speedup of 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities.
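
    The first contribution, user-definable collective operations, builds on the standard MPI programming model. The sketch below shows how an application expresses a user-defined reduction through the ordinary MPI API; under ACiS such a collective would be a candidate for offload to reconfigurable logic in the switch, which is transparent to this host-side code. The absmax operation is an invented example, not one taken from the thesis.

```cpp
#include <mpi.h>
#include <cstdio>

// User-defined reduction: element-wise maximum of absolute values.
static void absmax(void* in, void* inout, int* len, MPI_Datatype*) {
    double* a = static_cast<double*>(in);
    double* b = static_cast<double*>(inout);
    for (int i = 0; i < *len; ++i) {
        double x = a[i] < 0 ? -a[i] : a[i];
        double y = b[i] < 0 ? -b[i] : b[i];
        b[i] = x > y ? x : y;
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local[4] = {1.0 * rank, -2.0 * rank, 3.0, -4.0};
    double global[4];

    // Register the user-defined operation and run the collective; an
    // ACiS-capable switch could execute this reduction in-network.
    MPI_Op op;
    MPI_Op_create(&absmax, /*commute=*/1, &op);
    MPI_Allreduce(local, global, 4, MPI_DOUBLE, op, MPI_COMM_WORLD);
    MPI_Op_free(&op);

    if (rank == 0)
        std::printf("%f %f %f %f\n", global[0], global[1], global[2], global[3]);
    MPI_Finalize();
    return 0;
}
```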

    3rd Many-core Applications Research Community (MARC) Symposium (KIT Scientific Reports; 7598)

    This manuscript collects recent scientific work on the Intel Single-Chip Cloud Computer and describes novel approaches to programming and run-time organization.