656 research outputs found

    High Performance Computing via High Level Synthesis

    Get PDF
    As more and more powerful integrated circuits are appearing on the market, more and more applications, with very different requirements and workloads, are making use of the available computing power. This thesis is in particular devoted to High Performance Computing applications, where those trends are carried to the extreme. In this domain, the primary aspects to be taken into consideration are (1) performance (by definition) and (2) energy consumption (since operational costs dominate over procurement costs). These requirements can be satisfied more easily by deploying heterogeneous platforms, which include CPUs, GPUs and FPGAs to provide a broad range of performance and energy-per-operation choices. In particular, as we will see, FPGAs clearly dominate both CPUs and GPUs in terms of energy, and can provide comparable performance. An important aspect of this trend is of course design technology, because these applications were traditionally programmed in high-level languages, while FPGAs required low-level RTL design. The OpenCL (Open Computing Language) developed by the Khronos group enables developers to program CPU, GPU and recently FPGAs using functionally portable (but sadly not performance portable) source code which creates new possibilities and challenges both for research and industry. FPGAs have been always used for mid-size designs and ASIC prototyping thanks to their energy efficient and flexible hardware architecture, but their usage requires hardware design knowledge and laborious design cycles. Several approaches are developed and deployed to address this issue and shorten the gap between software and hardware in FPGA design flow, in order to enable FPGAs to capture a larger portion of the hardware acceleration market in data centers. Moreover, FPGAs usage in data centers is growing already, regardless of and in addition to their use as computational accelerators, because they can be used as high performance, low power and secure switches inside data-centers. High-Level Synthesis (HLS) is the methodology that enables designers to map their applications on FPGAs (and ASICs). It synthesizes parallel hardware from a model originally written C-based programming languages .e.g. C/C++, SystemC and OpenCL. Design space exploration of the variety of implementations that can be obtained from this C model is possible through wide range of optimization techniques and directives, e.g. to pipeline loops and partition memories into multiple banks, which guide RTL generation toward application dependent hardware and benefit designers from flexible parallel architecture of FPGAs. Model Based Design (MBD) is a high-level and visual process used to generate implementations that solve mathematical problems through a varied set of IP-blocks. MBD enables developers with different expertise, e.g. control theory, embedded software development, and hardware design to share a common design framework and contribute to a shared design using the same tool. Simulink, developed by MATLAB, is a model based design tool for simulation and development of complex dynamical systems. Moreover, Simulink embedded code generators can produce verified C/C++ and HDL code from the graphical model. This code can be used to program micro-controllers and FPGAs. This PhD thesis work presents a study using automatic code generator of Simulink to target Xilinx FPGAs using both HDL and C/C++ code to demonstrate capabilities and challenges of high-level synthesis process. To do so, firstly, digital signal processing unit of a real-time radar application is developed using Simulink blocks. Secondly, generated C based model was used for high level synthesis process and finally the implementation cost of HLS is compared to traditional HDL synthesis using Xilinx tool chain. Alternative to model based design approach, this work also presents an analysis on FPGA programming via high-level synthesis techniques for computationally intensive algorithms and demonstrates the importance of HLS by comparing performance-per-watt of GPUs(NVIDIA) and FPGAs(Xilinx) manufactured in the same node running standard OpenCL benchmarks. We conclude that generation of high quality RTL from OpenCL model requires stronger hardware background with respect to the MBD approach, however, the availability of a fast and broad design space exploration ability and portability of the OpenCL code, e.g. to CPUs and GPUs, motivates FPGA industry leaders to provide users with OpenCL software development environment which promises FPGA programming in CPU/GPU-like fashion. Our experiments, through extensive design space exploration(DSE), suggest that FPGAs have higher performance-per-watt with respect to two high-end GPUs manufactured in the same technology(28 nm). Moreover, FPGAs with more available resources and using a more modern process (20 nm) can outperform the tested GPUs while consuming much less power at the cost of more expensive devices

    DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

    Get PDF
    Integrated data analysis (IDA) pipelines—that combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring—become increasingly common in practice. Interestingly, systems of these areas share many compilation and runtime techniques, and the used—increasingly heterogeneous—hardware infrastructure converges as well. Yet, the programming paradigms, cluster resource management, data formats and representations, as well as execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage for increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines, describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, as well as local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results

    Real-time implementation of 3D LiDAR point cloud semantic segmentation in an FPGA

    Get PDF
    Dissertação de mestrado em Informatics EngineeringIn the last few years, the automotive industry has relied heavily on deep learning applications for perception solutions. With data-heavy sensors, such as LiDAR, becoming a standard, the task of developing low-power and real-time applications has become increasingly more challenging. To obtain the maximum computational efficiency, no longer can one focus solely on the software aspect of such applications, while disregarding the underlying hardware. In this thesis, a hardware-software co-design approach is used to implement an inference application leveraging the SqueezeSegV3, a LiDAR-based convolutional neural network, on the Versal ACAP VCK190 FPGA. Automotive requirements carefully drive the development of the proposed solution, with real-time performance and low power consumption being the target metrics. A first experiment validates the suitability of Xilinx’s Vitis-AI tool for the deployment of deep convolutional neural networks on FPGAs. Both the ResNet-18 and SqueezeNet neural networks are deployed to the Zynq UltraScale+ MPSoC ZCU104 and Versal ACAP VCK190 FPGAs. The results show that both networks achieve far more than the real-time requirements while consuming low power. Compared to an NVIDIA RTX 3090 GPU, the performance per watt during both network’s inference is 12x and 47.8x higher and 15.1x and 26.6x higher respectively for the Zynq UltraScale+ MPSoC ZCU104 and the Versal ACAP VCK190 FPGA. These results are obtained with no drop in accuracy in the quantization step. A second experiment builds upon the results of the first by deploying a real-time application containing the SqueezeSegV3 model using the Semantic-KITTI dataset. A framerate of 11 Hz is achieved with a peak power consumption of 78 Watts. The quantization step results in a minimal accuracy and IoU degradation of 0.7 and 1.5 points respectively. A smaller version of the same model is also deployed achieving a framerate of 19 Hz and a peak power consumption of 76 Watts. The application performs semantic segmentation over all the point cloud with a field of view of 360°.Nos últimos anos a indústria automóvel tem cada vez mais aplicado deep learning para solucionar problemas de perceção. Dado que os sensores que produzem grandes quantidades de dados, como o LiDAR, se têm tornado standard, a tarefa de desenvolver aplicações de baixo consumo energético e com capacidades de reagir em tempo real tem-se tornado cada vez mais desafiante. Para obter a máxima eficiência computacional, deixou de ser possível focar-se apenas no software aquando do desenvolvimento de uma aplicação deixando de lado o hardware subjacente. Nesta tese, uma abordagem de desenvolvimento simultâneo de hardware e software é usada para implementar uma aplicação de inferência usando o SqueezeSegV3, uma rede neuronal convolucional profunda, na FPGA Versal ACAP VCK190. São os requisitos automotive que guiam o desenvolvimento da solução proposta, sendo a performance em tempo real e o baixo consumo energético, as métricas alvo principais. Uma primeira experiência valida a aptidão da ferramenta Vitis-AI para a implantação de redes neuronais convolucionais profundas em FPGAs. As redes ResNet-18 e SqueezeNet são ambas implantadas nas FPGAs Zynq UltraScale+ MPSoC ZCU104 e Versal ACAP VCK190. Os resultados mostram que ambas as redes ultrapassam os requisitos de tempo real consumindo pouca energia. Comparado com a GPU NVIDIA RTX 3090, a performance por Watt durante a inferência de ambas as redes é superior em 12x e 47.8x e 15.1x e 26.6x respetivamente na Zynq UltraScale+ MPSoC ZCU104 e na Versal ACAP VCK190. Estes resultados foram obtidos sem qualquer perda de accuracy na etapa de quantização. Uma segunda experiência é feita no seguimento dos resultados da primeira, implantando uma aplicação de inferência em tempo real contendo o modelo SqueezeSegV3 e usando o conjunto de dados Semantic-KITTI. Um framerate de 11 Hz é atingido com um pico de consumo energético de 78 Watts. O processo de quantização resulta numa perda mínima de accuracy e IoU com valores de 0.7 e 1.5 pontos respetivamente. Uma versão mais pequena do mesmo modelo é também implantada, atingindo uma framerate de 19 Hz e um pico de consumo energético de 76 Watts. A aplicação desenvolvida executa segmentação semântica sobre a totalidade das nuvens de pontos LiDAR, com um campo de visão de 360°

    Design Space Exploration of DNNs for Autonomous Systems

    Get PDF
    Indiana University-Purdue University Indianapolis (IUPUI)Developing intelligent agents that can perceive and understand the rich visualworld around us has been a long-standing goal in the field of AI. Recently, asignificant progress has been made by the CNNs/DNNs to the incredible advances& in a wide range of applications such as ADAS, intelligent cameras surveillance,autonomous systems, drones, & robots. Design space exploration (DSE) of NNs andother techniques have made CNN/DNN memory & computationally efficient. Butthe major design hurdles for deployment are limited resources such as computation,memory, energy efficiency, and power budget. DSE of small DNN architectures forADAS emerged with better and efficient architectures such as baseline SqueezeNetand SqueezeNext. These architectures are exclusively known for their small modelsize, good model speed & model accuracy.In this thesis study, two new DNN architectures are proposed. Before diving intothe proposed architectures, DSE of DNNs explores the methods to improveDNNs/CNNs.Further, understanding the different hyperparameters tuning &experimenting with various optimizers and newly introduced methodologies. First,High Performance SqueezeNext architecture ameliorate the performance of existingDNN architectures. The intuition behind this proposed architecture is to supplantconvolution layers with a more sophisticated block module & to develop a compactand efficient architecture with a competitive accuracy. Second, Shallow SqueezeNextarchitecture is proposed which achieves better model size results in comparison tobaseline SqueezeNet and SqueezeNext is presented. It illustrates the architecture is xviicompact, efficient and flexible in terms of model size and accuracy.Thestate-of-the-art SqueezeNext baseline and SqueezeNext baseline are used as thefoundation to recreate and propose the both DNN architectures in this study. Dueto very small model size with competitive model accuracy and decent model testingspeed it is expected to perform well on the ADAS systems.The proposedarchitectures are trained and tested from scratch on CIFAR-10 [30] & CIFAR-100[34] datasets. All the training and testing results are visualized with live loss andaccuracy graphs by using livelossplot. In the last, both of the proposed DNNarchitectures are deployed on BlueBox2.0 by NXP

    Sharing GPUs for Real-Time Autonomous-Driving Systems

    Get PDF
    Autonomous vehicles at mass-market scales are on the horizon. Cameras are the least expensive among common sensor types and can preserve features such as color and texture that other sensors cannot. Therefore, realizing full autonomy in vehicles at a reasonable cost is expected to entail computer-vision techniques. These computer-vision applications require massive parallelism provided by the underlying shared accelerators, such as graphics processing units, or GPUs, to function “in real time.” However, when computer-vision researchers and GPU vendors refer to “real time,” they usually mean “real fast”; in contrast, certifiable automotive systems must be “real time” in the sense of being predictable. This dissertation addresses the challenging problem of how GPUs can be shared predictably and efficiently for real-time autonomous-driving systems. We tackle this challenge in four steps. First, we investigate NVIDIA GPUs with respect to scheduling, synchronization, and execution. We conduct an extensive set of experiments to infer NVIDIA GPU scheduling rules, which are unfortunately undisclosed by NVIDIA and are beyond access owing to their closed-source software stack. We also expose a list of pitfalls pertaining to CPU-GPU synchronization that can result in unbounded response times of GPU-using applications. Lastly, we examine a fundamental trade-off for designing real-time tasks under different execution options. Overall, our investigation provides an essential understanding of NVIDIA GPUs, allowing us to further model and analyze GPU tasks. Second, we develop a new model and conduct schedulability analysis for GPU tasks. We extend the well-studied sporadic task model with additional parameters that characterize the parallel execution of GPU tasks. We show that NVIDIA scheduling rules are subject to fundamental capacity loss, which implies a necessary total utilization bound. We derive response-time bounds for GPU task systems that satisfy our schedulability conditions. Third, we address an industrial challenge of supplying the throughput performance of computer-vision frameworks to support adequate coverage and redundancy offered by an array of cameras. We re-think the design of convolution neural network (CNN) software to better utilize hardware resources and achieve increased throughput (number of simultaneous camera streams) without any appreciable increase in per-frame latency (camera to CNN output) or reduction of per-stream accuracy. Fourth, we apply our analysis to a finer-grained graph scheduling of a computer-vision standard, OpenVX, which explicitly targets embedded and real-time systems. We evaluate both the analytical and empirical real-time performance of our approach.Doctor of Philosoph

    Towards a Common Software/Hardware Methodology for Future Advanced Driver Assistance Systems

    Get PDF
    The European research project DESERVE (DEvelopment platform for Safe and Efficient dRiVE, 2012-2015) had the aim of designing and developing a platform tool to cope with the continuously increasing complexity and the simultaneous need to reduce cost for future embedded Advanced Driver Assistance Systems (ADAS). For this purpose, the DESERVE platform profits from cross-domain software reuse, standardization of automotive software component interfaces, and easy but safety-compliant integration of heterogeneous modules. This enables the development of a new generation of ADAS applications, which challengingly combine different functions, sensors, actuators, hardware platforms, and Human Machine Interfaces (HMI). This book presents the different results of the DESERVE project concerning the ADAS development platform, test case functions, and validation and evaluation of different approaches. The reader is invited to substantiate the content of this book with the deliverables published during the DESERVE project. Technical topics discussed in this book include:Modern ADAS development platforms;Design space exploration;Driving modelling;Video-based and Radar-based ADAS functions;HMI for ADAS;Vehicle-hardware-in-the-loop validation system

    Real-time multi-domain optimization controller for multi-motor electric vehicles using automotive-suitable methods and heterogeneous embedded platforms

    Get PDF
    Los capítulos 2,3 y 7 están sujetos a confidencialidad por el autor. 145 p.In this Thesis, an elaborate control solution combining Machine Learning and Soft Computing techniques has been developed, targeting a chal lenging vehicle dynamics application aiming to optimize the torque distribution across the wheels with four independent electric motors.The technological context that has motivated this research brings together potential -and challenges- from multiple dom ains: new automotive powertrain topologies with increased degrees of freedom and controllability, which can be approached with innovative Machine Learning algorithm concepts, being implementable by exploiting the computational capacity of modern heterogeneous embedded platforms and automated toolchains. The complex relations among these three domains that enable the potential for great enhancements, do contrast with the fourth domain in this context: challenging constraints brought by industrial aspects and safe ty regulations. The innovative control architecture that has been conce ived combines Neural Networks as Virtual Sensor for unmeasurable forces , with a multi-objective optimization function driven by Fuzzy Logic , which defines priorities basing on the real -time driving situation. The fundamental principle is to enhance vehicle dynamics by implementing a Torque Vectoring controller that prevents wheel slip using the inputs provided by the Neural Network. Complementary optimization objectives are effici ency, thermal stress and smoothness. Safety -critical concerns are addressed through architectural and functional measures.Two main phases can be identified across the activities and milestones achieved in this work. In a first phase, a baseline Torque Vectoring controller was implemented on an embedded platform and -benefiting from a seamless transition using Hardware-in -the -Loop - it was integrated into a real Motor -in -Wheel vehicle for race track tests. Having validated the concept, framework, methodology and models, a second simulation-based phase proceeds to develop the more sophisticated controller, targeting a more capable vehicle, leading to the final solution of this work. Besides, this concept was further evolved to support a joint research work which lead to outstanding FPGA and GPU based embedded implementations of Neural Networks. Ultimately, the different building blocks that compose this work have shown results that have met or exceeded the expectations, both on technical and conceptual level. The highly non-linear multi-variable (and multi-objective) control problem was tackled. Neural Network estimations are accurate, performance metrics in general -and vehicle dynamics and efficiency in particular- are clearly improved, Fuzzy Logic and optimization behave as expected, and efficient embedded implementation is shown to be viable. Consequently, the proposed control concept -and the surrounding solutions and enablers- have proven their qualities in what respects to functionality, performance, implementability and industry suitability.The most relevant contributions to be highlighted are firstly each of the algorithms and functions that are implemented in the controller solutions and , ultimately, the whole control concept itself with the architectural approaches it involves. Besides multiple enablers which are exploitable for future work have been provided, as well as an illustrative insight into the intricacies of a vivid technological context, showcasing how they can be harmonized. Furthermore, multiple international activities in both academic and professional contexts -which have provided enrichment as well as acknowledgement, for this work-, have led to several publications, two high-impact journal papers and collateral work products of diverse nature
    corecore