GNNBuilder: An Automated Framework for Generic Graph Neural Network Accelerator Generation, Simulation, and Optimization
Many graph neural network (GNN) accelerators have been proposed. However, they
rely heavily on users' hardware expertise and are usually optimized for one
specific GNN model, making them difficult to use in practice. Therefore, in
this work, we propose GNNBuilder, the first automated, generic,
end-to-end GNN accelerator generation framework. It features four advantages:
(1) GNNBuilder can automatically generate GNN accelerators for a wide range of
GNN models arbitrarily defined by users; (2) GNNBuilder takes the standard
PyTorch programming interface, introducing zero overhead for algorithm developers; (3)
GNNBuilder supports end-to-end code generation, simulation, accelerator
optimization, and hardware deployment, enabling push-button GNN
accelerator design; (4) GNNBuilder is equipped with accurate performance models
of its generated accelerator, enabling fast and flexible design space
exploration (DSE). In the experiments, first, we show that our accelerator
performance model has errors within for latency prediction and
for BRAM count prediction. Second, we show that our generated accelerators can
outperform CPU by and GPU by . This framework is
open-source, and the code is available at
https://anonymous.4open.science/r/gnn-builder-83B4/.
Comment: 10 pages, 7 figures, 4 tables, 3 listings
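Although GNNBuilder's actual API lives in the repository above, the claim of a standard PyTorch programming interface is easy to picture with a sketch. The model below is a hypothetical example of the kind of user-defined GNN such a framework would consume; the class names and the dense-adjacency formulation are illustrative assumptions, not GNNBuilder code.

```python
# Illustrative sketch of a user-defined GNN in plain PyTorch (hypothetical;
# not GNNBuilder's API).
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        # adj_norm: normalized dense adjacency matrix, shape [N, N]
        return torch.relu(adj_norm @ self.linear(x))

class SimpleGNN(nn.Module):
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.conv1 = SimpleGCNLayer(in_dim, hidden)
        self.conv2 = SimpleGCNLayer(hidden, hidden)
        self.readout = nn.Linear(hidden, num_classes)

    def forward(self, x, adj_norm):
        x = self.conv1(x, adj_norm)
        x = self.conv2(x, adj_norm)
        return self.readout(x.mean(dim=0))  # mean-pool node features

model = SimpleGNN(in_dim=16, hidden=32, num_classes=4)
out = model(torch.rand(10, 16), torch.eye(10))  # 10 nodes, self-loops only
```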
ACiS: smart switches with application-level acceleration
Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network to transport data) to take on additional functionality (e.g., also to operate on that data). More generally, integration enables compute-everywhere: not just in CPU and accelerator, but also in the network and, more specifically, the communication switches.
In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline through introducing composable plugins to successively add ACiS capabilities.
In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems.
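As a concrete picture of the bottleneck being offloaded, the sketch below shows an ordinary host-side MPI allreduce (here in Python via mpi4py; the buffer size is an arbitrary assumption). The proposed inline accelerator executes this reduction inside the switch instead of on the hosts.

```python
# Baseline host-side collective (the operation targeted for in-switch offload).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local = np.random.rand(1024)   # each rank's partial result
total = np.empty_like(local)
# Sum the partial results across all ranks; every rank receives the total.
comm.Allreduce(local, total, op=MPI.SUM)
```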
In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate. This method addresses functions requiring loops and state. The proposed in-switch accelerator is based on a RISC-V-compatible Coarse-Grained Reconfigurable Array (CGRA). To facilitate usability, we have developed a framework that compiles user-provided C/C++ code into the back-end instructions that configure the accelerator.
In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe an opportunity to fuse communication (collectives) with computation. Since the computation varies across applications, the ACiS support in this method must be programmable; the unfused baseline is sketched below.
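The fusion opportunity is visible in the unfused baseline: a host-side map followed by a separate collective. The mpi4py code and the scaling map below are illustrative assumptions; a fused ACiS collective would instead apply the map as elements stream through the switch.

```python
# Unfused baseline: map on the host, then a separate collective.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
x = np.random.rand(1024)

mapped = 2.0 * x                             # map step (runs on the host)
result = np.empty_like(x)
comm.Allreduce(mapped, result, op=MPI.SUM)   # collective step (moves data)
# A fused collective would compute 2.0 * x inside the switch while reducing,
# saving one full pass over the data on every host.
```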
In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes.
We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework and a Graph Convolutional Network (GCN) application as a case study, we achieve an average speedup of 3.4x across five real-world datasets on 24 nodes compared to a CPU cluster without ACiS capabilities.
Modelling and Analysis of FPGA-based MPSoC System with Multiple DNN Accelerators
Deep Neural Networks (DNNs) have been widely applied in many fields for decades, and a standard method for deploying them on embedded systems involves using accelerators. However, due to the resource constraints of embedded systems, improving energy and computing efficiency is one of the research challenges in this domain. DNN model optimization and NAS (Neural Architecture Search) are commonly used to improve the efficiency of a DNN model running on an embedded system. However, because the system's workloads vary in practical situations, further improving computing efficiency requires real-time hardware and software design space exploration, so that the system runs in an optimal state at runtime. This paper presents a comprehensive modelling and analysis approach for performance data (e.g., latency, energy consumption, accuracy, etc.) collected from an AMD-Xilinx heterogeneous MPSoC platform equipped with multiple DNN accelerators. The results demonstrate that accuracy loss, hardware performance, and model size are significantly correlated. Furthermore, an appropriate hardware and software configuration can be obtained by supplying constraints at runtime.
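As a sketch of what supplying constraints at runtime could look like, the snippet below picks, from a table of profiled configurations, the one that satisfies latency and accuracy constraints at the lowest energy. The configuration names and all numbers are made up for illustration; they are not the paper's measurements.

```python
# Hypothetical runtime configuration selection over profiled data.
configs = [
    # (name, latency_ms, energy_mJ, accuracy)
    ("accel_x1_int8", 12.0, 35.0, 0.71),
    ("accel_x2_int8",  7.5, 48.0, 0.71),
    ("accel_x2_fp16", 11.0, 60.0, 0.74),
]

def best_config(max_latency_ms, min_accuracy):
    """Return the feasible configuration with the lowest energy, if any."""
    feasible = [c for c in configs
                if c[1] <= max_latency_ms and c[3] >= min_accuracy]
    return min(feasible, key=lambda c: c[2], default=None)

print(best_config(max_latency_ms=10.0, min_accuracy=0.70))  # accel_x2_int8
```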
Novel Architectures for Offloading and Accelerating Computations in Artificial Intelligence and Big Data
Due to the end of Moore's Law and Dennard Scaling, performance gains in general-purpose architectures have significantly slowed in recent years. While raising the number of cores has been a viable approach for further performance increases, Amdahl's Law and its implications on parallelization also limit further performance gains. Consequently, research has shifted towards different approaches, including domain-specific custom architectures tailored to specific workloads.
This has led to a new golden age for computer architecture, as noted in the Turing Award Lecture by Hennessy and Patterson, which has spawned several new architectures and architectural advances specifically targeted at today's most prominent workloads, including Machine Learning. This thesis introduces a hierarchy of architectural improvements ranging from minor incremental changes, such as High-Bandwidth Memory, to more complex architectural extensions that offload workloads from the general-purpose CPU towards more specialized accelerators. Finally, we introduce novel architectural paradigms, namely Near-Data and In-Network Processing, as the most complex architectural improvements.
This cumulative dissertation then investigates several architectural improvements to accelerate Sum-Product Networks, a novel Machine Learning approach from the class of Probabilistic Graphical Models. Furthermore, we use these improvements as case studies to discuss the impact of novel architectures, showing that minor and major architectural changes can significantly increase performance in Machine Learning applications.
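For readers unfamiliar with Sum-Product Networks, the minimal sketch below evaluates a tiny SPN bottom-up: leaves are univariate distributions, product nodes multiply children over disjoint variable scopes, and sum nodes form weighted mixtures. The structure and parameters are invented for illustration and are unrelated to the dissertation's accelerated implementations.

```python
# Minimal bottom-up evaluation of a toy Sum-Product Network (illustrative).
import math

def gaussian_leaf(var, mean, std):
    """Univariate Gaussian density over input variable `var`."""
    def f(x):
        z = (x[var] - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
    return f

def product(children):
    # Product node: children cover disjoint variable scopes.
    return lambda x: math.prod(c(x) for c in children)

def weighted_sum(weights, children):
    # Sum node: weighted mixture over children with the same scope.
    return lambda x: sum(w * c(x) for w, c in zip(weights, children))

# A two-variable SPN: mixture of two independent components.
spn = weighted_sum(
    [0.3, 0.7],
    [product([gaussian_leaf("a", 0.0, 1.0), gaussian_leaf("b", 5.0, 2.0)]),
     product([gaussian_leaf("a", 3.0, 1.0), gaussian_leaf("b", 1.0, 0.5)])],
)
print(spn({"a": 1.0, "b": 2.0}))  # joint density estimate at (a=1, b=2)
```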
In addition, this thesis presents recent work on Near-Data Processing, introducing Smart Storage Devices as a novel architectural paradigm that is especially interesting in the context of Big Data. We discuss how Near-Data Processing can be applied to improve performance in different database settings by offloading database operations to smart storage devices. Offloading data-reductive operations, such as selections, reduces the amount of data transferred, thus improving performance and alleviating bandwidth-related bottlenecks.
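The data-reduction argument can be shown in toy form: both functions below compute the same selection, but the near-data variant transfers only the matching rows to the host. The table, predicate, and interfaces are hypothetical stand-ins.

```python
# Toy illustration of selection pushdown (hypothetical interfaces).
rows = [(i, i % 7) for i in range(1_000_000)]  # stand-in stored table

def host_side_select(pred):
    transferred = list(rows)               # the whole table crosses the bus
    return [r for r in transferred if pred(r)]

def near_data_select(pred):
    # The filter runs on the storage device; only selected rows move.
    return [r for r in rows if pred(r)]

matches = near_data_select(lambda r: r[1] == 0)
print(len(rows), "stored rows ->", len(matches), "rows transferred")
```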
Using Near-Data Processing as a use case, we also discuss how Machine Learning approaches, like Sum-Product Networks, can improve novel architectures. Specifically, we introduce an approach for offloading Cardinality Estimation using Sum-Product Networks that could enable more intelligent decision-making in smart storage devices. Overall, we show that Machine Learning can benefit from the development of novel architectures, while also showing that Machine Learning can be applied to improve the applications of novel architectures.
Artificial Intelligence in Image-Based Screening, Diagnostics, and Clinical Care of Cardiopulmonary Diseases
Cardiothoracic and pulmonary diseases are a significant cause of mortality and morbidity worldwide. The COVID-19 pandemic has highlighted the lack of access to clinical care, the overburdened medical system, and the potential of artificial intelligence (AI) in improving medicine. There are a variety of diseases affecting the cardiopulmonary system, including lung cancers, heart disease, tuberculosis (TB), etc., in addition to COVID-19-related diseases. Screening, diagnosis, and management of cardiopulmonary diseases have become difficult owing to the limited availability of diagnostic tools and experts, particularly in resource-limited regions. Early screening, accurate diagnosis, and staging of these diseases could play a crucial role in treatment and care, and potentially aid in reducing mortality. Radiographic imaging methods such as computed tomography (CT), chest X-rays (CXRs), and echo ultrasound (US) are widely used in screening and diagnosis. Research on using image-based AI and machine learning (ML) methods can help in rapid assessment, serve as surrogates for expert assessment, and reduce variability in human performance. In this Special Issue, “Artificial Intelligence in Image-Based Screening, Diagnostics, and Clinical Care of Cardiopulmonary Diseases”, we have highlighted exemplary primary research studies and literature reviews focusing on novel AI/ML methods and their application in image-based screening, diagnosis, and clinical management of cardiopulmonary diseases. We hope that these articles will help establish the advancements in AI.
Real-time implementation of 3D LiDAR point cloud semantic segmentation in an FPGA
Master's dissertation in Informatics Engineering

In the last few years, the automotive industry has relied heavily on deep learning applications for
perception solutions. With data-heavy sensors, such as LiDAR, becoming a standard, the task of
developing low-power, real-time applications has become increasingly challenging. To obtain
maximum computational efficiency, one can no longer focus solely on the software side of such
applications while disregarding the underlying hardware.
In this thesis, a hardware-software co-design approach is used to implement an inference application
leveraging the SqueezeSegV3, a LiDAR-based convolutional neural network, on the Versal ACAP VCK190
FPGA. Automotive requirements drive the development of the proposed solution, with real-time
performance and low power consumption as the target metrics.
A first experiment validates the suitability of Xilinx’s Vitis-AI tool for the deployment of deep
convolutional neural networks on FPGAs. Both the ResNet-18 and SqueezeNet neural networks are
deployed to the Zynq UltraScale+ MPSoC ZCU104 and Versal ACAP VCK190 FPGAs. The results show
that both networks far exceed the real-time requirements while consuming little power.
Compared to an NVIDIA RTX 3090 GPU, the performance per watt during inference of the two
networks is 12x and 47.8x higher on the Zynq UltraScale+ MPSoC ZCU104, and 15.1x and 26.6x
higher on the Versal ACAP VCK190 FPGA. These results are obtained with no drop in accuracy
in the quantization step.
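The quantization referred to here is Vitis-AI's INT8 flow. As a generic stand-in, the sketch below shows symmetric per-tensor INT8 quantization of a weight tensor and the reconstruction error it introduces; the tensor and scaling scheme are illustrative assumptions, not the tool's internals.

```python
# Generic symmetric INT8 post-training quantization sketch (not Vitis-AI).
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0              # per-tensor symmetric scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)   # stand-in weight tensor
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean absolute quantization error: {err:.5f}")
```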
A second experiment builds upon the results of the first by deploying a real-time application containing
the SqueezeSegV3 model using the Semantic-KITTI dataset. A framerate of 11 Hz is achieved with a peak
power consumption of 78 Watts. The quantization step results in a minimal accuracy and IoU degradation
of 0.7 and 1.5 points, respectively. A smaller version of the same model is also deployed, achieving a
framerate of 19 Hz and a peak power consumption of 76 Watts. The application performs semantic
segmentation over the entire point cloud with a 360° field of view.
ENERGETIC: Final Report v1.1
The ENERGETIC (ENergy-aware hEteRoGenEous compuTIng at sCale) project set out to look at whether the
use of accelerators would give significant energy savings compared to CPU-only computations, and to
determine which classes of codes should run on heterogeneous architectures. Furthermore, the project
would examine opportunities and challenges for measuring energy in a standardised, portable manner
irrespective of architectural platform.
Evaluation and implementation of an auto-encoder for compression of satellite images in the ScOSA project
This thesis evaluates the efficiency of various autoencoder neural networks for compressing satellite imagery. It presents the evaluation and implementation of autoencoder architectures and the procedures required to deploy neural networks on reliable embedded devices. The developed autoencoders are evaluated targeting a ZYNQ 7020 FPGA (Field-Programmable Gate Array) and a ZU7EV FPGA.
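As an illustration of the kind of model evaluated, below is a minimal convolutional autoencoder in PyTorch that squeezes single-channel image tiles through a narrow bottleneck. The layer sizes and the resulting 4x compression ratio are arbitrary assumptions, not the thesis's architecture.

```python
# Minimal convolutional autoencoder sketch (illustrative architecture).
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1),  # 1x64x64 -> 16x32x32
            nn.ReLU(),
            nn.Conv2d(16, 4, 3, stride=2, padding=1),  # -> 4x16x16 latent
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
tile = torch.rand(1, 1, 64, 64)   # stand-in satellite image tile
recon = model(tile)               # latent is 4096/1024 = 4x smaller than input
```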
Geometric Algorithms for Modeling Plant Roots from Images
Roots, considered the "hidden half of the plant", are essential to a plant's health and productivity. Understanding root architecture has the potential to enhance efforts towards improving crop yield. In this dissertation we develop geometric approaches to non-destructively characterize the full architecture of the root system from 3D imaging, while making computational advances in topological optimization. First, we develop a global optimization algorithm to remove topological noise, with applications in both root imaging and computer graphics. Second, we use our topology simplification algorithm, other methods from computer graphics, and customized algorithms to develop a high-throughput pipeline for computing hierarchy and fine-grained architectural traits from 3D imaging of maize roots. Finally, we develop an algorithm for consistently simplifying the topology of nested shapes, with a motivating application in temporal root system analysis. Along the way, we contribute to the computer graphics community a pair of topological simplification algorithms, one for repairing a single 3D shape and one for repairing a sequence of nested shapes.