787 research outputs found
BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs
Recent studies have shown that Binary Graph Neural Networks (GNNs) are
promising for saving computations of GNNs through binarized tensors. Prior
work, however, mainly focused on algorithm designs or training techniques,
leaving open how to fully realize the performance potential on accelerator
hardware. This work redesigns the binary GNN inference backend from the
efficiency perspective. It fills the gap by proposing a series of abstractions
and techniques that map binary GNNs and their computations to the nature
of bit manipulations on GPUs. Results on real-world graphs with GCNs,
GraphSAGE, and GraphSAINT show that the proposed techniques outperform
state-of-the-art binary GNN implementations by 8-22X with the same accuracy
maintained. BitGNN code is publicly available. Comment: To appear in the International Conference on Supercomputing (ICS'23)
LIPIcs, Volume 261, ICALP 2023, Complete Volume
Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and the inefficient utilisation of computational resources by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs to propose candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost.
Simultaneously, DOPpler breaks long-held assumptions about serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target devices.
On Parallelization of Categorical Data Clustering
We study parallelization of categorical data clustering algorithms on an MPI platform. Clustering such data has been a daunting task even for sequential algorithms, mainly due to the challenges in finding suitable similarity/distance measures. We propose a parallel version of the k-modes algorithm, called PV3, which maintains the same clustering quality as the sequential approach while achieving reasonable speed-ups. PV3 is programmed to ensure deterministic processing in a parallel environment. To produce better clustering results, we then develop an initialization method called Revised Density Method (RDM), based on the notion of density. Additionally, we develop variants of the RDM method to further enhance its performance. We then study effective ways to parallelize RDM and its variants. To further exploit parallelism opportunities, we develop an Ensemble Parallelizing Process (EPP) framework. This framework can be used with any desired initialization/clustering algorithms with different levels of parallelism. Using our different RDM initialization techniques along with the PV3 algorithm in the EPP framework, we then build an RDM realization of EPP, called RDM EPP. The results of our numerous experiments using benchmark categorical datasets indicate that the quality metric of RDM EPP is among the top three sequential k-modes based clustering algorithms. In terms of speed-up, the results indicate that RDM EPP is 7 times faster for some datasets, though much larger datasets are required for a more comprehensive scalability study of RDM EPP.
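For orientation, the sequential k-modes baseline that the abstract refers to can be sketched as follows. This is a minimal illustration, not the PV3, RDM or EPP code: it uses no MPI, it assumes simple Hamming (mismatch) distance, and the names `hamming` and `k_modes` are hypothetical. Ties are broken by lowest cluster index and first-seen category, so repeated runs give the same result.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(records, initial_modes, max_iter=100):
    """Plain sequential k-modes: assign each record to its nearest mode,
    then recompute each mode as the per-attribute most frequent category."""
    modes = [tuple(m) for m in initial_modes]
    for _ in range(max_iter):
        clusters = [[] for _ in modes]
        for rec in records:
            # min() returns the lowest index on ties, keeping runs deterministic.
            nearest = min(range(len(modes)), key=lambda j: hamming(rec, modes[j]))
            clusters[nearest].append(rec)
        new_modes = []
        for old_mode, members in zip(modes, clusters):
            if not members:
                new_modes.append(old_mode)  # keep the mode of an empty cluster
                continue
            new_modes.append(tuple(
                Counter(column).most_common(1)[0][0] for column in zip(*members)
            ))
        if new_modes == modes:  # converged: no mode changed
            break
        modes = new_modes
    labels = [min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
              for r in records]
    return modes, labels


if __name__ == "__main__":
    data = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
    print(k_modes(data, [("a", "x"), ("b", "z")]))
```

The assignment loop over records is the part that a data-parallel version would typically split across processes, with the per-cluster category counts combined before the modes are updated.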
2023-2024 Boise State University Undergraduate Catalog
This catalog is primarily for and directed at students. However, it serves many audiences, such as high school counselors, academic advisors, and the public. In this catalog you will find an overview of Boise State University and information on admission, registration, grades, tuition and fees, financial aid, housing, student services, and other important policies and procedures. However, most of this catalog is devoted to describing the various programs and courses offered at Boise State.
LIPIcs, Volume 274, ESA 2023, Complete Volume
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, data summarization and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters.
Analysis of Embedded Controllers Subject to Computational Overruns
Microcontrollers have become an integral part of modern everyday embedded systems, such as smart bikes, cars, and drones. Typically, microcontrollers operate under real-time constraints, which require the timely execution of programs on the resource-constrained hardware. As embedded systems are becoming increasingly complex, microcontrollers run the risk of violating their timing constraints, i.e., overrunning the program deadlines. Breaking these constraints can cause severe damage to both the embedded system and the humans interacting with the device. Therefore, it is crucial to analyse embedded systems properly to ensure that they do not pose any significant danger if the microcontroller overruns a few deadlines. However, there are very few tools available for assessing the safety and performance of embedded control systems when considering the implementation of the microcontroller. This thesis aims to fill this gap in the literature by presenting five papers on the analysis of embedded controllers subject to computational overruns. Details about the real-time operating system's implementation are included in the analysis, such as what happens to the controller's internal state representation when the timing constraints are violated. The contribution includes theoretical and computational tools for analysing the embedded system's stability, performance, and real-time properties. The embedded controller is analysed under three different types of timing violations: blackout events (when no control computation is completed during long periods), weakly-hard constraints (when the number of deadline overruns is constrained over a window), and stochastic overruns (when violations of timing constraints are governed by a probabilistic process). These scenarios are combined with different implementation policies to reduce the gap between the analysis and its practical applicability.
The analyses are further validated with a comprehensive experimental campaign performed on both a set of physical processes and multiple simulations. In conclusion, the findings of this thesis reveal that the effect deadline overruns have on the embedded system heavily depends on the implementation details and the system's dynamics. Additionally, the stability analysis of embedded controllers subject to deadline overruns is typically conservative, implying that additional insights can be gained by also analysing the system's performance.
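The weakly-hard constraint mentioned in this abstract (a bound on deadline overruns within a window) can be illustrated with a small check. This is a sketch under assumptions, not code from the thesis: it assumes the common "at most `max_misses` overruns in any `window` consecutive jobs" reading, and the function name `satisfies_weakly_hard` and its parameters are hypothetical.

```python
def satisfies_weakly_hard(misses, max_misses, window):
    """Return True if no run of `window` consecutive jobs contains more than
    `max_misses` deadline overruns.

    `misses` is a sequence of 0/1 (or bool) flags, one per job, where a truthy
    value marks an overrun. If the trace is shorter than `window`, the whole
    trace is treated as a single partial window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # Count overruns in the first window, then slide it one job at a time.
    count = sum(1 for m in misses[:window] if m)
    if count > max_misses:
        return False
    for i in range(window, len(misses)):
        count += (1 if misses[i] else 0) - (1 if misses[i - window] else 0)
        if count > max_misses:
            return False
    return True


if __name__ == "__main__":
    trace = [0, 1, 0, 0, 1, 0]
    print(satisfies_weakly_hard(trace, max_misses=1, window=3))
```

The sliding count keeps the check linear in the trace length, which matters when the trace comes from a long experimental run.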