Faster inference from state space models via GPU computing
Funding: C.F.-J. is funded via a doctoral scholarship from the University of St Andrews, School of Mathematics and Statistics.
Inexpensive Graphics Processing Units (GPUs) offer the potential to greatly speed up computation by employing their massively parallel architecture to perform arithmetic operations more efficiently. Population dynamics models are important tools in ecology and conservation. Modern Bayesian approaches allow biologically realistic models to be constructed and fitted to multiple data sources in an integrated modelling framework based on a class of statistical models called state space models. However, model fitting is often slow, requiring hours to weeks of computation. We demonstrate the benefits of GPU computing using a model for the population dynamics of British grey seals, fitted with a particle Markov chain Monte Carlo algorithm. Speed-ups of two orders of magnitude were obtained for estimations of the log-likelihood, compared to a traditional ‘CPU-only’ implementation, allowing an accurate method of inference to be used where it was previously too computationally expensive to be viable. GPU computing has enormous potential, but one barrier to further adoption is a steep learning curve, due to GPUs' unique hardware architecture. We provide a detailed description of hardware and software setup, and our case study provides a template for other similar applications. We also provide a detailed tutorial-style description of GPU hardware architectures, and examples of important GPU-specific programming practices.
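As a rough illustration of why the log-likelihood estimation parallelises so well, the sketch below implements a bootstrap particle filter for a toy AR(1) state space model: each particle is propagated and weighted independently, which is exactly the structure that maps onto thousands of GPU threads. This is not the seal model or the authors' code; the model, parameters, and the OpenMP-on-CPU parallelisation are illustrative assumptions.

```cpp
// Hedged sketch: bootstrap particle filter log-likelihood for a toy model
// x_t = phi*x_{t-1} + N(0, sigma_x^2),  y_t = x_t + N(0, sigma_y^2).
// The per-particle loop is embarrassingly parallel; on a GPU each particle
// would map to one thread. OpenMP stands in for that parallelism here.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

double pf_loglik(const std::vector<double>& y, int N, double phi,
                 double sigma_x, double sigma_y, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> prior(0.0, sigma_x);
    std::vector<double> x(N), w(N), resampled(N);
    for (int i = 0; i < N; ++i) x[i] = prior(rng);

    const double norm = 1.0 / (sigma_y * std::sqrt(2.0 * std::acos(-1.0)));
    double loglik = 0.0;
    for (std::size_t t = 0; t < y.size(); ++t) {
        double sumw = 0.0;
        // Propagate and weight each particle independently (GPU-friendly).
        #pragma omp parallel for reduction(+:sumw)
        for (int i = 0; i < N; ++i) {
            std::mt19937 local(seed + 1u + static_cast<unsigned>(t) * N + i);
            std::normal_distribution<double> noise(0.0, sigma_x);
            x[i] = phi * x[i] + noise(local);
            double d = y[t] - x[i];
            w[i] = norm * std::exp(-0.5 * d * d / (sigma_y * sigma_y));
            sumw += w[i];
        }
        loglik += std::log(sumw / N);   // log of the mean importance weight

        // Multinomial resampling (kept serial for brevity).
        std::discrete_distribution<int> pick(w.begin(), w.end());
        for (int i = 0; i < N; ++i) resampled[i] = x[pick(rng)];
        x.swap(resampled);
    }
    return loglik;
}

int main() {
    std::vector<double> y = {0.3, -0.1, 0.7, 0.2, -0.4};   // toy observations
    std::printf("log-likelihood ~ %f\n", pf_loglik(y, 10000, 0.9, 0.5, 0.3, 42));
}
```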
Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs
The most recent generations of graphics processing units (GPUs) boost the execution of convolutional operations required by machine learning applications by resorting to specialized and efficient in-chip accelerators (Tensor Core Units or TCUs) that operate on matrix multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologies are increasingly prone to hardware defects, and the trend to highly stress TCUs during the execution of safety-critical and high-performance computing (HPC) applications increases the likelihood of TCUs producing different kinds of failures. In fact, the intrinsic resiliency of arithmetic units to hardware faults plays a crucial role in safety-critical applications using GPUs (e.g., in automotive, space, and autonomous robotics). Recently, new arithmetic formats have been proposed, particularly those suited to neural network execution. However, the reliability characterization of TCUs supporting different arithmetic formats has been lacking. In this work, we quantitatively assessed the impact of hardware faults in TCU structures while employing two distinct formats (floating-point and posit) and using two different configurations (16 and 32 bits) to represent real numbers. For the experimental evaluation, we resorted to an architectural description of a TCU core (PyOpenTCU) and performed 120 fault simulation campaigns, injecting around 200,000 faults per campaign and requiring around 32 days of computation. Our results demonstrate that the posit format of TCUs is less affected by faults than the floating-point one (by up to three orders of magnitude for 16 bits and up to twenty orders for 32 bits). We also identified the most sensitive fault locations (i.e., those that produce the largest errors), thus paving the way to adopting smart hardening solutions.
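To make the fault-injection methodology concrete, here is a hedged, self-contained sketch of a single-bit-flip campaign on an IEEE-754 float32 dot product. It is not PyOpenTCU, does not model the TCU datapath, and omits the posit format (which needs a dedicated arithmetic library); it only illustrates how one flipped bit propagates into the output error that such campaigns measure.

```cpp
// Minimal fault-injection sketch: flip each bit of one operand of a small
// dot product, one experiment at a time, and compare against the golden
// (fault-free) result. All operand values are illustrative.
#include <cstdint>
#include <cstring>
#include <cstdio>
#include <cmath>

float flip_bit(float v, int bit) {
    uint32_t raw;
    std::memcpy(&raw, &v, sizeof raw);   // reinterpret the bits without UB
    raw ^= (1u << bit);
    std::memcpy(&v, &raw, sizeof v);
    return v;
}

int main() {
    const int n = 16;
    float a[n], b[n];
    for (int i = 0; i < n; ++i) { a[i] = 0.5f + 0.01f * i; b[i] = 1.0f - 0.02f * i; }

    // Golden (fault-free) result.
    float golden = 0.0f;
    for (int i = 0; i < n; ++i) golden += a[i] * b[i];

    // Campaign: inject a single bit flip into a[0] for each of its 32 bits.
    for (int bit = 0; bit < 32; ++bit) {
        float faulty = flip_bit(a[0], bit) * b[0];
        for (int i = 1; i < n; ++i) faulty += a[i] * b[i];
        std::printf("bit %2d: abs error = %g\n", bit, std::fabs(faulty - golden));
    }
    return 0;
}
```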
Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems
This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms
The increasing size of input graphs for graph neural networks (GNNs)
highlights the demand for using multi-GPU platforms. However, existing
multi-GPU GNN systems optimize the computation and communication individually
based on the conventional practice of scaling dense DNNs. For irregularly
sparse and fine-grained GNN workloads, such solutions miss the opportunity to
jointly schedule/optimize the computation and communication operations for
high-performance delivery. To this end, we propose MGG, a novel system design
to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its
novel dynamic software pipeline to facilitate fine-grained
computation-communication overlapping within a GPU kernel. Specifically, MGG
introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to
facilitate workload balancing and operation overlapping. MGG also incorporates
an intelligent runtime design with analytical modeling and optimization
heuristics to dynamically improve the execution performance. Extensive
evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems
across various settings: on average 4.41X, 4.81X, and 10.83X faster than DGL,
MGG-UVM, and ROC, respectively.
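The intra-kernel overlap idea can be illustrated, very loosely, on the host: while the remote neighbour data for the next chunk of nodes is being fetched, the current chunk is aggregated. The sketch below uses std::async as a stand-in for MGG's GPU-side pipeline; it is not the actual CUDA/NVSHMEM implementation, and all names and data structures are illustrative.

```cpp
// Host-side analogy of a two-stage software pipeline: keep one "communication"
// task (remote feature fetch) in flight while "computation" (aggregation)
// processes the previously fetched chunk.
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

using Chunk = std::vector<float>;

Chunk fetch_remote(int chunk_id) {            // stand-in for a remote read
    return Chunk(1024, 0.001f * chunk_id);
}

float aggregate(const Chunk& c) {             // stand-in for neighbour aggregation
    return std::accumulate(c.begin(), c.end(), 0.0f);
}

int main() {
    const int num_chunks = 8;
    float total = 0.0f;

    // Prefetch chunk 0, then always keep the next fetch overlapping the
    // aggregation of the current chunk.
    std::future<Chunk> inflight = std::async(std::launch::async, fetch_remote, 0);
    for (int k = 0; k < num_chunks; ++k) {
        Chunk cur = inflight.get();
        if (k + 1 < num_chunks)
            inflight = std::async(std::launch::async, fetch_remote, k + 1);
        total += aggregate(cur);              // overlaps with the next fetch
    }
    std::printf("aggregate = %f\n", total);
    return 0;
}
```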
Towards a centralized multicore automotive system
Today’s automotive systems are inundated with embedded electronics to host chassis, powertrain, infotainment, advanced driver assistance systems, and other modern vehicle functions. As many as 100 embedded microcontrollers execute hundreds of millions of lines of code in a single vehicle. To control the increasing complexity in vehicle electronics and services, automakers are planning to consolidate different on-board automotive functions as software tasks on centralized multicore hardware platforms. However, these vehicle software services have different and contrasting timing, safety, and security requirements. Existing vehicle operating systems are ill-equipped to provide all the required service guarantees on a single machine. A centralized automotive system aims to tackle this by assigning software tasks to multiple criticality domains or levels according to their consequences of failures, or international safety standards like ISO 26262. This research investigates several emerging challenges in time-critical systems for a centralized multicore automotive platform and proposes a novel vehicle operating system framework to address them.
This thesis first introduces an integrated vehicle management system (VMS), called DriveOS™, for a PC-class multicore hardware platform. Its separation kernel design enables temporal and spatial isolation among critical and non-critical vehicle services in different domains on the same machine. Time- and safety-critical vehicle functions are implemented in a sandboxed Real-time Operating System (OS) domain, and non-critical software is developed in a sandboxed general-purpose OS (e.g., Linux, Android) domain. To leverage the advantages of model-driven vehicle function development, DriveOS provides a multi-domain application framework in Simulink. This thesis also presents a real-time task pipeline scheduling algorithm in multiprocessors for communication between connected vehicle services with end-to-end guarantees. The benefits and performance of the overall automotive system framework are demonstrated with hardware-in-the-loop testing using real-world applications, car datasets and simulated benchmarks, and with an early-stage deployment in a production-grade luxury electric vehicle.
RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
We present RecD (Recommendation Deduplication), a suite of end-to-end
infrastructure optimizations across the Deep Learning Recommendation Model
(DLRM) training pipeline. RecD addresses immense storage, preprocessing, and
training overheads caused by feature duplication inherent in industry-scale
DLRM training datasets. Feature duplication arises because DLRM datasets are
generated from interactions. While each user session can generate multiple
training samples, many features' values do not change across these samples. We
demonstrate how RecD exploits this property, end-to-end, across a deployed
training pipeline. RecD optimizes data generation pipelines to decrease dataset
storage and preprocessing resource demands and to maximize duplication within a
training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors
(IKJTs), to deduplicate feature values in each batch. We show how DLRM model
architectures can leverage IKJTs to drastically increase training throughput.
RecD improves the training and preprocessing throughput and storage efficiency
by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM
training system.
Comment: Published in the Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys 2023).
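The deduplication idea is simple to sketch: store each distinct feature value once per batch and let every sample index into that pool, so session-invariant features are materialised only once. The structure below is a hypothetical, simplified stand-in for IKJTs, not Meta's actual tensor format.

```cpp
// Hypothetical deduplicated-batch layout: a pool of unique feature values
// plus one index per training sample pointing into that pool.
#include <string>
#include <unordered_map>
#include <vector>
#include <cstdio>

struct DedupedBatch {
    std::vector<std::string> unique_values;   // one copy of each feature value
    std::vector<size_t>      sample_to_value; // per-sample index into the pool
};

DedupedBatch dedup(const std::vector<std::string>& per_sample_values) {
    DedupedBatch batch;
    std::unordered_map<std::string, size_t> seen;
    for (const auto& v : per_sample_values) {
        auto [it, inserted] = seen.emplace(v, batch.unique_values.size());
        if (inserted) batch.unique_values.push_back(v);
        batch.sample_to_value.push_back(it->second);
    }
    return batch;
}

int main() {
    // Three samples from one user session share the same profile feature.
    auto batch = dedup({"profile_A", "profile_A", "profile_A", "profile_B"});
    std::printf("%zu samples, %zu unique values\n",
                batch.sample_to_value.size(), batch.unique_values.size());
    return 0;
}
```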
Smoothed particle hydrodynamics method for free surface flow based on MPI parallel computing
In the field of computational fluid dynamics (CFD), smoothed particle hydrodynamics (SPH) is well suited to simulating problems involving large deformation, free-surface flow, and other complex flow scenarios. However, traditional SPH methods suffer from high computational complexity, which constrains their application in scenarios with accuracy requirements. DualSPHysics is a well-established SPH code developed in academia. Building on this tool, this paper presents a large-scale parallel SPH framework, parallelDualSPHysics, capable of simulating large-scale free surface flows. First, an efficient domain decomposition algorithm is proposed, and the data structures of DualSPHysics are reshaped for the parallel framework. Second, we propose a strategy that overlaps computation and communication in the parallel particle-interaction and particle-update modules, which greatly improves the parallel efficiency of the SPH method. Finally, we add pre-processing and post-processing modules so that parallelDualSPHysics can run on modern high-performance computers. A thorough evaluation shows that the tested cases, ranging from 3 to 120 million particles, maintain more than 90% computing efficiency, demonstrating that the parallel strategy achieves superior parallel efficiency.
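The computation/communication overlap described above follows a familiar MPI pattern: post non-blocking sends and receives for boundary (halo) particles, compute the interior interactions while the messages are in flight, then finish the halo interactions. The sketch below shows that pattern in bare-bones form; it is not parallelDualSPHysics code, and the 1-D periodic decomposition, buffer sizes, and kernel stubs are illustrative assumptions.

```cpp
// Overlapping halo exchange with interior computation using non-blocking MPI.
#include <mpi.h>
#include <vector>

void compute_interior(std::vector<double>&) { /* SPH sums, interior particles */ }
void compute_halo(std::vector<double>&, const std::vector<double>&) { /* boundary */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> particles(100000, 1.0);   // local particle data
    std::vector<double> send_halo(1000, 1.0);     // boundary layer to ship out
    std::vector<double> recv_halo(1000, 0.0);     // ghost particles from neighbour

    int left  = (rank - 1 + size) % size;         // 1-D periodic decomposition
    int right = (rank + 1) % size;

    MPI_Request reqs[2];
    MPI_Irecv(recv_halo.data(), (int)recv_halo.size(), MPI_DOUBLE, left,  0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_halo.data(), (int)send_halo.size(), MPI_DOUBLE, right, 0,
              MPI_COMM_WORLD, &reqs[1]);

    compute_interior(particles);                  // overlaps with the exchange

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_halo(particles, recv_halo);           // needs the ghost data

    MPI_Finalize();
    return 0;
}
```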
Efficient GPU Offloading with OpenMP for a Hyperbolic Finite Volume Solver on Dynamically Adaptive Meshes
We identify and show how to overcome an OpenMP bottleneck in the administration of GPU memory. It arises for a wave equation solver on dynamically adaptive block-structured Cartesian meshes, which keeps all CPU threads busy and allows all of them to offload sets of patches to the GPU. Our studies show that multithreaded, concurrent, non-deterministic access to the GPU leads to performance breakdowns, since the GPU memory bookkeeping as offered through OpenMP’s map clause, i.e., the allocation and freeing, becomes another runtime challenge besides expensive data transfer and actual computation. We, therefore, propose to retain the memory management responsibility on the host: a caching mechanism acquires memory on the accelerator for all CPU threads, keeps hold of this memory and hands it out to the offloading threads upon demand. We show that this user-managed, CPU-based memory administration helps us to overcome the GPU memory bookkeeping bottleneck and speeds up the time-to-solution of Finite Volume kernels by more than an order of magnitude.
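A minimal sketch of the host-managed cache idea, assuming OpenMP 4.5+ device-memory routines (omp_target_alloc/omp_target_free) and an is_device_ptr offload region; it is not the paper's implementation. Buffers are allocated on first use, handed out to offloading CPU threads from a locked pool, reused across offloads, and only freed when the cache is destroyed, so map-clause allocation and freeing are avoided on the hot path.

```cpp
// Host-side cache of device buffers shared by all offloading CPU threads.
#include <omp.h>
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

class DeviceBufferCache {
    std::mutex mtx;
    std::unordered_map<std::size_t, std::vector<void*>> free_list;  // size -> buffers
    int dev = omp_get_default_device();
public:
    void* acquire(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mtx);
        auto& pool = free_list[bytes];
        if (!pool.empty()) { void* p = pool.back(); pool.pop_back(); return p; }
        return omp_target_alloc(bytes, dev);          // allocate only on a miss
    }
    void release(std::size_t bytes, void* p) {
        std::lock_guard<std::mutex> lock(mtx);
        free_list[bytes].push_back(p);                // keep for reuse, no free
    }
    ~DeviceBufferCache() {
        for (auto& kv : free_list)
            for (void* p : kv.second) omp_target_free(p, dev);
    }
};

int main() {
    DeviceBufferCache cache;
    const std::size_t n = 1 << 20;

    // Each CPU thread offloads one patch; device memory comes from the cache.
    #pragma omp parallel for
    for (int patch = 0; patch < 8; ++patch) {
        double* d = static_cast<double*>(cache.acquire(n * sizeof(double)));
        #pragma omp target teams distribute parallel for is_device_ptr(d)
        for (std::size_t i = 0; i < n; ++i) d[i] = 0.0;   // stand-in FV kernel
        cache.release(n * sizeof(double), d);
    }
    return 0;
}
```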
ACiS: smart switches with application-level acceleration
Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network that transports data) to take on additional functionality (e.g., also operating on that data). More generally, integration enables compute-everywhere: not just in the CPU and accelerator, but also in the network and, more specifically, in the communication switches.
In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline through introducing composable plugins to successively add ACiS capabilities.
In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems.
In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate. Functions requiring loops and state are addressed by this method. The proposed in-switch accelerator is based on a RISC-V-compatible Coarse-Grained Reconfigurable Array (CGRA).
To facilitate usability, we have developed a framework to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator.
In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity to fuse communication (collectives) with computation. Since the computation can vary across applications, ACiS support should be programmable in this method.
In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes.
We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, when considering a Graph Convolutional Network (GCN) application as a case study, a speedup of on average 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities