
    Development of a SQUID magnetometry system for cryogenic neutron electric dipole moment experiment

    A measurement of the neutron electric dipole moment (nEDM) could hold the key to understanding why the visible universe is the way it is: why matter should predominate over antimatter. As a charge-parity violating (CPV) quantity, an nEDM could provide an insight into new mechanisms that address this baryon asymmetry. The motivation for an improved sensitivity to an nEDM is to find it to be non-zero at a level consistent with certain beyond the Standard Model theories that predict new sources of CPV, or to establish a new limit that constrains them. CryoEDM is an experiment that sought to better the current limit of $|d_n| < 2.9 \times 10^{-26}\,e\,$cm by an order of magnitude. It is designed to measure the nEDM via the Ramsey Method of Separated Oscillatory Fields, in which it is critical that the magnetic field remains stable throughout. A way of accurately tracking the magnetic fields, moreover at a temperature of $\sim 0.5$\,K, is crucial for CryoEDM and for future cryogenic projects. This thesis presents work focussing on the development of a 12-SQUID magnetometry system for CryoEDM, which enables the magnetic field to be monitored to a precision of $0.1$\,pT. A major component of its infrastructure is the superconducting capillary shields, which screen the input lines of the SQUIDs from the pick-up of spurious magnetic fields that would perturb a SQUID's measurement. These are shown to have a transverse shielding factor of $> 1 \times 10^{7}$, which is a few orders of magnitude greater than the calculated requirement. Efforts to characterise the shielding of the SQUID chips themselves are also discussed. The use of Cryoperm for shields reveals a tension between improved SQUID noise and worse neutron statistics. Investigations show that without it, SQUIDs have an elevated noise when cooled in a substantial magnetic field; with it, magnetostatic simulations suggest that it is detrimental to the polarisation of neutrons in transport. The findings suggest that with proper consideration, it is possible to reach a compromise between the two behaviours. Computational work to develop a simulation of SQUID data is detailed, which is based on the Laplace equation for the magnetic scalar potential. These data are ultimately used in the development of a linear regression technique to determine the volume-averaged magnetic field in the neutron cells. This proves highly effective in determining the fields within the $0.1$\,pT requirement under certain conditions.
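    The final step described above, estimating the volume-averaged field in a neutron cell from a handful of SQUID readings by linear regression, can be illustrated with a minimal sketch. Everything below (the toy field model, the sensor positions, the cell geometry) is a hypothetical stand-in and not the thesis' actual code or configuration.

```python
# Hypothetical sketch: infer the volume-averaged field in a cell from 12 point
# sensor readings via linear regression, in the spirit of the abstract above.
# The field model, sensor positions and cell geometry are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_sensors = 12
sensors = rng.uniform(-0.5, 0.5, size=(n_sensors, 3))   # sensor positions (m)
cell = rng.uniform(-0.1, 0.1, size=(2000, 3))            # points sampling the cell volume

def bz(points, coeffs):
    """Toy field: a uniform component plus a linear gradient, Bz = c0 + g . r."""
    c0, g = coeffs[0], coeffs[1:]
    return c0 + points @ g

# Build a training set of (sensor readings, volume-averaged field) pairs.
X, y = [], []
for _ in range(500):
    coeffs = rng.normal(size=4) * np.array([1.0, 0.1, 0.1, 0.1])  # pT, pT/m
    X.append(bz(sensors, coeffs))
    y.append(bz(cell, coeffs).mean())
X, y = np.array(X), np.array(y)

# Least-squares fit of a linear map from readings to the volume average.
W, *_ = np.linalg.lstsq(X, y, rcond=None)

# Check on an unseen field configuration.
coeffs = rng.normal(size=4) * np.array([1.0, 0.1, 0.1, 0.1])
estimate = bz(sensors, coeffs) @ W
truth = bz(cell, coeffs).mean()
print(f"estimated {estimate:.4f} pT vs true {truth:.4f} pT")
```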

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose a subset of hard-coded optimisations or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewritings. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
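    The idea of treating parallelisation as a constraint satisfaction problem can be illustrated with a minimal sketch: candidate loops are assigned to GPU mapping levels and only assignments satisfying simple scheduling constraints are kept. The loops, levels and constraints below are invented toys; LIFT's real IR and inferred constraints are far richer.

```python
# Hypothetical sketch of parallelisation as constraint satisfaction: assign each
# parallelisable loop of a toy expression to a GPU mapping level and keep only
# assignments that satisfy simple scheduling constraints.
from itertools import product

loops = ["i", "j", "k"]                        # candidate parallel loops (outer to inner)
levels = ["workgroup", "local", "sequential"]  # available mapping levels

def valid(assign):
    order = {name: pos for pos, name in enumerate(loops)}
    wg = [order[l] for l, lvl in assign.items() if lvl == "workgroup"]
    local = [order[l] for l, lvl in assign.items() if lvl == "local"]
    # A work-group loop must enclose any work-item (local) loop.
    if wg and local and max(wg) > min(local):
        return False
    # At most one loop per hardware level (toy restriction).
    for lvl in ("workgroup", "local"):
        if sum(1 for v in assign.values() if v == lvl) > 1:
            return False
    # Something must actually run in parallel.
    return any(v != "sequential" for v in assign.values())

solutions = [dict(zip(loops, combo))
             for combo in product(levels, repeat=len(loops))
             if valid(dict(zip(loops, combo)))]
for s in solutions:
    print(s)
```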

    Automatic Loop Nest Parallelization for the Predictable Execution Model

    Currently, embedded real-time systems still widely use single-core processors. A major challenge in the adoption of multicore processors is the presence of shared hardware resources such as main memory. Contention between threads executing on different cores for access to such resources makes it difficult to tightly estimate the Worst-Case Execution Time (WCET) of applications. To safely employ multicore processors in real-time systems, previous work has introduced a PRedictable Execution Model (PREM) for embedded Multi-Processor Systems-on-a-Chip (MPSoCs). Under PREM, each thread is divided into memory phases, where the code and data required by the thread are moved from main memory to a local memory (cache or scratchpad) or vice versa, and execution phases, where the thread computes based on the code and data available in local memory. Memory phases are then scheduled by the Operating System (OS) to avoid contention among threads, thus resulting in tight WCET bounds. The main challenge in applying the model is to automatically generate optimized PREM-compliant code instead of rewriting programs manually. Note that many programs of interest, such as emerging AI and neural network kernels, comprise both compute-intensive and memory-intensive deeply nested loops. Hence, PREM code generation and optimization should be applicable to nested loop structures and consider whether performance is constrained by computation or by memory transfers. In this thesis, we address the problem of automatically parallelizing and optimizing programs with nested loop structures by presenting a workflow that automatically generates PREM-compliant optimized code. To correctly model the structure of nested loop programs, we leverage existing polyhedral compilation tools that analyze the original program and generate optimized executables. Two main techniques are adopted for optimization: loop tiling and parallelization. We build a timing model to estimate the length of execution and memory phases, and then construct a Directed Acyclic Graph (DAG) of program phases to estimate the program's makespan. During this process, our framework searches for the combination of tile sizes and thread counts that minimizes the makespan of the program; given the complexity of the optimization problem, we design a heuristic algorithm to find solutions close to the optimal. Finally, to show its usefulness, we evaluate our technique on the gem5 architectural simulator using computational kernels from the PolyBench-NN benchmark.
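    A minimal sketch of the tile-size and thread-count search described above is given below: a toy timing model estimates memory- and execution-phase lengths per tile, and a scan keeps the configuration with the smallest estimated makespan. The cost model, constants and kernel are invented for illustration and do not reproduce the thesis' timing model or heuristic.

```python
# Hypothetical sketch: choose (tile size, thread count) to minimise an estimated
# PREM makespan, using an invented cost model for memory and execution phases.
from itertools import product

ARRAY = 1 << 20          # elements to process
LOCAL_MEM = 1 << 15      # elements that fit in local memory per core
BW = 8e9                 # bytes/s of main-memory bandwidth (serialised under PREM)
FLOPS = 2e9              # per-core compute rate

def makespan(tile, threads):
    if tile > LOCAL_MEM:
        return float("inf")             # a tile must fit in local memory
    tiles = ARRAY // tile
    mem_phase = tile * 4 / BW           # load one tile of 4-byte elements
    exec_phase = tile * 10 / FLOPS      # 10 operations per element (toy kernel)
    # Memory phases are serialised across cores; execution phases overlap.
    per_round = mem_phase * threads + exec_phase
    rounds = -(-tiles // threads)       # ceiling division
    return rounds * per_round

best = min(product([2**k for k in range(10, 16)], [1, 2, 4, 8]),
           key=lambda cfg: makespan(*cfg))
print("best (tile, threads):", best, "estimated makespan (s):", makespan(*best))
```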

    ACiS: smart switches with application-level acceleration

    Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster and denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally served a single function (e.g., a network that transports data) to take on additional functionality (e.g., also operating on that data). More generally, integration enables compute-everywhere: not just in the CPU and accelerator, but also in the network and, more specifically, in the communication switches. In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline, introducing composable plugins to successively add ACiS capabilities. In the first work, we propose an inline accelerator for communication switches that supports user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems. In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate. Functions requiring loops and state are addressed in this method. The proposed in-switch accelerator is based on a RISC-V-compatible Coarse-Grained Reconfigurable Array (CGRA). To facilitate usability, we have developed a framework to compile user-provided C/C++ code into appropriate back-end instructions for configuring the accelerator. In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity to fuse communication (collectives) with computation. Since the computation can vary across applications, ACiS support is made programmable in this method. In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information, monitor application behavior, and then use this information for control, decision-making, and coordination of nodes. We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, with a Graph Convolutional Network (GCN) application as a case study, a speedup of on average 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities.
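    The collective-plus-map fusion mentioned in the third contribution can be illustrated with a minimal functional model: a toy "switch" reduces the vectors it receives from the nodes and applies a user-defined map before multicasting the result back. This plain-Python model is an assumption-laden illustration, not the ACiS hardware or any real MPI API.

```python
# Hypothetical sketch of a fused collective: an in-switch reduction followed by a
# fused map on the reduced value, then a multicast back to the contributing nodes.
import numpy as np

class ToySwitch:
    def fused_allreduce(self, contributions, reduce_op=np.add, map_fn=None):
        acc = contributions[0].copy()
        for c in contributions[1:]:
            acc = reduce_op(acc, c)          # in-switch reduction
        if map_fn is not None:
            acc = map_fn(acc)                # fused map on the reduced value
        return [acc.copy() for _ in contributions]  # multicast back to all nodes

nodes = [np.arange(4, dtype=float) + rank for rank in range(3)]
switch = ToySwitch()
results = switch.fused_allreduce(nodes, map_fn=lambda x: x / len(nodes))  # e.g. averaging
print(results[0])   # every node receives the same mapped, reduced vector
```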

    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, in resources and in power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than with prior work. This thesis also addresses the challenges lying in the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53 and 9.47 times respectively on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvements of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
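    The channel-permutation view of group-convolution pruning mentioned above can be sketched as follows: permute input channels so that, after the layer is split into groups, the retained block-diagonal weights keep as much magnitude as possible. The greedy toy below is only a stand-in for SPGC's actual polynomial-time heuristic; sizes and grouping rules are invented.

```python
# Hypothetical sketch: assign input channels to groups of a 1x1 convolution so
# that within-group (block-diagonal) weights retain most of the total magnitude.
import numpy as np

rng = np.random.default_rng(1)
out_ch, in_ch, groups = 8, 8, 2
W = rng.normal(size=(out_ch, in_ch))            # 1x1 conv weights for simplicity

# Output channels are split into contiguous groups; every input channel then
# joins the group whose outputs it feeds most strongly, subject to capacity.
out_groups = np.array_split(np.arange(out_ch), groups)
scores = np.stack([np.abs(W[g]).sum(axis=0) for g in out_groups])  # groups x in_ch
capacity = in_ch // groups
assignment = [[] for _ in range(groups)]
for c in np.argsort(-scores.max(axis=0)):       # strongest channels first
    for g in np.argsort(-scores[:, c]):         # preferred group first
        if len(assignment[g]) < capacity:
            assignment[g].append(c)
            break

kept = sum(np.abs(W[np.ix_(out_groups[g], assignment[g])]).sum() for g in range(groups))
print(f"weight magnitude kept by the grouping: {kept / np.abs(W).sum():.2%}")
```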

    Data-driven deep-learning methods for the accelerated simulation of Eulerian fluid dynamics

    Deep-learning (DL) methods for the fast inference of the temporal evolution of fluid-dynamics systems, based on the previous recognition of features underlying large sets of fluid-dynamics data, have been studied. Specifically, models based on convolutional neural networks (CNNs) and graph neural networks (GNNs) were proposed and discussed. A U-Net, a popular fully-convolutional architecture, was trained to infer wave dynamics on liquid surfaces surrounded by walls, given as input the system state at previous time points. A term penalising the error of the spatial derivatives was added to the loss function, which resulted in a suppression of spurious oscillations and a more accurate location and length of the predicted wavefronts. This model proved to generalise accurately to complex wall geometries not seen during training. As opposed to the image data structures processed by CNNs, graphs offer greater freedom in how data is organised and processed. This motivated the use of graphs to represent the state of fluid-dynamics systems discretised by unstructured sets of nodes, and of GNNs to process such graphs. Graphs enabled more accurate representations of curvilinear geometries and higher-resolution placement exclusively in areas where the physics is more challenging to resolve. Two novel GNN architectures were designed for fluid-dynamics inference: the MuS-GNN, a multi-scale GNN, and the REMuS-GNN, a rotation-equivariant multi-scale GNN. Both architectures work by repeatedly passing messages from each node to its nearest nodes in the graph. Additionally, lower-resolution graphs, with a reduced number of nodes, are defined from the original graph, and messages are also passed from finer to coarser graphs and vice versa. The low-resolution graphs allow physics spanning a range of length scales to be captured efficiently. Advection and fluid flow, modelled by the incompressible Navier-Stokes equations, were the two types of problems used to assess the proposed GNNs. Whereas a single-scale GNN was sufficient to achieve high generalisation accuracy in advection simulations, flow simulation benefited greatly from an increasing number of low-resolution graphs. The generalisation and long-term accuracy of these simulations were further improved by the REMuS-GNN architecture, which processes the system state independently of the orientation of the coordinate system thanks to a rotation-invariant representation and carefully designed components. To the best of the author's knowledge, the REMuS-GNN architecture was the first rotation-equivariant and multi-scale GNN. The simulations were accelerated by between one (on a CPU) and three (on a GPU) orders of magnitude with respect to a CPU-based numerical solver. Additionally, the parallelisation of multi-scale GNNs resulted in a close-to-linear speedup with the number of CPU cores or GPUs.
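    The spatial-derivative penalty described for the U-Net can be sketched with finite differences: the loss compares not only the predicted field with the target but also their spatial gradients, which discourages spurious oscillations. The exact stencil, weighting and framework used in the thesis are not reproduced here; this is an assumed formulation for illustration.

```python
# Hypothetical sketch of a loss with an added spatial-derivative error term.
import numpy as np

def gradient_penalised_loss(pred, target, lam=0.5):
    """MSE on the field plus lam * MSE on its finite-difference gradients."""
    mse = np.mean((pred - target) ** 2)
    dx_p, dy_p = np.gradient(pred)
    dx_t, dy_t = np.gradient(target)
    grad_mse = np.mean((dx_p - dx_t) ** 2) + np.mean((dy_p - dy_t) ** 2)
    return mse + lam * grad_mse

rng = np.random.default_rng(2)
target = np.sin(np.linspace(0, 3 * np.pi, 64))[None, :] * np.ones((64, 1))
noisy_pred = target + 0.05 * rng.normal(size=target.shape)   # oscillatory error
smooth_pred = np.roll(target, 1, axis=1)                     # shifted but smooth
print(gradient_penalised_loss(noisy_pred, target))
print(gradient_penalised_loss(smooth_pred, target))
```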

    Advancing Carbon Sequestration through Smart Proxy Modeling: Leveraging Domain Expertise and Machine Learning for Efficient Reservoir Simulation

    Geological carbon sequestration (GCS) offers a promising solution to effectively manage excess carbon, mitigating the impact of climate change. This doctoral research introduces a cutting-edge Smart Proxy Modeling-based framework, integrating artificial neural networks (ANNs) and domain expertise, to re-engineer and empower numerical reservoir simulation for efficient modeling of CO2 sequestration and to demonstrate the predictive conformance and replicative capabilities of smart proxy modeling. Creating well-performing proxy models requires extensive human intervention and trial-and-error processes. Additionally, a large training database is essential for an ANN model on complex tasks such as deep saline aquifer CO2 sequestration, since it serves as the neural network's input and output data. One major limitation in CCS programs is the lack of real field data, owing to the scarcity of field applications and to issues with confidentiality. Considering these drawbacks, and given the high-dimensional nonlinearity, heterogeneity, and coupling of multiple physical processes associated with numerical reservoir simulation, novel research is needed to handle these complexities, since simulation allows for the creation of possible CO2 sequestration scenarios that may be used as a training set. This study addresses several types of static and dynamic, realistic and practical, field-based data augmentation techniques spanning spatial complexity, spatio-temporal complexity, and heterogeneity of reservoir characteristics. By incorporating domain-expertise-based feature generation, this framework honors a precise representation of the reservoir while overcoming the computational challenges associated with numerical reservoir tools. The developed ANN accurately replicated fluid flow behavior, resulting in significant computational savings compared to traditional numerical simulation models. The results showed that all the ML models achieved very good accuracy and high efficiency. The findings revealed that the quality of the path between the focal cell and the injection wells emerged as the most crucial factor in both the CO2 saturation and the pressure estimation models. These insights significantly contribute to our understanding of CO2 plume monitoring, paving the way for breakthroughs in investigating reservoir behavior at minimal computational cost. The study's commitment to replicating numerical reservoir simulation results underscores the model's potential to contribute valuable insights into the behavior and performance of CO2 sequestration systems, as a complementary tool to numerical reservoir simulation when no measured field data are available. The transformative nature of this research has vast implications for advancing carbon storage modeling technologies. By addressing the computational limitations of traditional numerical reservoir models and harnessing the synergy between machine learning and domain expertise, this work provides a practical workflow for efficient decision-making in sequestration projects.
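    A minimal sketch of a smart-proxy setup in the spirit described above is given below: an ANN maps per-cell features, including a toy "path quality to injector" score echoing the finding reported in the abstract, to CO2 saturation, standing in for the numerical simulator. The features, synthetic data and network size are all invented; the thesis' actual feature engineering and data are not reproduced.

```python
# Hypothetical sketch: a small ANN proxy trained on synthetic per-cell features
# to predict CO2 saturation, as a stand-in for full numerical reservoir simulation.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_cells = 5000
dist_to_injector = rng.uniform(10, 500, n_cells)        # m
permeability = rng.lognormal(3.0, 0.5, n_cells)         # mD
porosity = rng.uniform(0.1, 0.3, n_cells)
path_quality = permeability / (1.0 + dist_to_injector)  # toy path-quality feature

# Synthetic "simulator output" to train against (stand-in for real runs).
saturation = np.tanh(0.02 * path_quality) * porosity / 0.3
saturation += 0.02 * rng.normal(size=n_cells)

X = np.column_stack([dist_to_injector, permeability, porosity, path_quality])
X_tr, X_te, y_tr, y_te = train_test_split(X, saturation, random_state=0)

proxy = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
proxy.fit(X_tr, y_tr)
print("R^2 on held-out cells:", round(proxy.score(X_te, y_te), 3))
```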

    Novel neural architectures & algorithms for efficient inference

    In the last decade, the machine learning universe embraced deep neural networks (DNNs) wholeheartedly with the advent of neural architectures such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), transformers, etc. These models have empowered many applications, such as ChatGPT and Imagen, and have achieved state-of-the-art (SOTA) performance on many vision, speech, and language modeling tasks. However, SOTA performance comes with various issues, such as large model size, compute-intensive training, increased inference latency, and higher working memory. This thesis aims at improving the resource efficiency of neural architectures, i.e., significantly reducing the computational, storage, and energy consumption of a DNN without any significant loss in performance. Towards this goal, we explore novel neural architectures as well as training algorithms that allow low-capacity models to achieve near-SOTA performance. We divide this thesis into two dimensions: Efficient Low-Complexity Models, and Input-Hardness-Adaptive Models. Along the first dimension, i.e., Efficient Low-Complexity Models, we improve DNN performance by addressing instabilities in the existing architectures and training methods. We propose novel neural architectures inspired by ordinary differential equations (ODEs) to reinforce input signals and attend to salient feature regions. In addition, we show that carefully designed training schemes improve the performance of existing neural networks. We divide this exploration into two parts. (a) Efficient Low-Complexity RNNs: we improve RNN resource efficiency by addressing poor gradients, noise amplification, and issues with Backpropagation Through Time (BPTT) training. First, we improve RNNs by solving ODEs that eliminate vanishing and exploding gradients during training. To do so, we present Incremental Recurrent Neural Networks (iRNNs) that keep track of increments in the equilibrium surface. Next, we propose Time-Adaptive RNNs that mitigate the noise propagation issue in RNNs by modulating the time constants in the ODE-based transition function. We empirically demonstrate the superiority of ODE-based neural architectures over existing RNNs. Finally, we propose the Forward Propagation Through Time (FPTT) algorithm for training RNNs. We show that FPTT yields significant gains compared to the more conventional BPTT scheme. (b) Efficient Low-Complexity CNNs: next, we improve CNN architectures by reducing their resource usage. CNNs require greater depth to generate high-level features, resulting in computationally expensive models. We design a novel residual block, the Global layer, that constrains the input and output features by approximately solving partial differential equations (PDEs). It yields better receptive fields than traditional convolutional blocks and thus results in shallower networks. Further, we reduce the model footprint by enforcing a novel inductive bias that formulates the output of a residual block as a spatial interpolation between high-compute anchor pixels and low-compute cheaper pixels. This results in spatially interpolated convolutional blocks (SI-CNNs) that have better compute and performance trade-offs. Finally, we propose an algorithm that enforces various distributional constraints during training in order to achieve better generalization. We refer to this scheme as distributionally constrained learning (DCL).
In the second dimension, i.e., Input-Hardness-Adaptive Models, we introduce the notion of the hardness of any input relative to any architecture. In the first dimension, a neural network allocates the same resources, such as compute, storage, and working memory, to all inputs; it inherently assumes that all examples are equally hard for the model. In this dimension, we challenge this assumption, reasoning that some inputs are relatively easy for a network to predict compared to others. Input hardness enables us to create selective classifiers wherein a low-capacity network handles simple inputs while abstaining from a prediction on the complex inputs. Next, we create hybrid models that route the hard inputs from the low-capacity abstaining network to a high-capacity expert model. We design various architectures that adhere to this hybrid inference style. Further, input hardness enables us to selectively distill the knowledge of a high-capacity model into a low-capacity model by cleverly discarding hard inputs during the distillation procedure. Finally, we conclude this thesis by sketching out various interesting future research directions that emerge as extensions of the ideas explored in this work.
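    The hybrid, input-hardness-adaptive inference described above can be sketched as follows: a cheap model predicts when its confidence clears a threshold and abstains otherwise, routing only the "hard" inputs to an expensive expert. The models, threshold and data below are invented stand-ins for the architectures studied in the thesis.

```python
# Hypothetical sketch of selective classification with routing of hard inputs
# from a low-capacity model to a high-capacity expert.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def hybrid_predict(x, cheap_model, expert_model, threshold=0.9):
    probs = softmax(cheap_model(x))
    confident = probs.max(axis=1) >= threshold        # "easy" inputs
    preds = probs.argmax(axis=1)
    if (~confident).any():                            # "hard" inputs go to the expert
        expert_probs = softmax(expert_model(x[~confident]))
        preds[~confident] = expert_probs.argmax(axis=1)
    return preds, confident

rng = np.random.default_rng(4)
x = rng.normal(size=(8, 16))
W_cheap = rng.normal(size=(16, 3))
W_expert = rng.normal(size=(16, 3))
cheap = lambda x: x @ W_cheap            # low-capacity stand-in
expert = lambda x: 2.0 * (x @ W_expert)  # high-capacity stand-in
preds, handled_locally = hybrid_predict(x, cheap, expert)
print(preds, "handled by the cheap model:", handled_locally.sum(), "of", len(x))
```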

    Rethinking FPGA Architectures for Deep Neural Network applications

    The prominence of machine learning-powered solutions has instituted an unprecedented trend of integration into virtually all applications, with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the advantages of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where latency and energy consumption are strictly limited, their architectures, optimised before the machine learning era, remain a barrier to overall efficiency and performance. Recognising this shortcoming, this thesis presents an architectural study aimed at solutions that unlock hidden potential in FPGA technology, primarily for machine learning algorithms. In particular, it shows how slight alterations to state-of-the-art architectures could significantly enhance FPGAs toward becoming more machine learning-friendly while maintaining near-promised performance for the rest of the applications. It then presents a novel systematic approach to deriving new block architectures, guided by design limitations and the characteristics of machine learning algorithms, through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to the FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Finally, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested, together with a family of new embedded blocks called MLBlocks.
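    The "nested loops over MAC operations" formulation mentioned above can be illustrated by a 2-D convolution written as explicit multiply-accumulate loops, the style of kernel such benchmarking methodologies typically start from. The sizes and layout below are arbitrary; this is not the thesis' benchmark suite.

```python
# Hypothetical sketch: a 2-D convolution as nested loops whose innermost step
# is a single multiply-accumulate (MAC) operation.
import numpy as np

def conv2d_mac(inp, weights):
    """inp: (H, W, Cin), weights: (K, K, Cin, Cout) -> (H-K+1, W-K+1, Cout)."""
    H, W, Cin = inp.shape
    K, _, _, Cout = weights.shape
    out = np.zeros((H - K + 1, W - K + 1, Cout))
    for y in range(H - K + 1):                # nested loops over output pixels,
        for x in range(W - K + 1):            # kernel window and channels: every
            for co in range(Cout):            # innermost step is one MAC
                acc = 0.0
                for ky in range(K):
                    for kx in range(K):
                        for ci in range(Cin):
                            acc += inp[y + ky, x + kx, ci] * weights[ky, kx, ci, co]
                out[y, x, co] = acc
    return out

inp = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 4)
print(conv2d_mac(inp, w).shape)   # (6, 6, 4)
```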