    Binomial Checkpointing for Arbitrary Programs with No User Annotation

    Heretofore, automatic checkpointing at procedure-call boundaries, to reduce the space complexity of reverse mode, has been provided by systems like Tapenade. However, binomial checkpointing, or treeverse, has only been provided in Automatic Differentiation (AD) systems in special cases, e.g., through user-provided pragmas on DO loops in Tapenade, or as the nested taping mechanism in ADOL-C for time integration processes, which requires that user code be refactored. We present a framework for applying binomial checkpointing to arbitrary code with no special annotation or refactoring required. This is accomplished by applying binomial checkpointing directly to a program trace. This trace is produced by a general-purpose checkpointing mechanism that is orthogonal to AD.
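
The appeal of binomial checkpointing can be seen from the classical treeverse bound: with c stored checkpoints, and each step re-run at most r times, a chain of up to C(c + r, c) steps can be reversed. The sketch below only evaluates that bound. It is not the framework described in the abstract, and the function names are hypothetical.

```python
from math import comb


def max_steps(checkpoints: int, repeats: int) -> int:
    """Longest chain reversible with `checkpoints` snapshots when no
    step is re-run more than `repeats` times: C(c + r, c)."""
    return comb(checkpoints + repeats, checkpoints)


def min_repeats(steps: int, checkpoints: int) -> int:
    """Smallest repetition count that lets `checkpoints` snapshots
    cover a chain of `steps` steps."""
    if checkpoints < 1:
        raise ValueError("at least one checkpoint is required")
    repeats = 0
    while max_steps(checkpoints, repeats) < steps:
        repeats += 1
    return repeats


if __name__ == "__main__":
    # How much recomputation do 10 snapshots need for a million steps?
    print(min_repeats(1_000_000, 10))
```

Because the bound grows binomially in both arguments, a modest number of snapshots keeps the recomputation factor small even for very long chains.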

    The automation of PDE-constrained optimisation and its applications

    This thesis is concerned with the automation of solving optimisation problems constrained by partial differential equations (PDEs). Gradient-based optimisation algorithms are the key to solving optimisation problems of practical interest. The required derivatives can be efficiently computed with the adjoint approach. However, current methods for the development of adjoint models often require a significant amount of effort and expertise, in particular for non-linear time-dependent problems. This work presents a new high-level reinterpretation of algorithmic differentiation to develop adjoint models. This reinterpretation considers the discrete system as a sequence of equation solves. Applying this approach to a general finite-element framework results in an automatic and robust way of deriving and solving adjoint models. This drastically reduces the development effort compared to traditional methods. Based on this result, a new framework for rapidly defining and solving optimisation problems constrained by PDEs is developed. The user specifies the discrete optimisation problem in a compact high-level language that resembles the mathematical structure of the underlying system. All remaining steps, including parameter updates, PDE solves and derivative computations, are performed without user intervention. The framework can be applied to a wide range of governing PDEs, and interfaces to various gradient-free and gradient-based optimisation algorithms. The capabilities of this framework are demonstrated through the application to two PDE-constrained optimisation problems. The first is concerned with the optimal layout of turbines in tidal stream farms; this optimisation problem is one of the main challenges facing the marine renewable energy industry. The second application applies data assimilation to reconstruct the profile of tsunami waves based on inundation observations. This provides the first step towards the general reconstruction of tsunami signals from satellite information.
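
To make the adjoint idea concrete, here is a minimal sketch on a toy scalar time-stepping problem, where each time step plays the role of one "equation solve". The model u_{k+1} = u_k + dt (m - u_k^2), the functional J = (u_N - target)^2 / 2, and all names and default values are assumptions chosen for illustration. They are not taken from the thesis or its framework.

```python
def forward(m, u0=1.0, dt=0.1, steps=20):
    """Run u_{k+1} = u_k + dt * (m - u_k**2) and keep every state."""
    states = [u0]
    for _ in range(steps):
        u = states[-1]
        states.append(u + dt * (m - u * u))
    return states


def functional(m, target=0.5):
    """J(m) = 0.5 * (u_N - target)**2 for the toy model above."""
    return 0.5 * (forward(m)[-1] - target) ** 2


def adjoint_gradient(m, target=0.5, dt=0.1):
    """dJ/dm from one forward sweep plus one reverse (adjoint) sweep."""
    states = forward(m, dt=dt)
    lam = states[-1] - target           # dJ/du_N
    grad = 0.0
    for u in reversed(states[:-1]):     # u_{N-1}, ..., u_0
        grad += lam * dt                # direct dependence of the step on m
        lam *= 1.0 - 2.0 * dt * u       # pull the adjoint back through the step
    return grad


if __name__ == "__main__":
    m, eps = 0.5, 1e-6
    finite_diff = (functional(m + eps) - functional(m - eps)) / (2 * eps)
    print(adjoint_gradient(m), finite_diff)
```

The reverse sweep costs about as much as the forward run regardless of how many parameters there are, which is why adjoints suit gradient-based optimisation with many controls.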

    Scaling a convolutional neural network based Flower counting application in a distributed GPU cluster

    Taking advantage of modern data acquisition techniques, researchers of P2IRC at the University of Saskatchewan developed an application to monitor the status of flower growth during different phases of the blooming period and for yield prediction of canola crops. Although the application could predict a nearly accurate number of flowers in a few scenarios, its inability to function under challenging conditions, such as misinterpreting sun reflection or roadside dust as flowers, motivated the researchers to find an alternative approach to counting flowers. Besides being more accurate, the new application should infer the number of flowers faster and scale in distributed environments. With these goals in mind, this thesis develops and evaluates a convolutional neural network (CNN) based flower counting application, taking inspiration from two previous works in which CNNs were used to count heads in dense crowds and to predict the number of bacterial cells from medical imagery. The application also addresses the performance and accuracy goals mentioned above. Two challenges of using the neural network are (a) the training needs a large volume of data to converge to a low error and (b) the training is computationally expensive and takes a long time to complete. To address the first challenge, experiments were run with both "ground truth" estimated using a modified version of the previous flower counter and ground truth from manual annotation. To address the long training time, two distributed versions of the proposed application were created, based on two distributed architectures called Parameter Server and Ring-AllReduce. Moreover, a detailed explanation of the proposed CNN's architecture, along with its memory footprint and GPU utilization, is presented as an in-depth case study to help trace the model's memory consumption during training. Across different sets of experiments, the new flower counter was observed to be more accurate than its previous version, and both distributed versions reduced the total completion time, scaling linearly as more workers were added to run the training. The Ring-AllReduce version performed slightly better than the Parameter Server version, but the differences were not substantial.
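
For readers unfamiliar with Ring-AllReduce, the following single-process simulation sketches the communication pattern: a scatter-reduce phase followed by an all-gather phase, each taking n - 1 steps around the ring. It assumes each of the n workers splits its gradient into exactly n chunks (plain numbers here). The function name is hypothetical and this is not the thesis implementation.

```python
def ring_allreduce(worker_chunks):
    """Simulate ring all-reduce (sum) for n workers holding n chunks each."""
    n = len(worker_chunks)
    data = [list(chunks) for chunks in worker_chunks]

    # Phase 1, scatter-reduce: at each step worker i passes one chunk to
    # worker i + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        messages = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (idx, value) in enumerate(messages):
            data[(i + 1) % n][idx] += value

    # Worker i now holds the complete sum for chunk (i + 1) % n.
    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        messages = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (idx, value) in enumerate(messages):
            data[(i + 1) % n][idx] = value

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(grads))
```

Each worker only ever talks to its ring neighbour and sends one chunk per step, so the traffic per worker stays roughly constant as workers are added, unlike a central parameter server that must receive every worker's gradients.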

    Transactional Data Structures

    Survey on Large Scale Neural Network Training

    Modern Deep Neural Networks (DNNs) require significant memory to store weights, activations, and other intermediate tensors during training. Hence, many models don't fit on one GPU device or can be trained using only a small per-GPU batch size. This survey provides a systematic overview of the approaches that enable more efficient DNN training. We analyze techniques that save memory and make good use of computation and communication resources on architectures with a single or several GPUs. We summarize the main categories of strategies and compare strategies within and across categories. Along with approaches proposed in the literature, we discuss available implementations.

    Transactional data structures

    Concurrent programming is difficult and the effort is rarely rewarded by faster execution. The concurrency problem arises because information cannot pass instantly between processors, resulting in temporal uncertainty. This thesis explores the idea that immutable data and distributed concurrency control can be combined to allow scalable concurrent execution and make concurrent programming easier. A concurrent system that does not impose a global ordering on events lends itself to a scalable distributed implementation. A concurrent programming environment in which the ordering of events affecting an object is enforced locally has intuitive concurrent semantics. This thesis introduces Transactional Data Structures, which are data structures that permit access to past versions, although not all accesses succeed. These data structures form the basis of a concurrent programming solution that supports database-type transactions in memory. Transactional Data Structures permit non-blocking concurrent access to familiar abstract data types such as deques, maps, vectors and priority queues. Using these data structures, a programmer can write a concurrent program in C without having to reason about locks. The solution is evaluated by comparing the performance of a concurrent algorithm to calculate the minimum spanning tree of a graph with that of a similar algorithm which uses Transactional Memory, and by comparing a non-blocking Producer Consumer Queue with its blocking counterpart.
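
The phrase "permit access to past versions, although not all accesses succeed" can be illustrated with a small versioned map in which every committed version is kept unchanged and a commit based on a stale version is rejected. This is a single-threaded sketch of the semantics only. The class and method names are hypothetical, it is written in Python rather than the C used in the thesis, and it does not reproduce the thesis's non-blocking implementation.

```python
class VersionedMap:
    """Map that keeps every committed version and rejects stale commits."""

    def __init__(self):
        self._versions = [{}]            # version 0 is the empty map

    def latest(self):
        """Return the number of the most recent committed version."""
        return len(self._versions) - 1

    def read(self, version, key, default=None):
        """Read a key as it was in a given (possibly past) version."""
        return self._versions[version].get(key, default)

    def commit(self, based_on, updates):
        """Try to publish `updates` on top of version `based_on`.

        Returns False if another commit has happened since `based_on`,
        in which case the caller is expected to re-read and retry.
        """
        if based_on != self.latest():
            return False
        new_version = dict(self._versions[-1])   # copy, never mutate old versions
        new_version.update(updates)
        self._versions.append(new_version)
        return True


if __name__ == "__main__":
    store = VersionedMap()
    start = store.latest()
    print(store.commit(start, {"a": 1}))   # first writer succeeds
    print(store.commit(start, {"a": 2}))   # second writer used a stale version
    print(store.read(start, "a"), store.read(store.latest(), "a"))
```

Because old versions are never modified, a reader holding an earlier version number sees a consistent snapshot without taking a lock. Only writers can fail, and they recover by retrying against the latest version.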