
    Highly parallel computation

    Highly parallel computing architectures are the only means to achieve the computation rates demanded by advanced scientific problems. A decade of research has demonstrated the feasibility of such machines, and current research focuses on which architectures are best suited to which classes of problems. The architectures designated multiple instruction multiple datastream (MIMD) and single instruction multiple datastream (SIMD) have produced the best results to date; neither shows a decisive advantage for most near-homogeneous scientific problems. For scientific problems with many dissimilar parts, more speculative architectures such as neural networks or data flow may be needed.
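    A minimal sketch of the contrast the abstract draws, using NumPy's vectorised arithmetic as a stand-in for SIMD execution and Python's multiprocessing as a stand-in for MIMD execution; the saxpy kernel, array size, and worker count are illustrative assumptions, not taken from the paper.

        # Illustrative contrast between the two execution models (not from the paper):
        # SIMD: one instruction stream applied to many data elements at once.
        # MIMD: several independent instruction streams on separate data.
        import numpy as np
        from multiprocessing import Pool

        def saxpy(args):
            # An arbitrary "near-homogeneous" kernel: y = a*x + y.
            a, x, y = args
            return a * x + y

        if __name__ == "__main__":
            a = 2.0
            x = np.arange(1_000_000, dtype=np.float64)
            y = np.ones_like(x)

            # SIMD-style: a single vectorised operation over the whole array.
            simd_result = a * x + y

            # MIMD-style: independent workers, each running its own instruction
            # stream on its own chunk of the data.
            chunks = [(a, xc, yc) for xc, yc in zip(np.array_split(x, 4),
                                                    np.array_split(y, 4))]
            with Pool(4) as pool:
                mimd_result = np.concatenate(pool.map(saxpy, chunks))

            assert np.allclose(simd_result, mimd_result)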

    The development and use of variables in mathematics and computer science

    There are a wide variety of uses of variables in mathematics which we cope with in practice through conventions and tacit assumptions. Experience with computers has made us articulate, criticise and develop these assumptions much more carefully. Historically, the term 'variable quantity' was introduced in the context of describing and calculating changing quantities which corresponded to phenomena in the observable world (e.g. the velocity or fluxion of a body moving under the inverse square law). The evolution of the concept has divorced it from these roots of reference and required us to establish the formal apparatus of interpretation and valuation. While the changes considered are highly structured this may be satisfactory, but computing power invites us to cope with change in vastly more complex, unstructured situations, such as in the simulation of 'real world' processes. We relate this challenge to the distinctive differences in the use of variables in mathematics and practical computing, and we develop a general framework in which all uses of variables can be described in a unified way.
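    The distinction the abstract draws can be made concrete in a few lines; the snippet below is our own illustration under the abstract's framing, not an example from the paper: a bound mathematical variable has no identity outside the expression that binds it, while a computing variable names a mutable location that takes successive values over time.

        # Illustration only (not from the paper).

        # Mathematical usage: 'i' is a bound variable ranging over a domain;
        # it has no existence outside the expression that binds it.
        total = sum(i * i for i in range(1, 11))   # corresponds to sum_{i=1}^{10} i^2

        # Computing usage: 'velocity' names a mutable location whose value
        # changes over time, as in a simulation of an observable quantity.
        velocity = 0.0
        dt, g = 0.1, 9.81
        for _ in range(10):                  # ten time steps of free fall
            velocity = velocity + g * dt     # the *same* variable takes new values
        print(total, velocity)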

    Modeling Streams-based Variants of Ant Colony Optimisation for Parallel Systems

    Wei Cheng, Frank Penczek, Clemens Grelck, Raimund Kirner, Bernd Scheuermann, Alex Shafarenko, 'Modeling Streams-based Variants of Ant Colony Optimisation for Parallel Systems', in Proceedings: 2nd HiPEAC Workshop on Feedback-Directed Compiler Optimization for Multi-Core Architectures, Berlin, Germany, 22 January 2013.

    In this paper we present the implementation of a concurrent ant colony optimisation based solver for the combinatorial Single Machine Total Weighted Tardiness Problem (ACO-SMTWTP). We introduce S-Net, a coordination language based on dataflow principles, report on the performance of the implementation, and compare it against a sequential and a parallel implementation of the same algorithm in C. As the workload of the optimisation algorithm is highly irregular, we consider this application to be an important use-case for runtime-measurement-directed optimisations of the coordination program as much as for guiding optimisations of numerical code.
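    For orientation, here is a minimal sequential sketch of the ACO-SMTWTP construction-and-update loop that the paper parallelises; the toy instance, parameter values, and position-based pheromone model are generic ACO assumptions chosen for illustration and say nothing about the S-Net coordination itself.

        # Minimal sequential ACO for the Single Machine Total Weighted Tardiness
        # Problem (SMTWTP). Instance and parameters are illustrative; the paper's
        # contribution is coordinating such ants as a streams-based S-Net program.
        import random

        p = [3, 2, 4, 1]        # processing times
        w = [4, 5, 3, 2]        # weights
        d = [4, 6, 8, 3]        # due dates
        n = len(p)

        def twt(seq):
            # Total weighted tardiness of a job sequence.
            t, cost = 0, 0
            for j in seq:
                t += p[j]
                cost += w[j] * max(0, t - d[j])
            return cost

        tau = [[1.0] * n for _ in range(n)]   # pheromone: tau[position][job]
        best_seq, best_cost = None, float("inf")

        for _ in range(200):                  # iterations
            for _ in range(10):               # ants (run concurrently in the paper)
                remaining = list(range(n))
                seq = []
                for pos in range(n):
                    weights = [tau[pos][j] for j in remaining]
                    j = random.choices(remaining, weights=weights)[0]
                    seq.append(j)
                    remaining.remove(j)
                cost = twt(seq)
                if cost < best_cost:
                    best_seq, best_cost = seq, cost
            # Evaporate, then reinforce the best-so-far sequence.
            for pos in range(n):
                for j in range(n):
                    tau[pos][j] *= 0.9
                tau[pos][best_seq[pos]] += 1.0 / (1.0 + best_cost)

        print(best_seq, best_cost)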

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources, and as a result dedicated hardware accelerators are developed for all use cases, from hyperscale cloud environments for the training of DNNs to inference devices in embedded systems. These accelerators implement intrinsics for complex operations directly in hardware; a common example are intrinsics for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. How DNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research.

    Existing literature often works with transformation-driven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity of this task has not yet been explored, especially for less frequently used operators like Capsule Routing. Not only is the implementation of DNN operators with intrinsics challenging; their optimization on the target device is also difficult. Hardware-in-the-loop tools are often used for this problem: they use latency measurements of implementation candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessarily long optimization times.

    This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators, explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected Layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformations can lead to scaling issues. Building on these insights, this work proposes a novel method, based on a dataflow analysis, to embed hardware intrinsics into DNN operators. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations; from the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the-art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median time-to-solution is reduced by a factor of ×2.40. At the same time, the possibility of prolonged searches due to a bad initial set of implementations is reduced, improving the optimization's robustness by ×2.35.
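    The kind of data-layout transformation the thesis studies can be illustrated with the classic im2col lowering, which rewrites a convolution so that a matrix-multiplication intrinsic can compute it; the NumPy sketch below is a generic example of the transformation-driven approach, not the thesis's dataflow-based embedding method or its target accelerator.

        # Generic im2col lowering of a 2-D convolution onto a matrix-multiply
        # "intrinsic" (here plain @ / np.matmul). Shapes and names are
        # illustrative; the thesis derives such layouts via dataflow analysis.
        import numpy as np

        def conv2d_via_matmul(x, k):
            # x: (H, W) input, k: (kh, kw) kernel, valid padding, stride 1.
            H, W = x.shape
            kh, kw = k.shape
            oh, ow = H - kh + 1, W - kw + 1
            # Data-layout transformation: gather every kh*kw patch into a row.
            cols = np.empty((oh * ow, kh * kw))
            for i in range(oh):
                for j in range(ow):
                    cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
            # The operator is now a single matrix multiplication.
            return (cols @ k.ravel()).reshape(oh, ow)

        x = np.arange(25, dtype=float).reshape(5, 5)
        k = np.array([[1.0, 0.0], [0.0, -1.0]])
        out = conv2d_via_matmul(x, k)
        # Cross-check against a direct sliding-window computation.
        ref = np.array([[(x[i:i + 2, j:j + 2] * k).sum() for j in range(4)]
                        for i in range(4)])
        assert np.allclose(out, ref)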