279,483 research outputs found
Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs
The spread of deep learning on embedded devices has prompted the development
of numerous methods to optimise the deployment of deep neural networks (DNN).
Works have mainly focused on: i) efficient DNN architectures, ii) network
optimisation techniques such as pruning and quantisation, iii) optimised
algorithms to speed up the execution of the most computational intensive layers
and, iv) dedicated hardware to accelerate the data flow and computation.
However, there is a lack of research on cross-level optimisation as the space
of approaches becomes too large to test and obtain a globally optimised
solution. Thus, leading to suboptimal deployment in terms of latency, accuracy,
and memory. In this work, we first detail and analyse the methods to improve
the deployment of DNNs across the different levels of software optimisation.
Building on this knowledge, we present an automated exploration framework to
ease the deployment of DNNs. The framework relies on a Reinforcement Learning
search that, combined with a deep learning inference framework, automatically
explores the design space and learns an optimised solution that speeds up the
performance and reduces the memory on embedded CPU platforms. Thus, we present
a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU
platforms achieving up to 4x improvement in performance and over 2x reduction
in memory with negligible loss in accuracy with respect to the BLAS
floating-point implementation
Recommended from our members
Computing infrastructure issues in distributed communications systems : a survey of operating system transport system architectures
The performance of distributed applications (such as file transfer, remote login, tele-conferencing, full-motion video, and scientific visualization) is influenced by several factors that interact in complex ways. In particular, application performance is significantly affected both by communication infrastructure factors and computing infrastructure factors. Several communication infrastructure factors include channel speed, bit-error rate, and congestion at intermediate switching nodes. Computing infrastructure factors include (among other things) both protocol processing activities (such as connection management, flow control, error detection, and retransmission) and general operating system factors (such as memory latency, CPU speed, interrupt and context switching overhead, process architecture, and message buffering). Due to a several orders of magnitude increase in network channel speed and an increase in application diversity, performance bottlenecks are shifting from the network factors to the transport system factors.This paper defines an abstraction called an "Operating System Transport System Architecture" (OSTSA) that is used to classify the major components and services in the computing infrastructure. End-to-end network protocols such as TCP, TP4, VMTP, XTP, and Delta-t typically run on general-purpose computers, where they utilize various operating system resources such as processors, virtual memory, and network controllers. The OSTSA provides services that integrate these resources to support distributed applications running on local and wide area networks.A taxonomy is presented to evaluate OSTSAs in terms of their support for protocol processing activities. We use this taxonomy to compare and contrast five general-purpose commercial and experimental operating systems including System V UNIX, BSD UNIX, the x-kernel, Choices, and Xinu
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability
An open and parallel multiresolution framework using block-based adaptive grids
A numerical approach for solving evolutionary partial differential equations
in two and three space dimensions on block-based adaptive grids is presented.
The numerical discretization is based on high-order, central finite-differences
and explicit time integration. Grid refinement and coarsening are triggered by
multiresolution analysis, i.e. thresholding of wavelet coefficients, which
allow controlling the precision of the adaptive approximation of the solution
with respect to uniform grid computations. The implementation of the scheme is
fully parallel using MPI with a hybrid data structure. Load balancing relies on
space filling curves techniques. Validation tests for 2D advection equations
allow to assess the precision and performance of the developed code.
Computations of the compressible Navier-Stokes equations for a temporally
developing 2D mixing layer illustrate the properties of the code for nonlinear
multi-scale problems. The code is open source
- …