110 research outputs found

    Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

    Full text link

    On the design of architecture-aware algorithms for emerging applications

    Get PDF
    This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology. We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in the problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also find several limitations of current system software and architectures and directions to improve those. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation participates in the efforts by providing benchmarks and suggestions to improve system software and architectures.Ph.D.Committee Chair: Bader, David; Committee Member: Hong, Bo; Committee Member: Riley, George; Committee Member: Vuduc, Richard; Committee Member: Wills, Scot

    Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs

    Full text link
    The spread of deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNN). Works have mainly focused on: i) efficient DNN architectures, ii) network optimisation techniques such as pruning and quantisation, iii) optimised algorithms to speed up the execution of the most computational intensive layers and, iv) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on cross-level optimisation as the space of approaches becomes too large to test and obtain a globally optimised solution. Thus, leading to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyse the methods to improve the deployment of DNNs across the different levels of software optimisation. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs. The framework relies on a Reinforcement Learning search that, combined with a deep learning inference framework, automatically explores the design space and learns an optimised solution that speeds up the performance and reduces the memory on embedded CPU platforms. Thus, we present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory with negligible loss in accuracy with respect to the BLAS floating-point implementation

    Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core SIMD CPUs

    Get PDF
    Optimizing sparse matrix–vector multiplication (SpMV) is challenging due to the non-uniform distribution of the non-zero elements of the sparse matrix. The best-performing SpMV format changes depending on the input matrix and the underlying architecture, and there is no “one-size-fit-for-all” format. A hybrid scheme combining multiple SpMV storage formats allows one to choose an appropriate format to use for the target matrix and hardware. However, existing hybrid approaches are inadequate for utilizing the SIMD cores of modern multi-core CPUs with SIMDs, and it remains unclear how to best mix different SpMV formats for a given matrix. This paper presents a new hybrid storage format for sparse matrices, specifically targeting multi-core CPUs with SIMDs. Our approach partitions the target sparse matrix into two segmentations based on the regularities of the memory access pattern, where each segmentation is stored in a format suitable for its memory access patterns. Unlike prior hybrid storage schemes that rely on the user to determine the data partition among storage formats, we employ machine learning to build a predictive model to automatically determine the partition threshold on a per matrix basis. Our predictive model is first trained off line, and the trained model can be applied to any new, unseen sparse matrix. We apply our approach to 956 matrices and evaluate its performance on three distinct multi-core CPU platforms: a 72-core Intel Knights Landing (KNL) CPU, a 128-core AMD EPYC CPU, and a 64-core Phytium ARMv8 CPU. Experimental results show that our hybrid scheme, combined with the predictive model, outperforms the best-performing alternative by 2.9%, 17.5% and 16% on average on KNL, AMD, and Phytium, respectively

    Libra: Achieving Efficient Instruction- and Data- Parallel Execution for Mobile Applications.

    Full text link
    Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing richer user experiences and compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, and voice interfaces. To meet these goals, the core computing capabilities of the smart phone must be scaled. But, the energy budgets are increasing at a much lower rate, thus fundamental improvements in computing efficiency must be garnered. To meet this challenge, computer architects employ hardware accelerators in the form of SIMD and VLIW. Single-instruction multiple-data (SIMD) accelerators provide high degrees of scalability for applications rich in data-level parallelism (DLP). Very long instruction word (VLIW) accelerators provide moderate scalability for applications with high degrees of instruction-level parallelism (ILP). Unfortunately, applications are not so nicely partitioned into two groups: many applications have some DLP, but also contain significant fractions of code with low trip count loops, complex control/data dependences, or non-uniform execution behavior for which no DLP exists. Therefore, a more adaptive accelerator is required to be able to deploy resources as needed: exploit DLP on SIMD when it’s available, but fall back to ILP on the same hardware when necessary. In this thesis, we first focus on various compiler solutions that solve inefficiency problem in both VLIW and SIMD accelerators. For SIMD accelerators, a new vectorization pass, called SIMD Defragmenter, is introduced to uncover hidden DLP using subgraph identification in SIMD accelerators. CGRA express effectively accelerates sequential code regions using a bypass network in VLIW accelerators, and Resource Recycling leverages stream-graph modulo scheduling technique for scheduling of multiple code regions in multi-core accelerators. Second, we propose the new scalable multicore accelerator referred to as Libra for mobile systems, which can support execution of code regions having both DLP and ILP, as well as hybrid combinations of the two. We believe that as industry requires higher performance, the proposed flexible accelerator and compiler support will put more resources to work in order to meet the performance and power efficiency requirements.PHDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/99840/1/yjunpark_1.pd

    Accelerating Deep Learning Inference on Mobile Systems

    Get PDF
    Artificial Intelligence on the edge is a matter of great importance towards the enhancement of smart devices that rely on operations with real-time constraints. We present PolimiDL, a framework for the acceleration of Deep Learning on mobile and embedded systems with limited resources and heterogeneous architectures. Experimental results show competitive results with respect to TensorFlow Lite for the execution of small models

    Video Coding Performance

    Get PDF
    • …
    corecore