Search CORE

4,042 research outputs found

Optimise web browsing on heterogeneous mobile platforms:a machine learning based approach

Author: Gao Ling
Ren Jie
Wang Hai
Wang Zheng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/10/2017
Field of study

Web browsing is an activity that billions of mobile users perform on a daily basis. Battery life is a primary concern to many mobile users who often find their phone has died at most inconvenient times. The heterogeneous mobile architecture is a solution for energy-efficient mobile web browsing. However, the current mobile web browsers rely on the operating system to exploit the underlying architecture, which has no knowledge of the individual web workload and often leads to poor energy efficiency. This paper describes an automatic approach to render mobile web workloads for performance and energy efficiency. It achieves this by developing a machine learning based approach to predict which processor to use to run the web browser rendering engine and at what frequencies the processor cores of the system should operate. Our predictor learns offline from a set of training web workloads. The built predictor is then integrated into the browser to predict the optimal processor configuration at runtime, taking into account the web workload characteristics and the optimisation goal: whether it is load time, energy consumption or a trade-off between them. We evaluate our approach on a representative ARM big.LITTLE mobile architecture using the hottest 500 webpages. Our approach achieves 80% of the performance delivered by an ideal predictor. We obtain, on average, 45%, 63.5% and 81% improvement respectively for load time, energy consumption and the energy delay product, when compared to the Linux governo

Lancaster E-Prints

Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

Author: Acquaviva A.
Barchi F.
Bartolini A.
Parisi E.
Tagliavini G.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021
Field of study

The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Adaptive optimization for OpenCL programs on embedded heterogeneous systems

Author: Cho Y.
Garzón E.
Grewe D.
Imes C.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 21/06/2017
Field of study

Heterogeneous multi-core architectures consisting of CPUs and GPUs are commonplace in today’s embedded systems. These architectures offer potential for energy efficient computing if the application task is mapped to the right core. Realizing such potential is challenging due to the complex and evolving nature of hardware and applications. This paper presents an automatic approach to map OpenCL kernels onto heterogeneous multi-cores for a given optimization criterion – whether it is faster runtime, lower energy consumption or a trade-off between them. This is achieved by developing a machine learning based approach to predict which processor to use to run the OpenCL kernel and the host program, and at what frequency the processor should operate. Instead of hand-tuning a model for each optimization metric, we use machine learning to develop a unified framework that first automatically learns the optimization heuristic for each metric off-line, then uses the learned knowledge to schedule OpenCL kernels at runtime based on code and runtime information of the program. We apply our approach to a set of representative OpenCL benchmarks and evaluate it on an ARM big.LITTLE mobile platform. Our approach achieves over 93% of the performance delivered by a perfect predictor.We obtain, on average, 1.2x, 1.6x, and 1.8x improvement respectively for runtime, energy consumption and the energy delay product when compared to a comparative heterogeneous-aware OpenCL task mapping scheme

Crossref

Lancaster E-Prints

Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems

Author: Fang J
Huang C
Juan C
Pan X
Sun X
Tang T
Wang H
Wang Z
Wu F
Xie M
Yuan Y
Zheng W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 03/10/2019
Field of study

MPI libraries are widely used in applications of high performance computing. Yet, effective tuning of MPI collectives on large parallel systems is an outstanding challenge. This process often follows a trial-and-error approach and requires expert insights into the subtle interactions between software and the underlying hardware. This paper presents an empirical approach to choose and switch MPI communication algorithms at runtime to optimize the application performance. We achieve this by first modeling offline, through microbenchmarks, to find how the runtime parameters with different message sizes affect the choice of MPI communication algorithms. We then apply the knowledge to automatically optimize new unseen MPI programs. We evaluate our approach by applying it to NPB and HPCC benchmarks on a 384-node computer cluster of the Tianhe-2 supercomputer. Experimental results show that our approach achieves, on average, 22.7% (up to 40.7%) improvement over the default setting

Crossref

White Rose Research Online