Predicting the Performance of a Computing System with Deep Networks
Predicting the performance and energy consumption of computing hardware is critical for many modern applications, informing procurement decisions, deployment decisions, and autonomic scaling. Existing approaches to understanding the performance of hardware largely focus on benchmarking: leveraging standardised workloads which seek to be representative of an end-user's needs. Two key challenges are present: benchmark workloads may not be representative of an end-user's workload, and benchmark scores are not easily obtained for all hardware. Within this paper, we demonstrate the potential to build Deep Learning models to predict benchmark scores for unseen hardware. We undertake our evaluation with the openly available SPEC 2017 benchmark results. We evaluate three different networks: one fully-connected network along with two Convolutional Neural Networks (one bespoke and one ResNet-inspired), and demonstrate impressive R² scores of 0.96, 0.98 and 0.94 respectively.
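As a concrete illustration of this kind of regression setup, the following minimal Python sketch (with hypothetical hardware features and synthetic data, not the paper's SPEC 2017 pipeline) fits a fully-connected network to hardware specifications and reports an R² score:

    # Minimal sketch: fully-connected regressor over hardware specifications.
    # Features and data are illustrative placeholders, not SPEC 2017 results.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    rng = np.random.default_rng(0)
    # Hypothetical features: cores, base clock (GHz), cache (MB), memory (GB)
    X = rng.uniform([2, 1.0, 4, 8], [64, 4.0, 64, 512], size=(500, 4))
    y = 0.8 * X[:, 0] * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 2, 500)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=0)
    model.fit(X_tr, y_tr)
    print(f"R² on held-out hardware: {r2_score(y_te, model.predict(X_te)):.2f}")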
X-MAP: A Performance Prediction Tool for Porting Algorithms and Applications to Accelerators
Most modern high-performance computing systems comprise one or more accelerators with varying architectures in addition to traditional multicore Central Processing Units (CPUs). Examples of these accelerators include Graphics Processing Units (GPUs) and Intel's Many Integrated Cores architecture, the Xeon Phi (PHI). These architectures provide massive parallel computation capabilities, which yield substantial performance benefits over traditional CPUs for a variety of scientific applications. Accelerators are not interchangeable, because each has its own unique architecture, and this difference in the underlying architecture plays a crucial role in determining whether a given accelerator will provide a significant speedup over its competition. A further differentiating factor is the programming language used to program them: for example, Nvidia GPUs can be programmed using Compute Unified Device Architecture (CUDA) and OpenCL, while Intel Xeon PHIs can be programmed using OpenMP and OpenCL. The choice of programming language also plays a critical role in the speedup obtained, depending on how close the language is to the hardware and on the level of optimization. It is therefore very difficult for an application developer to choose the ideal accelerator to achieve the best possible speedup. In light of this, we present an easy-to-use Graphical User Interface (GUI) tool called X-MAP, a performance prediction tool for porting algorithms and applications to accelerators, which encompasses a Machine Learning based inference model to predict the performance of an application on a number of well-known accelerators and, at the same time, predict the best architecture and programming language for the application. We do this by collecting hardware counters from a given application and predicting run time by providing this data as input to a Neural Network Regressor based inference model. We predict the architecture and associated programming language by providing the hardware counters as inputs to an inference model based on a Random Forest Classification Model. Finally, a mean absolute prediction error of 8.52, together with features such as syntax highlighting for multiple programming languages, a function-wise breakdown of the entire application to understand bottlenecks, and the ability for end users to submit their own prediction models to further improve the system, makes X-MAP a unique tool with a significant edge over existing performance prediction solutions.
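The two-model design described above can be sketched as follows (illustrative Python with synthetic counters; the feature set, labels, and model configurations are assumptions, not X-MAP's actual code):

    # Sketch: an NN regressor predicts run time from hardware counters, and a
    # random forest classifier recommends an architecture/language pair.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    counters = rng.random((300, 10))            # e.g. cache misses, branch rates
    runtimes = counters @ rng.random(10) + 0.5  # toy run times (seconds)
    targets = rng.integers(0, 3, 300)           # 0=GPU/CUDA, 1=PHI/OpenMP, 2=OpenCL

    runtime_model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                                 random_state=1).fit(counters, runtimes)
    arch_model = RandomForestClassifier(random_state=1).fit(counters, targets)

    new_app = rng.random((1, 10))  # counters from an unseen application
    print("predicted run time:", runtime_model.predict(new_app)[0])
    print("recommended target:", arch_model.predict(new_app)[0])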
Analyzing the Stability of Relative Performance Differences Between Cloud and Embedded Environments
There has been a shift towards the software-defined vehicle in the automotive industry in recent years. In order to enable the correct behaviour of critical as well as non-critical software functions, like those found in Autonomous Driving/Driver Assistance subsystems, extensive software testing needs to be performed. Using embedded hardware for these tests is either very expensive or takes a prohibitively long time relative to the fast development cycles in the industry. To reduce development bottlenecks, test frameworks executed in cloud environments that leverage the scalability of the cloud are an essential part of the development process. However, relying on more performant cloud hardware for the majority of tests means that performance problems only become apparent in later development phases, when software is deployed to the real target. If the performance relation between executing in the cloud and on the embedded target can be approximated with sufficient precision, the expressiveness of the executed tests can be improved. Moreover, as a fully integrated system consists of a large number of software components that, at any given time, exhibit an unknown mix of best-/average-/worst-case behaviour, it is critical to know whether the performance relation differs depending on the inputs. In this paper, we examine the relative performance differences between a physical ARM-based chipset and an ARM-based cloud virtual machine, using a generic benchmark and two algorithms representative of typical automotive workloads, modified to generate best-/average-/worst-case behaviour in a reproducible and controlled way. We determine that the performance difference factor is between 1.8 and 3.6 for synthetic benchmarks and around 2.0-2.8 for the more representative benchmarks. These results indicate that it may be possible to relate cloud to embedded performance with acceptable precision, especially when workload characterization is taken into account.
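The core quantity here is a simple ratio, as in the following Python sketch (all numbers are illustrative, chosen only to fall within the reported ranges):

    # Sketch: estimate the embedded-to-cloud performance factor per workload case.
    measurements = {
        # case: (embedded time in s, cloud time in s) -- illustrative values
        "synthetic/best":  (10.8, 6.0),
        "synthetic/worst": (21.6, 6.0),
        "automotive/avg":  (14.0, 5.6),
    }
    for case, (t_embedded, t_cloud) in measurements.items():
        factor = t_embedded / t_cloud  # how much slower the embedded target is
        print(f"{case}: factor = {factor:.1f}")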
Learning Independent Program and Architecture Representations for Generalizable Performance Modeling
This paper proposes PerfVec, a novel deep learning-based performance modeling
framework that learns high-dimensional, independent/orthogonal program and
microarchitecture representations. Once learned, a program representation can
be used to predict its performance on any microarchitecture, and likewise, a
microarchitecture representation can be applied in the performance prediction
of any program. Additionally, PerfVec yields a foundation model that captures
the performance essence of instructions, which can be directly used by
developers in numerous performance modeling tasks without incurring its
training cost. The evaluation demonstrates that PerfVec is more general,
efficient, and accurate than previous approaches.
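Conceptually, the factored representation can be pictured as below (assumed vector shapes and a simple bilinear combination rule of my choosing; PerfVec's actual architecture is more involved):

    # Sketch: once a program vector p and a microarchitecture vector m are
    # learned independently, any (p, m) pair can be combined into a prediction.
    import numpy as np

    rng = np.random.default_rng(2)
    D = 16                   # representation dimensionality (illustrative)
    W = rng.random((D, D))   # stand-in for learned interaction weights

    programs = {name: rng.random(D) for name in ["sort", "fft", "stencil"]}
    uarchs = {name: rng.random(D) for name in ["big-core", "little-core"]}

    def predict_perf(p, m):
        """Bilinear combination of independent representations."""
        return float(p @ W @ m)

    # Any program vector pairs with any microarchitecture vector:
    for prog, p in programs.items():
        for ua, m in uarchs.items():
            print(f"{prog} on {ua}: {predict_perf(p, m):.3f}")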
LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning
Machine learning has recently gained traction as a way to overcome the slow
accelerator generation and implementation process on an FPGA. It can be used to
build performance and resource usage models that enable fast early-stage design
space exploration. However, these ML-based models face three key limitations.
First, training requires large amounts of data (features
extracted from design synthesis and implementation tools), which is
cost-inefficient because of the time-consuming accelerator design and
implementation process. Second, a model trained for a specific environment
cannot predict performance or resource usage for a new, unknown environment. In
a cloud system, renting a platform for data collection to build an ML model can
significantly increase the total cost of ownership (TCO) of a system. Third,
ML-based models trained using a limited number of samples are prone to
overfitting. To overcome these limitations, we propose LEAPER, a transfer
learning-based approach for prediction of performance and resource usage in
FPGA-based systems. The key idea of LEAPER is to transfer an ML-based
performance and resource usage model trained for a low-end edge environment to
a new, high-end cloud environment to provide fast and accurate predictions for
accelerator implementation. Experimental results show that LEAPER (1) provides,
on average across six workloads and five FPGAs, 85% accuracy when we use our
transferred model for prediction in a cloud environment with 5-shot learning
and (2) reduces design-space exploration time for accelerator implementation on
an FPGA by 10x, from days to only a few hours.
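The few-shot transfer idea can be sketched as follows (synthetic data and a simple output-calibration transfer step of my choosing; LEAPER's actual features come from synthesis and implementation tools and its transfer mechanism differs):

    # Sketch: train a base model on plentiful edge-FPGA samples, then adapt it
    # to a cloud FPGA using only 5 labelled samples (5-shot).
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(3)
    w = rng.random(8)
    X_edge = rng.random((400, 8))   # design features, low-end edge platform
    y_edge = X_edge @ w             # toy performance labels

    base = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000, random_state=3)
    base.fit(X_edge, y_edge)        # cheap-to-collect base model

    # 5-shot adaptation: learn a linear correction (scale + offset) mapping
    # edge-model predictions to the cloud environment.
    X_c = rng.random((5, 8))
    y_c = 2.5 * (X_c @ w) + 0.3     # shifted target environment
    pred_c = base.predict(X_c)
    A = np.column_stack([pred_c, np.ones(5)])
    scale, offset = np.linalg.lstsq(A, y_c, rcond=None)[0]

    X_new = rng.random((1, 8))
    print("adapted prediction:", scale * base.predict(X_new)[0] + offset)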