
    Predicting the Performance of a Computing System with Deep Networks

    Predicting the performance and energy consumption of computing hardware is critical for many modern applications. This informs procurement decisions, deployment decisions, and autonomic scaling. Existing approaches to understanding the performance of hardware largely focus on benchmarking – leveraging standardised workloads which seek to be representative of an end-user's needs. Two key challenges are present: benchmark workloads may not be representative of an end-user's workload, and benchmark scores are not easily obtained for all hardware. Within this paper, we demonstrate the potential to build Deep Learning models to predict benchmark scores for unseen hardware. We undertake our evaluation with the openly available SPEC 2017 benchmark results. We evaluate three different networks: one fully-connected network along with two Convolutional Neural Networks (one bespoke and one ResNet-inspired), and demonstrate impressive R² scores of 0.96, 0.98 and 0.94 respectively.
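
    As a sketch of the kind of model this abstract describes, the snippet below trains a small fully-connected regressor that maps hardware specification features to a benchmark score. The feature names, the synthetic data, and the network size are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch, assuming hypothetical hardware descriptors and synthetic targets;
# real targets would come from published SPEC 2017 results.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical descriptors: core count, base clock (GHz), L3 cache (MB), DRAM (GB)
X = rng.uniform([2, 1.5, 4, 8], [64, 4.0, 64, 512], size=(500, 4))
# Synthetic stand-in for a SPEC-style benchmark score
y = 0.8 * X[:, 0] * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 2, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)
print("R^2 on held-out hardware:", round(r2_score(y_test, model.predict(X_test)), 3))
```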

    X-MAP A Performance Prediction Tool for Porting Algorithms and Applications to Accelerators

    Most modern high-performance computing systems comprise one or more accelerators with varying architectures in addition to traditional multicore Central Processing Units (CPUs). Examples of these accelerators include Graphics Processing Units (GPUs) and Intel's Many Integrated Cores architecture called Xeon Phi (PHI). These architectures provide massive parallel computation capabilities, which provide substantial performance benefits over traditional CPUs for a variety of scientific applications. Accelerators are not all alike, because each has its own unique architecture. This difference in the underlying architecture plays a crucial role in determining whether a given accelerator will provide a significant speedup over its competition. In addition to the architecture itself, another differentiating factor for these accelerators is the programming language used to program them. For example, Nvidia GPUs can be programmed using Compute Unified Device Architecture (CUDA) and OpenCL, while Intel Xeon PHIs can be programmed using OpenMP and OpenCL. The choice of programming language also plays a critical role in the speedup obtained, depending on how close the language is to the hardware and on the level of optimization. It is therefore very difficult for an application developer to choose the ideal accelerator to achieve the best possible speedup. In light of this, we present an easy-to-use Graphical User Interface (GUI) tool called X-MAP, a performance prediction tool for porting algorithms and applications to accelerators, which encompasses a Machine Learning based inference model to predict the performance of an application on a number of well-known accelerators and, at the same time, predict the best architecture and programming language for the application. We do this by collecting hardware counters from a given application and predicting run time by providing this data as inputs to a Neural Network Regressor based inference model. We predict the architecture and associated programming language by providing the hardware counters as inputs to an inference model based on a Random Forest Classification Model. Finally, a mean absolute prediction error of 8.52, together with features such as syntax highlighting for multiple programming languages, a function-wise breakdown of the entire application to understand bottlenecks, and the ability for end users to submit their own prediction models to further improve the system, makes X-MAP a unique tool with a significant edge over existing performance prediction solutions.
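
    The two inference stages described above can be sketched roughly as follows: a neural-network regressor mapping hardware counters to run time, and a random-forest classifier mapping the same counters to the best accelerator/language pair. The counter names, training data, and class labels are assumptions for illustration, not X-MAP's real training set.

```python
# Illustrative sketch of a counter-based run-time regressor plus an
# accelerator/language classifier; all data below is synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical hardware-counter profile per application:
# [instructions, branch_misses, cache_misses, flops, mem_bytes]
counters = rng.lognormal(mean=10, sigma=1, size=(300, 5))

# Synthetic training targets (real ones would come from measured runs)
runtime_s = counters[:, 0] * 1e-4 + counters[:, 2] * 5e-4 + rng.normal(0, 0.5, 300)
best_target = rng.choice(["GPU+CUDA", "GPU+OpenCL", "PHI+OpenMP", "PHI+OpenCL"], 300)

runtime_model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                             random_state=1).fit(counters, runtime_s)
target_model = RandomForestClassifier(n_estimators=100,
                                      random_state=1).fit(counters, best_target)

new_app = counters[:1]  # counters collected for a new application
print("predicted run time (s):", runtime_model.predict(new_app)[0])
print("suggested accelerator/language:", target_model.predict(new_app)[0])
```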

    Analyzing the Stability of Relative Performance Differences Between Cloud and Embedded Environments

    There has been a shift towards the software-defined vehicle in the automotive industry in recent years. In order to enable the correct behaviour of critical as well as non-critical software functions, like those found in Autonomous Driving/Driver Assistance subsystems, extensive software testing needs to be performed. The usage of embedded hardware for these tests is either very expensive or takes a prohibitively long time in relation to the fast development cycles in the industry. To reduce development bottlenecks, test frameworks executed in cloud environments that leverage the scalability of the cloud are an essential part of the development process. However, relying on more performant cloud hardware for the majority of tests means that performance problems will only become apparent in later development phases when software is deployed to the real target. If the performance relation between executing in the cloud and on the embedded target can be approximated with sufficient precision, the expressiveness of the executed tests can be improved. Moreover, as a fully integrated system consists of a large number of software components that, at any given time, exhibit an unknown mix of best-/average-/worst-case behaviour, it is critical to know whether the performance relation differs depending on the inputs. In this paper, we examine the relative performance differences between a physical ARM-based chipset and an ARM-based cloud virtual machine, using a generic benchmark and two algorithms representative of typical automotive workloads, modified to generate best-/average-/worst-case behaviour in a reproducible and controlled way, and assess the performance differences. We determine that the performance difference factor is between 1.8 and 3.6 for synthetic benchmarks and around 2.0-2.8 for more representative benchmarks. These results indicate that it may be possible to relate cloud to embedded performance with acceptable precision, especially when workload characterization is taken into account.
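
    The core quantity studied here, the performance factor between embedded and cloud run times per workload and per input case, can be computed as in the sketch below; the timing numbers are placeholders, not measurements from the paper.

```python
# Compute the embedded/cloud performance factor per benchmark and per
# best/average/worst-case input. All timings below are hypothetical.
embedded_runtime_s = {  # measured on the physical ARM-based target (placeholder values)
    ("sorting", "best"): 1.2, ("sorting", "average"): 2.5, ("sorting", "worst"): 4.1,
    ("matrix",  "best"): 0.9, ("matrix",  "average"): 1.8, ("matrix",  "worst"): 3.0,
}
cloud_runtime_s = {      # measured on the ARM-based cloud VM (placeholder values)
    ("sorting", "best"): 0.5, ("sorting", "average"): 1.0, ("sorting", "worst"): 1.6,
    ("matrix",  "best"): 0.4, ("matrix",  "average"): 0.8, ("matrix",  "worst"): 1.2,
}

for key in sorted(embedded_runtime_s):
    factor = embedded_runtime_s[key] / cloud_runtime_s[key]
    print(f"{key[0]:8s} {key[1]:8s} embedded/cloud factor = {factor:.2f}")

# A factor that stays stable across the best/average/worst cases would suggest
# that cloud timings can be scaled to approximate embedded timings for that workload.
```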

    Learning Independent Program and Architecture Representations for Generalizable Performance Modeling

    This paper proposes PerfVec, a novel deep learning-based performance modeling framework that learns high-dimensional, independent/orthogonal program and microarchitecture representations. Once learned, a program representation can be used to predict its performance on any microarchitecture, and likewise, a microarchitecture representation can be applied in the performance prediction of any program. Additionally, PerfVec yields a foundation model that captures the performance essence of instructions, which can be directly used by developers in numerous performance modeling related tasks without incurring its training cost. The evaluation demonstrates that PerfVec is more general, efficient, and accurate than previous approaches.
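
    One way to picture the decomposition described above is a two-tower model: one encoder produces a program representation, another produces a microarchitecture representation, and a small head combines them into a performance estimate. The dimensions, layer sizes, and training objective below are illustrative assumptions, not the paper's actual architecture.

```python
# Conceptual two-tower sketch: independent program and microarchitecture encoders
# whose outputs are combined to predict performance.
import torch
import torch.nn as nn

class TwoTowerPerfModel(nn.Module):
    def __init__(self, prog_feat_dim=32, uarch_feat_dim=16, rep_dim=64):
        super().__init__()
        # Maps program features to a microarchitecture-independent representation
        self.prog_encoder = nn.Sequential(
            nn.Linear(prog_feat_dim, 128), nn.ReLU(), nn.Linear(128, rep_dim))
        # Maps microarchitecture parameters to a program-independent representation
        self.uarch_encoder = nn.Sequential(
            nn.Linear(uarch_feat_dim, 128), nn.ReLU(), nn.Linear(128, rep_dim))
        # Combines the two representations into a performance estimate
        self.head = nn.Sequential(
            nn.Linear(2 * rep_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, prog_feats, uarch_feats):
        p = self.prog_encoder(prog_feats)
        u = self.uarch_encoder(uarch_feats)
        return self.head(torch.cat([p, u], dim=-1)).squeeze(-1)

model = TwoTowerPerfModel()
prog = torch.randn(8, 32)    # placeholder program features
uarch = torch.randn(8, 16)   # placeholder microarchitecture descriptors
print(model(prog, uarch).shape)  # torch.Size([8]): one prediction per program/uarch pair
```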

    LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning

    Machine learning has recently gained traction as a way to overcome the slow accelerator generation and implementation process on an FPGA. It can be used to build performance and resource usage models that enable fast early-stage design space exploration. However, such ML-based approaches have several limitations. First, training requires large amounts of data (features extracted from design synthesis and implementation tools), which is cost-inefficient because of the time-consuming accelerator design and implementation process. Second, a model trained for a specific environment cannot predict performance or resource usage for a new, unknown environment. In a cloud system, renting a platform for data collection to build an ML model can significantly increase the total cost of ownership (TCO) of a system. Third, ML-based models trained using a limited number of samples are prone to overfitting. To overcome these limitations, we propose LEAPER, a transfer learning-based approach for prediction of performance and resource usage in FPGA-based systems. The key idea of LEAPER is to transfer an ML-based performance and resource usage model trained for a low-end edge environment to a new, high-end cloud environment to provide fast and accurate predictions for accelerator implementation. Experimental results show that LEAPER (1) provides, on average across six workloads and five FPGAs, 85% accuracy when we use our transferred model for prediction in a cloud environment with 5-shot learning and (2) reduces design-space exploration time for accelerator implementation on an FPGA by 10x, from days to only a few hours.
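
    A rough sketch of the few-shot transfer idea follows: a model trained on measurements from a low-end edge FPGA is adapted to a cloud FPGA using only five measured cloud samples. The linear correction step used here is a simplification for illustration, and the feature names and data are synthetic assumptions rather than LEAPER's actual method.

```python
# 5-shot transfer sketch: train a base model on edge-FPGA data, then fit a small
# correction on five samples measured in the target cloud environment.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Hypothetical design features: LUT count, DSP count, BRAM blocks, clock target (MHz)
X_edge = rng.uniform([1e3, 8, 4, 100], [5e4, 512, 256, 300], size=(400, 4))
y_edge = 0.02 * X_edge[:, 0] + 0.5 * X_edge[:, 1] + rng.normal(0, 50, 400)  # edge latency (us)

base_model = RandomForestRegressor(n_estimators=200, random_state=2).fit(X_edge, y_edge)

# 5-shot adaptation: a few designs measured on the target cloud FPGA (synthetic here)
X_cloud_few = rng.uniform([1e3, 8, 4, 100], [5e4, 512, 256, 300], size=(5, 4))
y_cloud_few = 0.6 * (0.02 * X_cloud_few[:, 0] + 0.5 * X_cloud_few[:, 1])

adapter = LinearRegression().fit(
    base_model.predict(X_cloud_few).reshape(-1, 1), y_cloud_few)

X_new = rng.uniform([1e3, 8, 4, 100], [5e4, 512, 256, 300], size=(3, 4))
cloud_pred = adapter.predict(base_model.predict(X_new).reshape(-1, 1))
print("predicted cloud latency (us):", cloud_pred.round(1))
```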