    Combining hardware and software instrumentation to classify program executions

    Several research efforts have studied ways to infer properties of software systems from program spectra gathered from running systems, usually with software-level instrumentation. While these efforts appear to produce accurate classifications, a detailed understanding of their costs and potential cost-benefit tradeoffs is lacking. In this work we present a hybrid instrumentation approach that uses hardware performance counters to gather program spectra at very low cost. This underlying data is further augmented with data captured by minimal amounts of software-level instrumentation. We also evaluate this hybrid approach by comparing it to existing approaches. We conclude that these hybrid spectra can reliably distinguish failed executions from successful executions at a fraction of the runtime overhead of using software-based execution data.
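
    As a rough illustration of this idea (not the authors' tooling), the sketch below gathers a coarse hardware-counter "spectrum" for one program run via Linux `perf stat` and labels the run by comparing it to centroids of previously labeled runs. The event list, feature set, and nearest-centroid rule are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): gather a coarse
# hardware-counter "spectrum" for one program run with `perf stat`, then
# label the run by comparing it to centroids of previously labeled runs.
import math
import subprocess

EVENTS = ["instructions", "branch-misses", "cache-misses"]  # assumed event set

def run_spectrum(cmd):
    """Run `cmd` under perf stat and return one counter value per event."""
    perf = ["perf", "stat", "-x", ",", "-e", ",".join(EVENTS), "--"] + cmd
    proc = subprocess.run(perf, capture_output=True, text=True)
    counts = {}
    for line in proc.stderr.splitlines():          # perf writes counters to stderr
        parts = line.split(",")                    # CSV layout may vary across perf versions
        if len(parts) >= 3 and parts[2] in EVENTS:
            try:
                counts[parts[2]] = float(parts[0])
            except ValueError:
                counts[parts[2]] = 0.0             # "<not counted>" or unsupported event
    return [counts.get(e, 0.0) for e in EVENTS]

def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def classify(spectrum, passing_runs, failing_runs):
    """Nearest-centroid rule over labeled training spectra."""
    to_pass = math.dist(spectrum, centroid(passing_runs))
    to_fail = math.dist(spectrum, centroid(failing_runs))
    return "pass" if to_pass <= to_fail else "fail"
```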

    COST Action IC 1402 ArVI: Runtime Verification Beyond Monitoring -- Activity Report of Working Group 1

    This report presents the activities of the first working group of the COST Action ArVI, Runtime Verification beyond Monitoring. The report aims to provide an overview of the major aspects of Runtime Verification, the field of research dedicated to the analysis of system executions. It is often seen as a discipline that studies how a system run satisfies or violates correctness properties. The report presents a taxonomy of Runtime Verification (RV) and the terminology associated with the main concepts of the field. It also develops the concept of instrumentation, the various ways to instrument systems, and the fundamental role of instrumentation in designing an RV framework. We also discuss how RV interplays with other verification techniques such as model checking, deductive verification, model learning, testing, and runtime assertion checking. Finally, we propose challenges in monitoring quantitative and statistical data beyond detecting property violations.
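
    As a minimal illustration of the runtime-verification setting described above (not taken from the report), the sketch below monitors one correctness property, "every acquire is eventually followed by a matching release", over a stream of events emitted by an instrumented system; the event names and the property itself are assumptions.

```python
# Minimal runtime-monitor sketch: check that every "acquire" on a resource is
# matched by a later "release" before the run ends. Event names are assumptions.
def monitor(events):
    """events: iterable of (action, resource) pairs observed at runtime."""
    held = set()
    for action, resource in events:
        if action == "acquire":
            if resource in held:
                return f"violation: {resource} acquired twice"
            held.add(resource)
        elif action == "release":
            if resource not in held:
                return f"violation: {resource} released without acquire"
            held.discard(resource)
    # at the end of the run, every acquired resource must have been released
    return "violation: unreleased " + ", ".join(sorted(held)) if held else "property satisfied"

print(monitor([("acquire", "lock1"), ("release", "lock1")]))   # property satisfied
print(monitor([("acquire", "lock1"), ("acquire", "lock2")]))   # violation
```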

    Seer: a lightweight online failure prediction approach

    Online failure prediction aims to predict the manifestation of failures at runtime before the failures actually occur. Existing online failure prediction approaches typically operate on data which is either directly reported by the system under test or directly observable from outside system executions. These approaches generally refrain themselves from collecting internal execution data that can further improve the prediction quality. One reason behind this general trend is due to the runtime overhead cost incurred by the measurement instruments that are required to collect the data. In this work we conjecture that large cost reductions in collecting internal execution data for online failure prediction can derive from reducing the cost of the measurement instruments, while still supporting acceptable levels of prediction quality. To evaluate this conjecture, we present a lightweight online failure prediction approach, called Seer. Seer uses fast hardware performance counters to perform most of the data collection work. The data is augmented with further data collected by a minimal amount of software instrumentation that is added to the systems software. We refer to the data collected in this manner as hybrid spectra. We applied the proposed approach to three widely used open source subject applications and evaluated it by comparing and contrasting three types of hybrid spectra and two types of traditional software spectra. At the lowest level of runtime overheads attained in the experiments, the hybrid spectra predicted the failures about half way through the executions with an F-measure of 0.77 and a runtime overhead of 1.98%, on average. Comparing hybrid spectra to software spectra, we observed that, for comparable runtime overhead levels, the hybrid spectra provided significantly better prediction accuracies and earlier warnings for failures than the software spectra. Alternatively, for comparable accuracy levels, the hybrid spectra incurred significantly less runtime overheads and provided earlier warnings
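
    The online aspect can be pictured roughly as follows (a hypothetical sketch, not Seer itself): as counter samples stream in during an execution, each time window is scored against centroids learned from past passing and failing runs, and an early warning is raised once several consecutive windows look "failing". The window features, centroids, and thresholds are assumptions.

```python
# Hypothetical online-prediction loop (not Seer itself): score each incoming
# window of counter samples and warn when several consecutive windows look
# closer to past failing runs than to past passing runs.
import math

def online_predict(sample_stream, pass_centroid, fail_centroid,
                   consecutive_needed=3):
    suspicious = 0
    for i, window in enumerate(sample_stream):   # window: feature vector for one time slice
        if math.dist(window, fail_centroid) < math.dist(window, pass_centroid):
            suspicious += 1
        else:
            suspicious = 0
        if suspicious >= consecutive_needed:
            return i                             # index of the window where the warning fires
    return None                                  # no failure predicted for this run
```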

    Predicting robustness against transient faults of MPI based programs

    Evaluating a program's behaviour in the presence of transient faults is often very time-consuming. To obtain significant data, thousands of executions are required, and each execution carries the substantial overhead of the fault-injection environment. A previously published methodology significantly reduced the time needed to evaluate the robustness of a program execution by exhaustively analysing its execution trace instead of using fault injection. In this paper we present a further improvement in the time needed to evaluate the robustness of parallel programs against transient faults by combining this methodology with PAS2P, a method that characterizes an application based on its message-passing activity. This combination allowed us to predict the robustness of larger parallel programs, in some cases reducing the time needed to calculate robustness by more than 20 times while obtaining a robustness prediction error of less than 4%.
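
    The gain can be pictured with a simple weighted-phase calculation (a hypothetical sketch, not the published method): if the message-passing characterization identifies a small set of representative phases and how often each repeats, whole-program robustness can be estimated from per-phase robustness values obtained by trace analysis, weighted by each phase's contribution to the full run. The phase structure and numbers below are illustrative.

```python
# Hypothetical sketch of the weighted-phase idea (not the published method):
# analyse only representative phases, then extrapolate to the whole run using
# how many dynamic instructions each phase contributes.
def predict_robustness(phases):
    """phases: list of (robust_fraction, instructions_in_phase, repetitions)."""
    covered = sum(instr * reps for _, instr, reps in phases)
    weighted = sum(robust * instr * reps for robust, instr, reps in phases)
    return weighted / covered if covered else 0.0

# Two representative phases standing in for a much longer execution:
print(predict_robustness([(0.92, 1_000_000, 40), (0.75, 250_000, 8)]))
```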

    TaskInsight: Understanding Task Schedules Effects on Memory and Performance

    Recent scheduling heuristics for task-based applications have managed to improve performance by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insights into why, and where, different schedulers improve memory behavior, and how this relates to the applications' performance. To address this, we present TaskInsight, a technique to characterize the memory behavior of different task schedulers through the analysis of data reuse between tasks. TaskInsight provides high-level, quantitative information that can be correlated with tasks' performance variation over time to understand data reuse through the caches due to scheduling choices. TaskInsight is useful for diagnosing and identifying which scheduling decisions affected performance, when they were taken, and why performance changed, both in single- and multi-threaded executions. We demonstrate how TaskInsight can diagnose examples where poor scheduling caused over 10% difference in performance for tasks of the same type, due to changes in the tasks' data reuse through the private and shared caches, in single- and multi-threaded executions of the same application. This flexible insight is key for optimization in many contexts, including data locality, throughput, memory footprint or even energy efficiency. We thank the reviewers for their feedback. This work was supported by the Swedish Research Council, the Swedish Foundation for Strategic Research project FFL12-0051 and carried out within the Linnaeus Centre of Excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center. This paper was also published with the support of the HiPEAC network that received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 687698.
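
    A minimal sketch of the underlying measurement (not the TaskInsight tool): given the order in which a scheduler ran the tasks and the data each task touched, count how much of each task's data was last touched by an earlier task, i.e. data that may still be warm in the caches. The trace representation and address granularity are assumptions.

```python
# Illustrative sketch of inter-task data reuse (not the TaskInsight tool):
# for each task in schedule order, count how many of its addresses were last
# touched by a different, earlier task.
def reuse_profile(schedule):
    """schedule: list of (task_id, set_of_addresses) in execution order."""
    last_toucher = {}                      # address -> task that touched it last
    profile = []
    for task, addrs in schedule:
        reused = sum(1 for a in addrs
                     if a in last_toucher and last_toucher[a] != task)
        profile.append((task, reused, len(addrs)))
        for a in addrs:
            last_toucher[a] = task
    # comparing these profiles across schedulers hints at why one schedule
    # hits the caches better than another
    return profile
```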

    Tracing and profiling machine learning dataflow applications on GPU

    In this paper, we propose a profiling and tracing method for dataflow applications with GPU acceleration. Dataflow models can be represented by graphs and are widely used in many domains such as signal processing and machine learning. Within the graph, the data flows along the edges, and the nodes correspond to the computing units that process the data. To accelerate the execution, co-processing units such as GPUs are often used for compute-intensive nodes. The work in this paper aims to provide useful information about the execution of the dataflow graph on the available hardware, in order to understand and possibly improve performance. The collected traces include low-level information about the CPU from the Linux kernel (system calls), as well as mid-level and high-level information about intermediate libraries such as CUDA, HIP or HSA, and about the dataflow model. This is followed by post-mortem analysis and visualization steps to enhance the trace and present useful information to the user. To demonstrate the effectiveness of the method, it was evaluated on TensorFlow, a well-known machine learning library that uses a dataflow computational graph to represent its algorithms. We present a few examples of machine learning applications that can be optimized with the help of the information provided by our proposed method. For example, we reduce the execution time of a face recognition application by a factor of 5. We suggest a better placement of the computation nodes on the available hardware components for a distributed application. Finally, we also enhance the memory management of an application to speed up its execution.
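
    One post-mortem step can be sketched as follows (a simplified illustration, not the authors' tool chain): merge timestamped events recorded at different layers, such as the kernel, the CUDA runtime, and the dataflow framework, onto one timeline and total the time spent per dataflow node to locate hotspots. The event schema and node names are assumptions.

```python
# Illustrative post-mortem step (not the authors' tool chain): merge timestamped
# events from several layers into one timeline and total time per dataflow node.
from collections import defaultdict

def per_node_time(events):
    """events: list of dicts with 'layer', 'node', 'start', 'end' (seconds)."""
    timeline = sorted(events, key=lambda e: e["start"])
    totals = defaultdict(float)
    for e in timeline:
        totals[(e["node"], e["layer"])] += e["end"] - e["start"]
    return dict(totals)

trace = [
    {"layer": "framework", "node": "conv2d",  "start": 0.000, "end": 0.012},
    {"layer": "cuda",      "node": "conv2d",  "start": 0.002, "end": 0.009},
    {"layer": "framework", "node": "softmax", "start": 0.012, "end": 0.013},
]
print(per_node_time(trace))   # hotspots suggest which nodes to move or optimize
```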

    Vulnerability-Tolerant Architectures for Resource-Constrained Devices

    The abstract is in the attachment.