193 research outputs found

    Characterization and Acceleration of High Performance Compute Workloads

    Get PDF

    Characterization and Acceleration of High Performance Compute Workloads

    Get PDF

    Modeling Data Center Co-Tenancy Performance Interference

    Get PDF
    A multi-core machine allows executing several applications simultaneously. Those jobs are scheduled on different cores and compete for shared resources such as the last level cache and memory bandwidth. Such competitions might cause performance degradation. Data centers often utilize virtualization to provide a certain level of performance isolation. However, some of the shared resources cannot be divided, even in a virtualized system, to ensure complete isolation. If the performance degradation of co-tenancy is not known to the cloud administrator, a data center often has to dedicate a whole machine for a latency-sensitive application to guarantee its quality of service. Co-run scheduling attempts to make good utilization of resources by scheduling compatible jobs into one machine while maintaining their service level agreements. An ideal co-run scheduling scheme requires accurate contention modeling. Recent studies for co-run modeling and scheduling have made steady progress to predict performance for two co-run applications sharing a specific system. This thesis advances co-tenancy modeling in three aspects. First, with an accurate co-run modeling for one system at hand, we propose a regression model to transfer the knowledge and create a model for a new system with different hardware configuration. Second, by examining those programs that yield high prediction errors, we further leverage clustering techniques to create a model for each group of applications that show similar behavior. Clustering helps improve the prediction accuracy of those pathological cases. Third, existing research is typically focused on modeling two application co-run cases. We extend a two-core model to a three- and four-core model by introducing a light-weight micro-kernel that emulates a complicated benchmark through program instrumentation. Our experimental evaluation shows that our cross-architecture model achieves an average prediction error less than 2% for pairwise co-runs across the SPECCPU2006 benchmark suite. For more than two application co-tenancy modeling, we show that our model is more scalable and can achieve an average prediction error of 2-3%

    Performance Analysis and Modeling of Parallel Applications in the Context of Architectural Rooflines

    Get PDF
    Understanding the performance of applications on modern multi- and manycore platforms is a difficult task and involves complex measurement, analysis, and modeling. The Roofline model is used to assess an application's performance on a given architecture. Not much work has been done with the Roofline model using real measurements. Because it can be a very useful tool for understanding application performance on a given architecture, in this thesis we demonstrate the use of architectural roofline data with measured data for analyzing the performance of different benchmarks. We first explain how to use different toolkits to measure the performance of a program. Next, these data are used to generate the roofline plots, based on which we can decide how can we make the application more efficient and remove bottlenecks. Our results show that this can be a powerful tool for analyzing performance of applications over different architectures and different code versions

    Contextual Bandit Modeling for Dynamic Runtime Control in Computer Systems

    Get PDF
    Modern operating systems and microarchitectures provide a myriad of mechanisms for monitoring and affecting system operation and resource utilization at runtime. Dynamic runtime control of these mechanisms can tailor system operation to the characteristics and behavior of the current workload, resulting in improved performance. However, developing effective models for system control can be challenging. Existing methods often require extensive manual effort, computation time, and domain knowledge to identify relevant low-level performance metrics, relate low-level performance metrics and high-level control decisions to workload performance, and to evaluate the resulting control models. This dissertation develops a general framework, based on the contextual bandit, for describing and learning effective models for runtime system control. Random profiling is used to characterize the relationship between workload behavior, system configuration, and performance. The framework is evaluated in the context of two applications of progressive complexity; first, the selection of paging modes (Shadow Paging, Hardware-Assisted Page) in the Xen virtual machine memory manager; second, the utilization of hardware memory prefetching for multi-core, multi-tenant workloads with cross-core contention for shared memory resources, such as the last-level cache and memory bandwidth. The resulting models for both applications are competitive in comparison to existing runtime control approaches. For paging mode selection, the resulting model provides equivalent performance to the state of the art while substantially reducing the computation requirements of profiling. For hardware memory prefetcher utilization, the resulting models are the first to provide dynamic control for hardware prefetchers using workload statistics. Finally, a correlation-based feature selection method is evaluated for identifying relevant low-level performance metrics related to hardware memory prefetching

    Resampling Methods and Visualization Tools for Computer Performance Comparisons in the Presence of Performance Variation

    Get PDF
    Performance variability, stemming from non-deterministic hardware and software behaviors or deterministic behaviors such as measurement bias, is a well-known phenomenon of computer systems which increases the difficulty of comparing computer performance metrics and is slated to become even more of a concern as interest in Big Data Analytics increases. Conventional methods use various measures (such as geometric mean) to quantify the performance of different benchmarks to compare computers without considering this variability which may lead to wrong conclusions. In this paper, we propose three resampling methods for performance evaluation and comparison: a randomization test for a general performance comparison between two computers, bootstrapping confidence estimation, and an empirical distribution and five-number-summary for performance evaluation. The results show that for both PARSEC and high-variance BigDataBench benchmarks: 1) the randomization test substantially improves our chance to identify the difference between performance comparisons when the difference is not large; 2) bootstrapping confidence estimation provides an accurate confidence interval for the performance comparison measure (e.g. ratio of geometric means); and 3) when the difference is very small, a single test is often not enough to reveal the nature of the computer performance due to the variability of computer systems. We further propose using empirical distribution to evaluate computer performance and a five-number-summary to summarize computer performance. We use published SPEC 2006 results to investigate the sources of performance variation by predicting performance and relative variation for 8,236 machines. We achieve a correlation of predicted performances of 0.992 and a correlation of predicted and measured relative variation of 0.5. Finally, we propose the utilization of a novel Biplotting technique to visualize the effectiveness of benchmarks and cluster machines by behavior. We illustrate the results and conclusion through detailed Monte Carlo simulation studies and real examples

    Using Locality and Interleaving Information to Improve Shared Cache Performance

    Get PDF
    The cache interference is found to play a critical role in optimizing cache allocation among concurrent threads for shared cache. Conventional LRU policy usually works well for low interference workloads, while high cache interference among threads demands explicit allocation regulation, such as cache partitioning. Cache interference is shown to be tied to inter-thread memory reference interleaving granularity: high interference is caused by ne-grain interleaving while low interference is caused coarse-grain interleaving. Proling of real multi-program workloads shows that cache set mapping and temporal phase result in the variation of interleaving granularity. When memory references from dierent threads map to disjoint cache sets, or they occur in distinct time windows, they tend to cause little interference due to coarse-grain interleaving. The interleaving granularity measured by runlength in workloads is found to correlate with the preference of cache management policy: ne-grain interleaving workloads perform better with cache partitioning, and coarse-grain interleaving workloads perform better with LRU. Most existing shared cache management techniques are based on working set locality analysis. This dissertation studies the shared cache performance by taking both locality and interleaving information into consideration. Oracle algorithm which provides theoretical best performance is investigated to provide insight into how to design a better practical policy. Proling and analysis of Oracle algorithm lead to the proposal of probabilistic replacement (PR), a novel cache allocation policy. With aggressor threads information learned on-line, PR evicts the bad locality blocks of aggressor threads probabilistically while preserving good locality blocks of non-aggressor threads. PR is shown to be able to adapt to the different interleaving granularities in different sets over time. Its flexibility in tuning eviction probability also improves fairness among thread performance. Evaluation indicates that PR outperforms LRU, UCP, and ideal cache partitioning at moderate hardware cost. For single program cache management, this dissertation also proposes a novel technique: reuse distance last touch predictor (RD-LTP). RD-LTP is able to capture reuse distance information, which represents the intrinsic memory reference pattern. Based on this improved LT predictor, an MRU LT eviction policy is developed to select the right victim at the presence of incorrect LT prediction. In addition to LT predictor, another predictor: reuse distance predictors (RDPs) is proposed, which is able to predict actual reuse distance values. Compared to various existing cache management techniques, these two novel predictors deliver higher cache performance with higher prediction coverage and accuracy at moderate hardware cost

    Unifying hardware and software benchmarking: a resource-agnostic model

    Get PDF
    Lilja (2005) states that “In the field of computer science and engineering there is surprisingly little agreement on how to measure something as fun- damental as the performance of a computer system.”. The field lacks of the most fundamental element for sharing measures and results: an appropriate metric to express performance. Since the introduction of laptops and mobile devices, there has been a strong research focus towards the energy efficiency of hardware. Many papers, both from academia and industrial research labs, focus on methods and ideas to lower power consumption in order to lengthen the battery life of portable device components. Much less effort has been spent on defining the responsibility of software in the overall computational system energy consumption. Some attempts have been made to describe the energy behaviour of software, but none of them abstract from the physical machine where the measurements were taken. In our opinion this is a strong drawback because results can not be generalized. In this work we attempt to bridge the gap between characterization and prediction, of both hardware and software, of performance and energy, in a single unified model. We propose a model designed to be as simple as possible, generic enough to be abstract from the specific resource being described or predicted (applying to both time, memory and energy), but also concrete and practical, allowing useful and precise performance and energy predictions. The model applies to the broadest set of resource possible. We focus mainly on time and memory (hence bridging hardware benchmarking and classical algorithms time complexity), and energy consumption. To ensure a wide applicability of the model in real world scenario, the model is completely black-box, it does not require any information about the source code of the program, and only relies on external metrics, like completion time, energy consumption, or performance counters. Extending the benchmarking model, we define the notion of experimental computational complexity, as the characterization of how the resource usage changes as the input size grows. Finally, we define a high-level energy model capable of characterizing the power consumption of computers and clusters, in terms of the usage of resources as defined by our benchmarking model. We tested our model in four experiments: Expressiveness: we show the close relationship between energy and clas- sical theoretical complexity, also showing that our experimental com- putational complexity is expressive enough to capture interesting be- haviour of programs simply analysing their resource usage. Performance prediction we use the large database of performance mea- sures available on the CPU SPEC website to train our model and predict the performance of the CPU SPEC suite on randomly selected computers. Energy profiling: we tested our model to characterize and predict the power usage of a cluster running OpenFOAM, changing the number of active nodes and cores. Scheduling: on heterogeneous systems applying our performance pre- diction model to features of programs extracted at runtime, we predict the device where is most convenient to execute the programs, in an heterogeneous system

    Workload-Aware Database Monitoring and Consolidation

    Get PDF
    In most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization of less than 4%. This unused capacity can be potentially harnessed to consolidate multiple databases on fewer machines, reducing hardware and operational costs. Virtual machine (VM) technology is one popular way to approach this problem. However, as we demonstrate in this paper, VMs fail to adequately support database consolidation, because databases place a unique and challenging set of demands on hardware resources, which are not well-suited to the assumptions made by VM-based consolidation. Instead, our system for database consolidation, named Kairos, uses novel techniques to measure the hardware requirements of database workloads, as well as models to predict the combined resource utilization of those workloads. We formalize the consolidation problem as a non-linear optimization program, aiming to minimize the number of servers and balance load, while achieving near-zero performance degradation. We compare Kairos against virtual machines, showing up to a factor of 12Ă— higher throughput on a TPC-C-like benchmark. We also tested the effectiveness of our approach on real-world data collected from production servers at Wikia.com, Wikipedia, Second Life, and MIT CSAIL, showing absolute consolidation ratios ranging between 5.5:1 and 17:1
    • …
    corecore