
    Epoch profiles: microarchitecture-based application analysis and optimization

    The performance of data-intensive applications, when running on modern multi- and many-core processors, is largely determined by their memory access behavior. Its most important contributors are the frequency and latency of off-chip accesses and the extent to which long-latency memory accesses can be overlapped with useful computation or with each other. In this paper, we present two methods to better understand application and microarchitectural interactions. An epoch profile is an intuitive way to understand the relationships between three important characteristics: the on-chip cache size, the size of the reorder window of an out-of-order processor, and the frequency of processor stalls caused by long-latency, off-chip requests (epochs). By relating these three quantities, one can more easily understand an application’s memory reference behavior and thus significantly reduce the design space. While epoch profiles help to provide insight into the behavior of a single application, developing an understanding of a number of applications in the presence of area and core count constraints presents additional challenges. Epoch-based microarchitectural analysis is presented as a better way to understand the trade-offs for memory-bound applications in the presence of these physical constraints. Through epoch profiling and optimization, one can significantly reduce the multidimensional design space for hardware/software optimization through the use of high-level model-driven techniques.
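    The abstract does not include an implementation, but the relationship it describes between cache size, reorder-window size, and off-chip stall frequency can be pictured with a small trace-driven sketch. The code below is a minimal, hypothetical illustration, not the paper's profiler: it models an LRU cache of a given size and groups off-chip misses that fall within one reorder window into a single epoch, so that sweeping cache_lines and rob_size over a memory-access trace yields a rough epoch profile.

```python
from collections import OrderedDict

def epoch_profile(trace, cache_lines, rob_size, line_bytes=64):
    """trace: iterable of (instruction_index, address) pairs for memory accesses.
    Returns the number of epochs observed for this cache size and reorder-window size."""
    cache = OrderedDict()      # simple LRU model: line address -> None
    epochs = 0
    epoch_start = None         # instruction index of the first miss in the current epoch
    for inst, addr in trace:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)          # hit: refresh LRU position
            continue
        cache[line] = None                   # miss: install the line
        if len(cache) > cache_lines:
            cache.popitem(last=False)        # evict the LRU line
        # misses farther apart than the reorder window cannot overlap -> new epoch
        if epoch_start is None or inst - epoch_start > rob_size:
            epochs += 1
            epoch_start = inst
    return epochs
```

    Sweeping cache_lines and rob_size over the same trace would produce the kind of profile surface the paper uses to prune the hardware/software design space.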

    Dynamic Management of Multi-Core Processor Resources to Improve Energy Efficiency under Quality-of-Service Constraints

    With the current technology trends, the number of computers and the demand for computation are increasing dramatically. In addition to different economic and environmental costs at a large scale, the operational time of battery-powered devices depends on how efficiently the computer processors consume energy. Computer processors generally consist of several processing cores and a hierarchy of cache memory that includes both private and shared cache capacity among the cores. A resource management algorithm can adjust the configuration of different core and cache resources at regular intervals during run-time, according to the dynamic characteristics of the workload. A typical resource management policy is to maximize performance, in terms of processing speed or throughput, without exceeding the power and thermal limits. However, this can lead to excessive energy expenditure since higher performance does not necessarily increase the value of the outcome. For example, increasing the frame-rate of multi-media applications beyond a certain target will not improve user experience considerably. Therefore, applications should be associated with Quality-of-Service (QoS) targets. This way, the resource manager can search for the minimum-energy configuration that does not violate the performance constraints of any application. To achieve this goal, we propose several resource management schemes as well as hardware and software techniques for performance and energy modeling, in three papers that constitute this thesis. In the first paper, we demonstrate that, in many cases, independent management of resources such as per-core dynamic voltage-frequency scaling (DVFS) and cache partitioning fails to save considerable energy without causing performance degradation. Therefore, we present a coordinated resource management algorithm that saves considerable energy by exploring different combinations of resource allocations to all applications, at regular intervals during run-time. This scheme is based on simplified analytical performance and energy models and a multi-level reduction technique for reducing the dimensions of the multi-core configuration space. In the second paper, we extend the coordinated resource management with dynamic adaptation of the core micro-architectural resources. This way, we include instruction- and memory-level parallelism (ILP and MLP, respectively) in the resource trade-offs together with per-core DVFS and cache partitioning. This provides a powerful means to further improve energy savings. Additionally, to enable this scheme, we propose a hardware technique that improves the accuracy of performance and energy prediction for different core sizes and cache partitionings. Finally, in the third paper, we demonstrate that substantial improvements in energy savings are possible by allowing short-term deviations from the baseline performance target. We measure these deviations by introducing a parameter called slack. Based on this, we present Cooperative Slack Management (CSM), which finds opportunities to generate slack at low energy cost and utilizes it later to save more energy in the same or even other processor cores. This way, we also ensure that performance consistently remains ahead of the baseline target in every core.
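    As a rough illustration of the coordinated search described above (not the thesis' algorithm, which additionally reduces the configuration space in multiple levels), the sketch below exhaustively evaluates per-core frequency and cache-way assignments using assumed analytical-model callbacks predict_ipc and predict_energy, and keeps the minimum-energy configuration that meets every application's QoS target.

```python
from itertools import product

def choose_config(apps, freqs, total_ways, predict_ipc, predict_energy):
    """apps: objects with a qos_ipc attribute; predict_* are analytical-model stand-ins."""
    best, best_energy = None, float("inf")
    way_splits = [w for w in product(range(1, total_ways), repeat=len(apps))
                  if sum(w) <= total_ways]
    for ways in way_splits:
        for f in product(freqs, repeat=len(apps)):
            # QoS check: every application must meet its performance target
            if any(predict_ipc(a, fi, wi) < a.qos_ipc
                   for a, fi, wi in zip(apps, f, ways)):
                continue
            energy = sum(predict_energy(a, fi, wi)
                         for a, fi, wi in zip(apps, f, ways))
            if energy < best_energy:
                best, best_energy = (f, ways), energy
    return best
```

    In practice this space grows far too quickly for exhaustive run-time search, which is why the thesis prunes it with a multi-level reduction technique.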

    GDP: using dataflow properties to accurately estimate interference-free performance at runtime

    Multi-core memory systems commonly share resources between processors. Resource sharing improves utilization at the cost of increased inter-application interference which may lead to priority inversion, missed deadlines and unpredictable interactive performance. A key component to effectively manage multi-core resources is performance accounting which aims to accurately estimate interference-free application performance. Previously proposed accounting systems are either invasive or transparent. Invasive accounting systems can be accurate, but slow down latency-sensitive processes. Transparent accounting systems do not affect performance, but tend to provide less accurate performance estimates. We propose a novel class of performance accounting systems that achieve both performance-transparency and superior accuracy. We call the approach dataflow accounting, and the key idea is to track dynamic dataflow properties and use these to estimate interference-free performance. Our main contribution is Graph-based Dynamic Performance (GDP) accounting. GDP dynamically builds a dataflow graph of load requests and periods where the processor commits instructions. This graph concisely represents the relationship between memory loads and forward progress in program execution. More specifically, GDP estimates interference-free stall cycles by multiplying the critical path length of the dataflow graph with the estimated interference-free memory latency. GDP is very accurate, with mean IPC estimation errors of 3.4% and 9.8% for our 4- and 8-core processors, respectively. When GDP is used in a cache partitioning policy, we observe average system throughput improvements of 11.9% and 20.8% compared to partitioning using the state-of-the-art Application Slowdown Model.
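    The abstract states the core estimate directly: interference-free stall cycles are approximated as the critical path length of the load dataflow graph multiplied by the estimated interference-free memory latency. The sketch below is a simplified, hypothetical rendering of that calculation (load ids are assumed to be topologically ordered), not GDP's hardware implementation.

```python
def gdp_ipc_estimate(deps, num_loads, compute_cycles, committed_insts,
                     interference_free_latency):
    """deps: (parent_load, child_load) dependence edges between load requests,
    with a child's id always larger than its parents' (topological order)."""
    depth = [1] * num_loads                # longest dependence chain ending at each load
    for parent, child in deps:
        depth[child] = max(depth[child], depth[parent] + 1)
    critical_path = max(depth) if num_loads else 0
    stall_cycles = critical_path * interference_free_latency
    # interference-free IPC ~= committed instructions / (commit periods + estimated stalls)
    return committed_insts / (compute_cycles + stall_cycles)
```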

    QoS Driven Coordinated Management of Resources to Save Energy in Multi-Core Systems

    Reducing the energy consumption of computing systems is a necessary endeavor. However, saving energy should not come at the expense of degrading user experience. To this end, in this thesis, we assume that applications running on multi-core processors are associated with a quality-of-service (QoS) target in terms of performance constraints. This way, hardware resources can be throttled to minimize energy expenditure without violating the QoS requirements. Typical resource management schemes control different resources such as processor cores and on-chip cache memory independently. These approaches are not effective under performance constraints for all applications. Therefore, this thesis presents multi-core resource management schemes that coordinately control several resources in a unified algorithm. This way, the resource manager can find trade-offs between resource allocations to different applications to reduce system-level energy consumption, while still meeting the QoS targets expressed as performance constraints for every application. Implementing a coordinated resource management scheme that dynamically adapts to varying run-time behavior of a multi-programmed workload without any prior knowledge about the applications is a challenging task. Two different schemes are presented in this thesis to address this challenge. Both schemes are invoked at regular intervals during program execution. They employ simple and yet sufficiently accurate analytical models and a novel hardware technique to predict the effect of different resource allocations on performance and energy for each application. Using a heuristic method, the multi-dimensional system configuration space is pruned in several levels to find the optimum resource settings, with respect to energy efficiency, in negligible time. In the first scheme, a resource management algorithm is presented that coordinates the control of voltage-frequency (VF) of each processor core with partitioning of the on-chip cache space. In the second scheme, a re-configurable processor is considered in which sections of the core micro-architectural resources can be dynamically deactivated to save energy. The resource manager can reactivate these sections, at the proper time, to increase instruction- and memory-level parallelism (ILP/MLP). This introduces new trade-offs between processor core size, VF settings, and the allocation of cache space for each application. By exploiting these trade-offs, the second scheme considerably improves the energy savings compared to the first scheme. The proposed schemes are evaluated using a novel simulation framework. This framework estimates the effect of different resource management algorithms on the full execution of benchmark applications in a multi-programmed workload. According to the experimental results, the proposed schemes can save up to 18% of system energy while respecting the performance constraints of all applications. The average energy savings are 6% and 10% with the first and second schemes, respectively. Further experiments on the first scheme show that energy savings can potentially improve up to 29% if users can tolerate a bounded reduction in performance that leads to 40% longer execution time.
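    One way to picture the multi-level pruning mentioned above (the thesis' exact heuristic is not reproduced here) is to first discard, per application, every configuration that misses its QoS target or is dominated in both performance and energy, and only then let the coordinated search combine the surviving candidates. The function names and tuple layout below are hypothetical.

```python
def pareto_prune(configs):
    """configs: list of (performance, energy) tuples for one application.
    Keeps configurations not dominated by another one in both dimensions."""
    return [c for c in configs
            if not any(o != c and o[0] >= c[0] and o[1] <= c[1] for o in configs)]

def candidate_configs(per_app_configs, qos_targets):
    """Level-wise pruning: feasibility first, then Pareto dominance, per application."""
    pruned = []
    for configs, target in zip(per_app_configs, qos_targets):
        feasible = [c for c in configs if c[0] >= target]
        pruned.append(pareto_prune(feasible))
    return pruned   # the coordinated search only has to combine these short lists
```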

    Ubik: efficient cache sharing with strict QoS for latency-critical workloads

    Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.
    Funding: United States Defense Advanced Research Projects Agency (Power Efficiency Revolution For Embedded Computing Technologies, Contract HR0011-13-2-0005); National Science Foundation (U.S.) (Grant CCF-1318384).
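    Ubik's actual mechanism predicts cache transients analytically; the toy controller below only illustrates the underlying idea under assumed parameters: size the latency-critical partition with its warm-up inertia in mind, grow it aggressively when tail latency approaches the target, and release ways back to batch applications slowly when there is headroom.

```python
def resize_partition(lc_ways, total_ways, tail_latency, target,
                     grow_step=2, min_batch_ways=2):
    """Returns (latency-critical ways, batch ways) for the next interval."""
    if tail_latency > target:
        # near or past the tail-latency target: grow aggressively, since a
        # cold partition takes tens of milliseconds to warm up
        lc_ways = min(lc_ways + grow_step, total_ways - min_batch_ways)
    else:
        # comfortable headroom: hand ways back to batch applications one at a time
        lc_ways = max(lc_ways - 1, 1)
    return lc_ways, total_ways - lc_ways
```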

    Adaptive Resource Management Techniques for High Performance Multi-Core Architectures

    Reducing the average memory access time is crucial for improving the performance of applications executing on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Previous work has proposed techniques for partitioning of shared resources (e.g. cache and bandwidth) and prefetch throttling with the goal of mitigating contention and reducing or hiding average memory access time. Cache partitioning in multi-core architectures is challenging due to the need to determine cache allocations with low computational overhead and the need to place the partitions in a locality-aware manner. The requirement for low computational overhead is important in order to scale to large core counts. Previous work within multi-resource management has proposed coordinately managing a subset of the techniques: cache partitioning, bandwidth partitioning and prefetch throttling. However, coordinated management of all three techniques opens up new possible trade-offs and interactions which can be leveraged to gain better performance. This thesis contributes with two different resource management techniques: one resource manager for scalable cache partitioning and a multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching. The scalable resource management technique for cache partitioning uses a distributed and asynchronous cache partitioning algorithm that works together with a flexible NUCA enforcement mechanism in order to give locality-aware placement of data and support fine-grained partitions. The algorithm adapts quickly to application phase changes. The distributed nature of the algorithm together with the low computational complexity enables the solution to be implemented in hardware and scale to large core counts. The multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching is designed using the results from our in-depth characterisation of the entire SPEC CPU2006 suite. The solution consists of three local resource management techniques that, together with a coordination mechanism, provide allocations that take the inter-resource interactions and trade-offs into account. Our evaluation shows that the distributed cache partitioning solution performs within 1% of the best known centralized solution, which cannot scale to large core counts. The solution improves performance by 9% and 16%, on average, on 16- and 64-core multi-core architectures, respectively, compared to a shared last-level cache. The multi-resource management technique gives a performance increase of 11%, on average, over state-of-the-art and improves performance by 50% compared to the baseline 16-core multi-core without cache partitioning, bandwidth partitioning and prefetch throttling.
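    The distributed, asynchronous partitioning algorithm itself is not spelled out in the abstract; as a loose illustration of the general idea, the sketch below lets a pair of cores trade a single cache way whenever one core's estimated marginal gain exceeds the other's marginal loss, a step that pairs of cores could run independently and asynchronously. All names and the utility callbacks are hypothetical.

```python
def trade_ways(ways_a, ways_b, marginal_gain_a, marginal_gain_b):
    """marginal_gain_x(n): estimated extra hits for core x when going from n-1 to n ways."""
    gain_a, loss_a = marginal_gain_a(ways_a + 1), marginal_gain_a(ways_a)
    gain_b, loss_b = marginal_gain_b(ways_b + 1), marginal_gain_b(ways_b)
    if ways_b > 1 and gain_a - loss_b > 0 and gain_a - loss_b >= gain_b - loss_a:
        return ways_a + 1, ways_b - 1     # core a gains more than core b loses
    if ways_a > 1 and gain_b - loss_a > 0:
        return ways_a - 1, ways_b + 1     # core b gains more than core a loses
    return ways_a, ways_b                 # no mutually beneficial trade
```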

    Efficient memory-level parallelism extraction with decoupled strands

    We present Outrider, an architecture for throughput-oriented processors that exploits intra-thread memory-level parallelism (MLP) to improve performance efficiency on highly threaded workloads. Outrider enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams, consisting of either memory-accessing or memory-consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can expose MLP in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture. Instead of adding more threads as is done in modern GPUs, Outrider can expose the same MLP with fewer threads and reduced contention for resources shared among threads. We demonstrate that Outrider can outperform single-threaded cores by 23-131% and a 4-way simultaneous multi-threaded core by up to 87% in data-parallel applications in a 1024-core system. Outrider achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved efficiency compared to a multi-threaded core.
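    As a software analogy for the decoupling described above (Outrider itself is a hardware microarchitecture, so this is only a conceptual sketch), the access strand runs ahead and pushes loaded values into a queue while the compute strand consumes them; in hardware the queued loads would overlap, exposing MLP without out-of-order logic.

```python
from collections import deque

def run_decoupled(indices, memory, combine):
    """Toy access/execute decoupling: 'memory' is indexable, 'combine' folds values."""
    queue = deque()
    for i in indices:                # access strand: only issues loads, runs ahead
        queue.append(memory[i])      # in hardware these loads would be in flight together
    result = 0
    while queue:                     # compute strand: consumes loaded values in order
        result = combine(result, queue.popleft())
    return result

# e.g. run_decoupled(range(8), [2 * i for i in range(8)], lambda acc, v: acc + v) == 56
```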