370 research outputs found

    Adaptive Resource Management Techniques for High Performance Multi-Core Architectures

    Get PDF
    Reducing the average memory access time is crucial for improving the performance of applications executing on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Previous works has proposed techniques for partitioning of shared resources (e.g. cache and bandwidth) and prefetch throttling with the goal of mitigating contention and reducing or hiding average memory access time.Cache partitioning in multi-core architectures is challenging due to the need to determine cache allocations with low computational overhead and the need to place the partitions in a locality-aware manner. The requirement for low computational overhead is important in order to have the capability to scale to large core counts. Previous work within multi-resource management has proposed coordinately managing a subset of the techniques: cache partitioning, bandwidth partitioning and prefetch throttling. However, coordinated management of all three techniques opens up new possible trade-offs and interactions which can be leveraged to gain better performance. This thesis contributes with two different resource management techniques: One resource manger for scalable cache partitioning and a multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching. The scalable resource management technique for cache partitioning uses a distributed and asynchronous cache partitioning algorithm that works together with a flexible NUCA enforcement mechanism in order to give locality-aware placement of data and support fine-grained partitions. The algorithm adapts quickly to application phase changes. The distributed nature of the algorithm together with the low computational complexity, enables the solution to be implemented in hardware and scale to large core counts. The multi-resource management technique for coordinated management of cache partitioning bandwidth partitioning and prefetching is designed using the results from our in-depth characterisation from the entire SPEC CPU2006 suite. The solution consists of three local resource management techniques that together with a coordination mechanism provides allocations which takes the inter-resource interactions and trade-offs into account.Our evaluation shows that the distributed cache partitioning solution performs within 1% from the best known centralized solution, which cannot scale to large core counts. The solution improves performance by 9% and 16%, on average, on a 16 and 64-core multi-core architecture, respectively, compared to a shared last-level cache. The multi-resource management technique gives a performance increase of 11%, on average, over state-of-the-art and improves performance by 50% compared to the baseline 16-core multi-core without cache partitioning, bandwidth partitioning and prefetch throttling

    Smart hardware designs for probabilistically-analyzable processor architectures

    Get PDF
    Future Critical Real-Time Embedded Systems (CRTES), like those is planes, cars or trains, require more and more guaranteed performance in order to satisfy the increasing performance demands of advanced complex software features. While increased performance can be achieved by deploying processor techniques currently used in High-Performance Computing (HPC) and mainstream domains, their use challenges the software timing analysis, a necessary step in CRTES' verification and validation. Cache memories are known to have high impact in performance, and in fact, current CRTES include multicores usually with several levels of cache. In this line, this Thesis aims at increasing the guaranteed performance of CRTES by using techniques for caches building upon time randomization and providing probabilistic guarantees of tasks' execution time. In this Thesis, we first focus on on improving cache placement and replacement to improve guaranteed performance. For placement, different existing policies are explored in a multi-level cache setup, and a solution is reached in which different of those policies are combined. For cache replacement, we analyze a pathological scenario that no cache policy so far accounts and propose several policies that fix this pathological scenario. For shared caches in multicore we observe that contention is mainly caused by private writes that go through to the shared cache, yet using a pure write-back policy also has its drawbacks. We propose a hybrid approach to mitigate this contention. Building on this solution, the next contribution tackles a problem caused by the need of some reliability mechanisms in CRTES. Implementing reliability close to the processor's core has a significant impact in performance. A look-ahead error detection solution is proposed to greatly mitigate the performance impact. The next contribution proposes the first hardware prefetcher for CRTES with arbitrary cache hierarchies. Given its speculative nature, prefetchers that have a guaranteed positive impact on performance are difficult to design. We present a framework that provides execution time guarantees and obtains a performance benefit. Finally, we focus on the impact of timing anomalies in CRTES with caches. For the first time, a definition and taxonomy of timing anomalies is given for Measurement-Based Timing Analysis. Then, we focus on a specific timing anomaly that can happen with caches and provide a solution to account for it in the execution time estimates.Los Sistemas Empotrados de Tiempo-Real Crítico (SETRC), como los de los aviones, coches o trenes, requieren más y más rendimiento garantizado para satisfacer la demanda al alza de rendimiento para funciones complejas y avanzadas de software. Aunque el incremento en rendimiento puede ser adquirido utilizando técnicas de arquitectura de procesadores actualmente utilizadas en la Computación de Altas Prestaciones (CAP) i en los dominios convencionales, este uso presenta retos para el análisis del tiempo de software, un paso necesario en la verificación y validación de SETRC. Las memorias caches son conocidas por su gran impacto en rendimiento y, de hecho, los actuales SETRC incluyen multicores normalmente con diversos niveles de cache. En esta línea, esta Tesis tiene como objetivo mejorar el rendimiento garantizado de los SETRC utilizando técnicas para caches y utilizando métodos como la randomización del tiempo y proveyendo garantías probabilísticas de tiempo de ejecución de las tareas. En esta Tesis, primero nos centramos en mejorar la colocación y el reemplazo de caches para mejorar el rendimiento garantizado. Para la colocación, diferentes políticas son exploradas en un sistema cache multi-nivel, y se llega a una solución donde diversas de estas políticas son combinadas. Para el reemplazo, analizamos un escenario patológico que ninguna política actual tiene en cuenta, y proponemos varias políticas que solucionan este escenario patológico. Para caches compartidas en multicores, observamos que la contención es causada principalmente por escrituras privadas que van a través de la cache compartida, pero usar una política de escritura retardada pura también tiene sus consecuencias. Proponemos un enfoque híbrido para mitigar la contención. Sobre esta solución, la siguiente contribución ataca un problema causado por la necesidad de mecanismos de fiabilidad en SETRC. Implementar fiabilidad cerca del núcleo del procesador tiene un impacto significativo en rendimiento. Una solución basada en anticipación se propone para mitigar el impacto en rendimiento. La siguiente contribución propone el primer prefetcher hardware para SETRC con una jerarquía de caches arbitraria. Por primera vez, se da una definición y taxonomía de anomalías temporales para Análisis Temporal Basado en Medidas. Después, nos centramos en una anomalía temporal concreta que puede pasar con caches y ofrecemos una solución que la tiene en cuenta en las estimaciones del tiempo de ejecución.Postprint (published version

    Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

    Full text link
    Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. We open-source Hermes.Comment: To appear in 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202

    Best-Offset Hardware Prefetching

    Get PDF
    International audienceHardware prefetching is an important feature of modern high-performance processors. When the application working set is too large to fit in on-chip caches, disabling hardware prefetchers may result in severe performance reduction. A new prefetcher was recently introduced, the Sandbox prefetcher, that tries to find dynamically the best prefetch offset using the sandbox method. The Sandbox prefetcher uses simple hardware and was shown to be quite effective. However, the sandbox method does not take into account prefetch timeliness. We propose an offset prefetcher with a new method for selecting the prefetch offset that takes into account prefetch timeliness. We show that our Best-Offset prefetcher outperforms the Sandbox prefetcher on the SPEC CPU2006 benchmarks , with equally simple hardware

    Adaptive Prefetching and Cache Partitioning for Multicore Processors

    Get PDF
    El acceso a la memoria principal en los procesadores actuales supone un importante cuello de botella para las prestaciones, dado que los diferentes núcleos compiten por el limitado ancho de banda de memoria, agravando la brecha entre las prestaciones del procesador y las de la memoria principal. Distintas técnicas atacan este problema, siendo las más relevantes el uso de jerarquías de caché multinivel y la prebúsqueda. Las cachés jerárquicas aprovechan la localidad temporal y espacial que en general presentan los programas en el acceso a los datos, para mitigar las enormes latencias de acceso a memoria principal. Para limitar el número de accesos a la memoria DRAM, fuera del chip, los procesadores actuales cuentan con grandes cachés de último nivel (LLC). Para mejorar su utilización y reducir costes, estas cachés suelen compartirse entre todos los núcleos del procesador. Este enfoque mejora significativamente el rendimiento de la mayoría de las aplicaciones en comparación con el uso de cachés privados más pequeños. Compartir la caché, sin embargo, presenta una problema importante: la interferencia entre aplicaciones. La prebúsqueda, por otro lado, trae bloques de datos a las cachés antes de que el procesador los solicite, ocultando la latencia de memoria principal. Desafortunadamente, dado que la prebúsqueda es una técnica especulativa, si no tiene éxito puede contaminar la caché con bloques que no se usarán. Además, las prebúsquedas interfieren con los accesos a memoria normales, tanto los del núcleo que emite las prebúsquedas como los de los demás. Esta tesis se centra en reducir la interferencia entre aplicaciones, tanto en las caché compartidas como en el acceso a la memoria principal. Para reducir la interferencia entre aplicaciones en el acceso a la memoria principal, el mecanismo propuesto en esta disertación regula la agresividad de cada prebuscador, activando o desactivando selectivamente algunos de ellos, dependiendo de su rendimiento individual y de los requisitos de ancho de banda de memoria principal de los otros núcleos. Con respecto a la interferencia en cachés compartidos, esta tesis propone dos técnicas de particionado para la LLC, las cuales otorgan más espacio de caché a las aplicaciones que progresan más lentamente debido a la interferencia entre aplicaciones. La primera propuesta de particionado de caché requiere hardware específico no disponible en procesadores comerciales, por lo que se ha evaluado utilizando un entorno de simulación. La segunda propuesta de particionado de caché presenta una familia de políticas que superan las limitaciones en el número de particiones y en el número de vías de caché disponibles mediante la agrupación de aplicaciones en clústeres y la superposición de particiones de caché, por lo que varias aplicaciones comparten las mismas vías. Dado que se ha implementado utilizando los mecanismos para el particionado de la LLC que presentan algunos procesadores Intel modernos, esta propuesta ha sido evaluada en una máquina real. Los resultados experimentales muestran que el mecanismo de prebúsqueda selectiva propuesto en esta tesis reduce el número de solicitudes de memoria principal en un 20%, cosa que se traduce en mejoras en la equidad del sistema, el rendimiento y el consumo de energía. Por otro lado, con respecto a los esquemas de partición propuestos, en comparación con un sistema sin particiones, ambas propuestas reducen la iniquidad del sistema en un promedio de más del 25%, independientemente de la cantidad de aplicaciones en ejecución, y esta reducción en la injusticia no afecta negativamente al rendimiento.Accessing main memory represents a major performance bottleneck in current processors, since the different cores compete among them for the limited offchip bandwidth, aggravating even more the so called memory wall. Several techniques have been applied to deal with the core-memory performance gap, with the most preeminent ones being prefetching and hierarchical caching. Hierarchical caches leverage the temporal and spacial locality of the accessed data, mitigating the huge main memory access latencies. To limit the number of accesses to the off-chip DRAM memory, current processors feature large Last Level Caches. These caches are shared between all the cores to improve the utilization of the cache space and reduce cost. This approach significantly improves the performance of most applications compared to using smaller private caches. Cache sharing, however, presents an important shortcoming: the interference between applications. Prefetching, on the other hand, brings data blocks to the caches before they are requested, hiding the main memory latency. Unfortunately, since prefetching is a speculative technique, inaccurate prefetches may pollute the cache with blocks that will not be used. In addition, the prefetches interfere with the regular memory requests, both the ones from the application running on the core that issued the prefetches and the others. This thesis focuses on reducing the inter-application interference, both in the shared cache and in the access to the main memory. To reduce the interapplication interference in the access to main memory, the proposed approach regulates the aggressiveness of each core prefetcher, and selectively activates or deactivates some of them, depending on their individual performance and the main memory bandwidth requirements of the other cores. With respect to interference in shared caches, this thesis proposes two LLC partitioning techniques that give more cache space to the applications that have their progress diminished due inter-application interferences. The first cache partitioning proposal requires dedicated hardware not available in commercial processors, so it has been evaluated using a simulation framework. The second proposal dealing with cache partitioning presents a family of partitioning policies that overcome the limitations in the number of partitions and the number of available ways by grouping applications and overlapping cache partitions, so multiple applications share the same ways. Since it has been implemented using the cache partitioning features of modern Intel processors it has been evaluated in a real machine. Experimental results show that the proposed selective prefetching mechanism reduces the number of main memory requests by 20%, which translates to improvements in unfairness, performance, and energy consumption. On the other hand, regarding the proposed partitioning schemes, compared to a system with no partitioning, both reduce unfairness more than 25% on average, regardless of the number of applications running in the multicore, and this reduction in unfairness does not negatively affect the performance.L'accés a la memòria principal en els processadors actuals suposa un important coll d'ampolla per a les prestacions, ja que els diferents nuclis competeixen pel limitat ample de banda de memòria, agreujant la bretxa entre les prestacions del processador i les de la memòria principal. Diferents tècniques ataquen aquest problema, sent les més rellevants l'ús de jerarquies de memòria cau multinivell i la prebusca. Les memòries cau jeràrquiques aprofiten la localitat temporal i espacial que en general presenten els programes en l'accés a les dades per mitigar les enormes latències d'accés a memòria principal. Per limitar el nombre d'accessos a la memòria DRAM, fora del xip, els processadors actuals compten amb grans caus d'últim nivell (LLC). Per millorar la seva utilització i reduir costos, aquestes memòries cau solen compartir-se entre tots els nuclis del processador. Aquest enfocament millora significativament el rendiment de la majoria de les aplicacions en comparació amb l'ús de caus privades més menudes. Compartir la memòria cau, no obstant, presenta una problema important: la interferencia entre aplicacions. La prebusca, per altra banda, porta blocs de dades a les memòries cau abans que el processador els sol·licite, ocultant la latència de memòria principal. Desafortunadament, donat que la prebusca és una técnica especulativa, si no té èxit pot contaminar la memòria cau amb blocs que no fan falta. A més, les prebusques interfereixen amb els accessos normals a memòria, tant els del nucli que emet les prebusques com els dels altres. Aquesta tesi es centra en reduir la interferència entre aplicacions, tant en les cau compartides com en l'accés a la memòria principal. Per reduir la interferència entre aplicacions en l'accés a la memòria principal, el mecanismo proposat en aquesta dissertació regula l'agressivitat de cada prebuscador, activant o desactivant selectivament alguns d'ells, en funció del seu rendiment individual i dels requisits d'ample de banda de memòria principal dels altres nuclis. Pel que fa a la interferència en caus compartides, aquesta tesi proposa dues tècniques de particionat per a la LLC, les quals atorguen més espai de memòria cau a les aplicacions que progressen més lentament a causa de la interferència entre aplicacions. La primera proposta per al particionat de memòria cau requereix hardware específic no disponible en processadors comercials, per la qual cosa s'ha avaluat utilitzant un entorn de simulació. La segona proposta de particionat per a memòries cau presenta una família de polítiques que superen les limitacions en el nombre de particions i en el nombre de vies de memòria cau disponibles mitjan¿ cant l'agrupació d'aplicacions en clústers i la superposició de particions de memòria cau, de manera que diverses aplicacions comparteixen les mateixes vies. Atès que s'ha implementat utilitzant els mecanismes per al particionat de la LLC que ofereixen alguns processadors Intel moderns, aquesta proposta s'ha avaluat en una màquina real. Els resultats experimentals mostren que el mecanisme de prebusca selectiva proposat en aquesta tesi redueix el nombre de sol·licituds a la memòria principal en un 20%, cosa que es tradueix en millores en l'equitat del sistema, el rendiment i el consum d'energia. Per altra banda, pel que fa als esquemes de particiónat proposats, en comparació amb un sistema sense particions, ambdues propostes redueixen la iniquitat del sistema en més d'un 25% de mitjana, independentment de la quantitat d'aplicacions en execució, i aquesta reducció en la iniquitat no afecta negativament el rendiment.Selfa Oliver, V. (2018). Adaptive Prefetching and Cache Partitioning for Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112423TESI

    Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

    Get PDF
    With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system, and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources are usually shared, such as the cache, to maximize utilization but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches.However, microarchitectural optimizations which were assumed to be fundamentally secure for a long time, can be used in side-channel attacks to exploit secrets, as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes with three adaptive microarchitectural resource management optimizations to improve security and/or\ua0performance\ua0of multi-core architectures\ua0and a systematization-of-knowledge of timing-based side-channel attacks.\ua0We observe that to achieve high-performance cache partitioning in a multi-core system\ua0three requirements need to be met: i) fine-granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to\ua0high overheads for current centralized partitioning solutions, especially as the number of cores in the\ua0system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The\ua0allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity. The\ua0solution is implementable in hardware, due to low computational complexity, and can scale to large core counts.According to our analysis, better performance can be achieved by coordination of multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but is challenging due to the increased number of possible allocations which need to be evaluated.\ua0Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning and prefetching.\ua0Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.The continuously growing number of\ua0side-channel attacks leveraging\ua0microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation\ua0and information flow.\ua0Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection.\ua0Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities.\ua0Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance.\ua0To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness,\ua0and provides quantifiable and information theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications

    Systems for Challenged Network Environments.

    Full text link
    Developing regions face significant challenges in network access, making even simple network tasks unpleasant and rich media prohibitively difficult to access. Even as cellular network coverage is approaching a near-universal reach, good network connectivity remains scarce and expensive in many emerging markets. The underlying theme in this dissertation is designing network systems that better accommodate users in emerging markets. To do so, this dissertation begins with a nuanced analysis of content access behavior for web users in developing regions. This analysis finds the personalization of content access---and the fragmentation that results from it---to be significant factors in undermining many existing web acceleration mechanisms. The dissertation explores content access behavior from logs collected at shared internet access sites, as well as user activity information obtained from a commercial social networking service with over a hundred million members worldwide. Based on these observations, the dissertation then discusses two systems designed for improving end-user experience in accessing and using content in constrained networks. First, it deals with the challenge of distributing private content in these networks. By leveraging the wide availability of cellular telephones, the dissertation describes a system for personal content distribution based on user access behavior. The system enables users to request future data accesses, and it schedules content transfers according to current and expected capacity. Second, the dissertation looks at routing bulk data in challenged networks, and describes an experimentation platform for building systems for challenged networks. This platform enables researchers to quickly prototype systems for challenged networks, and iteratively evaluate these systems using mobility and network emulation. The dissertation describes a few data routing systems that were built atop this experimentation platform. Finally, the dissertation discusses the marketplace and service discovery considerations that are important in making these systems viable for developing-region use. In particular, it presents an extensible, auction-based market platform that relies on widely available communication tools for conveniently discovering and trading digital services and goods in developing regions. Collectively, this dissertation brings together several projects that aim to understand and improve end-user experience in challenged networks endemic to developing regions.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/91401/1/azarias_1.pd

    Contextual Bandit Modeling for Dynamic Runtime Control in Computer Systems

    Get PDF
    Modern operating systems and microarchitectures provide a myriad of mechanisms for monitoring and affecting system operation and resource utilization at runtime. Dynamic runtime control of these mechanisms can tailor system operation to the characteristics and behavior of the current workload, resulting in improved performance. However, developing effective models for system control can be challenging. Existing methods often require extensive manual effort, computation time, and domain knowledge to identify relevant low-level performance metrics, relate low-level performance metrics and high-level control decisions to workload performance, and to evaluate the resulting control models. This dissertation develops a general framework, based on the contextual bandit, for describing and learning effective models for runtime system control. Random profiling is used to characterize the relationship between workload behavior, system configuration, and performance. The framework is evaluated in the context of two applications of progressive complexity; first, the selection of paging modes (Shadow Paging, Hardware-Assisted Page) in the Xen virtual machine memory manager; second, the utilization of hardware memory prefetching for multi-core, multi-tenant workloads with cross-core contention for shared memory resources, such as the last-level cache and memory bandwidth. The resulting models for both applications are competitive in comparison to existing runtime control approaches. For paging mode selection, the resulting model provides equivalent performance to the state of the art while substantially reducing the computation requirements of profiling. For hardware memory prefetcher utilization, the resulting models are the first to provide dynamic control for hardware prefetchers using workload statistics. Finally, a correlation-based feature selection method is evaluated for identifying relevant low-level performance metrics related to hardware memory prefetching
    corecore