Search CORE

51 research outputs found

Addressing fairness in SMT multicores with a progress-aware scheduler

Author: Duato Marín José Francisco
Feliu Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 25/05/2015
Field of study

© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Current SMT (simultaneous multithreading) processors co-schedule jobs on the same core, thus sharing core resources like L1 caches. In SMT multicores, threads also compete among themselves for uncore resources like the LLC (last level cache) and DRAM modules. Per process performance degradation over isolated execution mainly depends on process resource requirements and the resource contention induced by co-runners. Consequently, the running processes progress at different pace. If schedulers are not progress aware, the unpredictable execution time caused by unfairness can introduce undesirable behaviors on the system such as difficulties to keep priority-based scheduling. This work proposes a job scheduler for SMT multicores that provides fairness to the execution of multiprogrammed workloads. To this end, the scheduler estimates per-process standalone performance by periodically creating low-contention co-schedules. These estimates are used to compute the per process progress. Then, those processes with less progress are prioritized to enhance fairness. Experimental results on a Intel Xeon with six dual-threaded SMT cores show that the proposed scheduler reduces unfairness, on average, by 3× over Linux OS. Moreover, thanks to the tread to core allocation policy, the scheduler slightly improves throughput and turnaround time.This work was supported by the Spanish Ministerio de Econom´ıa y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program AwardFeliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2015). Addressing fairness in SMT multicores with a progress-aware scheduler. IEEE. https://doi.org/10.1109/IPDPS.2015.48

Crossref

RiuNet

Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores

Author: Duato Marín José Francisco
Feliu-Pérez Josué
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

[EN] Nowadays, high performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors are continuously competing for the shared resources, not only among cores, but also within the core. While resource sharing increases the resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role to deal with performance and fairness. In this work we present a process scheduler for SMT multicores that simultaneously addresses both performance and fairness. This is a major design issue since scheduling for only one of the two targets tends to damage the other. To address performance, the scheduler tackles bandwidth contention at the L1 cache and main memory. To deal with fairness, the scheduler estimates the progress experienced by the processes, and gives priority to the processes with lower accumulated progress. Experimental results on an Intel Xeon E5645 featuring six dual-threaded SMT cores show that the proposed scheduler improves both performance and fairness over two state-of-the-art schedulers and the Linux OS scheduler. Compared to Linux, unfairness is reduced to a half while still improving performance by 5.6 percent.We thank the anonymous reviewers for their constructive and insightful feedback. This work was supported in part by the Spanish Ministerio de Economia y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2014-62246EXP, and by the Intel Early Career Faculty Honor Program Award.Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2017). Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers. 66(5):905-911. https://doi.org/10.1109/TC.2016.2620977S90591166

RiuNet

Contention-Aware Scheduling for SMT Multicore Processors

Author: Feliu Pérez Josué
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 27/03/2017
Field of study

The recent multicore era and the incoming manycore/manythread era generate a lot of challenges for computer scientists going from productive parallel programming, over network congestion avoidance and intelligent power management, to circuit design issues. The ultimate goal is to squeeze out as much performance as possible while limiting power and energy consumption and guaranteeing a reliable execution. The increasing number of hardware contexts of current and future systems makes the scheduler an important component to achieve this goal, as there is often a combinatorial amount of different ways to schedule the distinct threads or applications, each with a different performance due to the inter-application interference. Picking an optimal schedule can result in substantial performance gains. This thesis deals with inter-application interference, covering the problems this fact causes on performance and fairness on actual machines. The study starts with single-threaded multicore processors (Intel Xeon X3320), follows with simultaneous multithreading (SMT) multicores supporting up to two threads per core (Intel Xeon E5645), and goes to the most highly threaded per-core processor that has ever been built (IBM POWER8). The dissertation analyzes the main contention points of each experimental platform and proposes scheduling algorithms that tackle the interference arising at each contention point to improve the system throughput and fairness. First we analyze contention through the memory hierarchy of current multicore processors. The performed studies reveal high performance degradation due to contention on main memory and any shared cache the processors implement. To mitigate such contention, we propose different bandwidth-aware scheduling algorithms with the key idea of balancing the memory accesses through the workload execution time and the cache requests among the different caches at each cache level. The high interference that different applications suffer when running simultaneously on the same SMT core, however, does not only affect performance, but can also compromise system fairness. In this dissertation, we also analyze fairness in current SMT multicores. To improve system fairness, we design progress-aware scheduling algorithms that estimate, at runtime, how the processes progress, which allows to improve system fairness by prioritizing the processes with lower accumulated progress. Finally, this dissertation tackles inter-application contention in the IBM POWER8 system with a symbiotic scheduler that addresses overall SMT interference. The symbiotic scheduler uses an SMT interference model, based on CPI stacks, that estimates the slowdown of any combination of applications if they are scheduled on the same SMT core. The number of possible schedules, however, grows too fast with the number of applications and makes unfeasible to explore all possible combinations. To overcome this issue, the symbiotic scheduler models the scheduling problem as a graph problem, which allows finding the optimal schedule in reasonable time. In summary, this thesis addresses contention in the shared resources of the memory hierarchy and SMT cores of multicore processors. We identify the main contention points of three systems with different architectures and propose scheduling algorithms to tackle contention at these points. The evaluation on the real systems shows the benefits of the proposed algorithms. The symbiotic scheduler improves system throughput by 6.7\% over Linux. Regarding fairness, the proposed progress-aware scheduler reduces Linux unfairness to a third. Besides, since the proposed algorithm are completely software-based, they could be incorporated as scheduling policies in Linux and used in small-scale servers to achieve the mentioned benefits.La actual era multinúcleo y la futura era manycore/manythread generan grandes retos en el área de la computación incluyendo, entre otros, la programación paralela productiva o la gestión eficiente de la energía. El último objetivo es alcanzar las mayores prestaciones limitando el consumo energético y garantizando una ejecución confiable. El incremento del número de contextos hardware de los sistemas hace que el planificador se convierta en un componente importante para lograr este objetivo debido a que existen múltiples formas diferentes de planificar las aplicaciones, cada una con distintas prestaciones debido a las interferencias que se producen entre las aplicaciones. Seleccionar la planificación óptima puede proporcionar importantes mejoras de prestaciones. Esta tesis se ocupa de las interferencias entre aplicaciones, cubriendo los problemas que causan en las prestaciones y equidad de los sistemas actuales. El estudio empieza con procesadores multinúcleo monohilo (Intel Xeon X3320), sigue con multinúcleos con soporte para la ejecución simultanea (SMT) de dos hilos (Intel Xeon E5645), y llega al procesador que actualmente soporta un mayor número de hilos por núcleo (IBM POWER8). La disertación analiza los principales puntos de contención en cada plataforma y propone algoritmos de planificación que mitigan las interferencias que se generan en cada uno de ellos para mejorar la productividad y equidad de los sistemas. En primer lugar, analizamos la contención a lo largo de la jerarquía de memoria. Los estudios realizados revelan la alta degradación de prestaciones provocada por la contención en memoria principal y en cualquier cache compartida. Para mitigar esta contención, proponemos diversos algoritmos de planificación cuya idea principal es distribuir los accesos a memoria a lo largo del tiempo de ejecución de la carga y las peticiones a las caches entre las diferentes caches compartidas en cada nivel. Las altas interferencias que sufren las aplicaciones que se ejecutan simultáneamente en un núcleo SMT, sin embargo, no solo afectan a las prestaciones, sino que también pueden comprometer la equidad del sistema. En esta tesis, también abordamos la equidad en los actuales multinúcleos SMT. Para mejorarla, diseñamos algoritmos de planificación que estiman el progreso de las aplicaciones en tiempo de ejecución, lo que permite priorizar los procesos con menor progreso acumulado para reducir la inequidad. Finalmente, la tesis se centra en la contención entre aplicaciones en el sistema IBM POWER8 con un planificador simbiótico que aborda la contención en todo el núcleo SMT. El planificador simbiótico utiliza un modelo de interferencia basado en pilas de CPI que predice las prestaciones para la ejecución de cualquier combinación de aplicaciones en un núcleo SMT. El número de posibles planificaciones, no obstante, crece muy rápido y hace inviable explorar todas las posibles combinaciones. Por ello, el problema de planificación se modela como un problema de teoría de grafos, lo que permite obtener la planificación óptima en un tiempo razonable. En resumen, esta tesis aborda la contención en los recursos compartidos en la jerarquía de memoria y el núcleo SMT de los procesadores multinúcleo. Identificamos los principales puntos de contención de tres sistemas con diferentes arquitecturas y proponemos algoritmos de planificación para mitigar esta contención. La evaluación en sistemas reales muestra las mejoras proporcionados por los algoritmos propuestos. Así, el planificador simbiótico mejora la productividad, en promedio, un 6.7% con respecto a Linux. En cuanto a la equidad, el planificador que considera el progreso consigue reducir la inequidad de Linux a una tercera parte. Además, dado que los algoritmos propuestos son completamente software, podrían incorporarse como políticas de planificación en Linux y usarse en servidores a pequeña escala para obtener los benefiL'actual era multinucli i la futura era manycore/manythread generen grans reptes en l'àrea de la computació incloent, entre d'altres, la programació paral·lela productiva o la gestió eficient de l'energia. L'últim objectiu és assolir les majors prestacions limitant el consum energètic i garantint una execució confiable. L'increment del número de contextos hardware dels sistemes fa que el planificador es convertisca en un component important per assolir aquest objectiu donat que existeixen múltiples formes distintes de planificar les aplicacions, cadascuna amb unes prestacions diferents degut a les interferències que es produeixen entre les aplicacions. Seleccionar la planificació òptima pot donar lloc a millores importants de les prestacions. Aquesta tesi s'ocupa de les interferències entre aplicacions, cobrint els problemes que provoquen en les prestacions i l'equitat dels sistemes actuals. L'estudi comença amb processadors multinucli monofil (Intel Xeon X3320), segueix amb multinuclis amb suport per a l'execució simultània (SMT) de dos fils (Intel Xeon E5645), i arriba al processador que actualment suporta un major nombre de fils per nucli (IBM POWER8). Aquesta dissertació analitza els principals punts de contenció en cada plataforma i proposa algoritmes de planificació que aborden les interferències que es generen en cadascun d'ells per a millorar la productivitat i l'equitat dels sistemes. En primer lloc, estudiem la contenció al llarg de la jerarquia de memòria en els processadors multinucli. Els estudis realitzats revelen l'alta degradació de prestacions provocada per la contenció en memòria principal i en qualsevol cache compartida. Per a mitigar la contenció, proposem diversos algoritmes de planificació amb la idea principal de distribuir els accessos a memòria al llarg del temps d'execució de la càrrega i les peticions a les caches entre les diferents caches compartides en cada nivell. Les altes interferències que sofreixen las aplicacions que s'executen simultàniament en un nucli SMT, no obstant, no sols afecten a las prestacions, sinó que també poden comprometre l'equitat del sistema. En aquesta tesi, també abordem l'equitat en els actuals multinuclis SMT. Per a millorar-la, dissenyem algoritmes de planificació que estimen el progrés de les aplicacions en temps d'execució, el que permet prioritzar els processos amb menor progrés acumulat para a reduir la inequitat. Finalment, la tesi es centra en la contenció entre aplicacions en el sistema IBM POWER8 amb un planificador simbiòtic que aborda la contenció en tot el nucli SMT. El planificador simbiòtic utilitza un model d'interferència basat en piles de CPI que prediu les prestacions per a l'execució de qualsevol combinació d'aplicacions en un nucli SMT. El nombre de possibles planificacions, no obstant, creix molt ràpid i fa inviable explorar totes les possibles combinacions. Per resoldre aquest contratemps, el problema de planificació es modela com un problema de teoria de grafs, la qual cosa permet obtenir la planificació òptima en un temps raonable. En resum, aquesta tesi aborda la contenció en els recursos compartits en la jerarquia de memòria i el nucli SMT dels processadors multinucli. Identifiquem els principals punts de contenció de tres sistemes amb diferents arquitectures i proposem algoritmes de planificació per a mitigar aquesta contenció. L'avaluació en sistemes reals mostra les millores proporcionades pels algoritmes proposats. Així, el planificador simbiòtic millora la productivitat una mitjana del 6.7% respecte a Linux. Pel que fa a l'equitat, el planificador que considera el progrés aconsegueix reduir la inequitat de Linux a una tercera part. A més, donat que els algoritmes proposats son completament software, podrien incorporar-se com a polítiques de planificació en Linux i emprar-se en servidors a petita escala per obtenir els avantatges mencionats.Feliu Pérez, J. (2017). Contention-Aware Scheduling for SMT Multicore Processors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79081TESISPremios Extraordinarios de tesis doctorale

Crossref

RiuNet

Resource Optimized Scheduling For Enhanced Power Efficiency And Throughput On Chip Multi Processor Platforms

Author: Kundan Shivam
Publication venue: OpenSIUC
Publication date: 01/05/2024
Field of study

The parallel nature of process execution on Chip Multi-Processors (CMPs) has boosted levels of application performance far beyond the capabilities of erstwhile single-core designs. Generally, CMPs offer improved performance by integrating multiple simpler cores onto a single die that share certain computing resources among them such as last-level caches, data buses, and main memory. This ensures architectural simplicity while also boosting performance for multi-threaded applications. However, a major trade-off associated with this approach is that concurrently executing applications incur performance degradation if their collective resource requirements exceed the total amount of resources available to the system. If dynamic resource allocation is not carefully considered, the potential performance gain from having multiple cores may be outweighed by the losses due to contention for allocation of shared resources. Additionally, CMPs with inbuilt dynamic voltage-frequency scaling (DVFS) mechanisms may try to compensate for the performance bottleneck by scaling to higher clock frequencies. For performance degradation due to shared-resource contention, this does not necessarily improve performance but does ensure a significant penalty on power consumption due to the quadratic relation of electrical power and voltage (P_dynamic ∝ V^2 * f).This dissertation presents novel methodologies for balancing the competing requirements of high performance, fairness of execution, and enforcement of priority, while also ensuring overall power efficiency of CMPs. Specifically, we (1) Analyze the problem of resource interference during concurrent process execution and propose two fine-grained scheduling methodologies for improving overall performance and fairness, (2) Develop an approach for enforcement of priority (i.e., minimum performance) for specific processes while avoiding resource starvation for others, and (3) Present a machine-learning approach for maximizing the power efficiency (performance-per-Watt) of CMPs through estimation of a workload\u27s performance and power consumption limits at different clock frequencies.As modern computing workloads become increasingly dynamic, and computers themselves become increasingly ubiquitous, the problem of finding the ideal balance between performance and power consumption of CMPs is of particular relevance today, especially given the unprecedented proliferation of embedded devices for use in Internet-of-Things, edge computing, smart wearables, and even exotic experiments such as space probes comprised entirely of a CMP, sensors, and an antenna ( space chips ). Additionally, reducing power consumption while maintaining constant performance can contribute to addressing the growing problem of dark silicon

OpenSIUC

Optimización de justicia y rendimiento en procesadores multicore asimétricos mediante planificación consciente de la contención

Author: García García Adrián
Publication venue
Publication date: 01/01/2018
Field of study

Los procesadores multicore asimétricos (AMPs) con repertorio común de instrucciones constituyen una alternativa de mayor eficiencia energética que los multicores simétricos para cargas de trabajo diversas. Los AMPs integran cores rápidos de alto rendimiento, con otros más lentos y de bajo consumo. Se ha demostrado que la planificación a nivel de sistema operativo y consciente de la asimetría es esencial para obtener beneficios significativos en cuanto a rendimiento global y para garantizar justicia en este tipo de sistemas. No obstante, para poder llevar esto a cabo, el planificador ha de estimar de forma precisa el progreso que cada hilo realiza al ejecutarse en los diversos tipos de core durante la ejecución. A pesar de la existencia de planificadores que optimizan la justicia o el rendimiento en AMPs, las propuestas existentes habitualmente dependen de extensiones hardware especiales o de modelos de predicción específicos de plataforma y, además, no tienen en cuenta la degradación del rendimiento asociada a la contención en los recursos compartidos (p.ej., caché compartida o bus de memoria). Esto puede limitar la portabilidad del planificador y producir una degradación significativa de la justicia y del rendimiento global. En este Trabajo de Fin de Máster se ha procedido al diseño e implementación en el kernel Linux de un planificador consciente de la contención en recursos compartidos en AMPs, que está orientado a la optimización de la justicia. Asimismo, el planificador expone un parámetro de configuración que permite mejorar gradualmente el rendimiento global a costa de degradar la justicia. La evaluación experimental del planificador propuesto se ha llevado a cabo utilizando hardware multicore asimétrico real

Docta Complutense

Planificación consciente de la contención y gestión de recursos en arquitecturas multicore emergentes

Author: García García Adrián
Publication venue: 'Universidad Complutense de Madrid (UCM)'
Publication date: 29/03/2022
Field of study

Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, leída el 14-12-2021Chip multicore processors (CMPs) currently constitute the architecture of choice for mosto general-pùrpose computing systems, and they will likely continue to be dominant in the near future. Advances in technology have enabled to pack an increasing number of cores and bigger caches on the same chip. Nevertheless, contention on shared resources on CMPs -present since the advent of these architectures- still poses a big challenge. Cores in a CMP typically share a last-level cache (LLC) and other memory-related resources with the remaining cores, such as a DRAM controller and an interconnection network. This causes that co-running applications may intensively compete with each other for these shared resources, leading to substantial and uneven performance degradation...Los procesadores multinúcleo o CMPs (Chip Multicore Processors) son actualmente la arquitectura más usada por la mayoría de sistemas de computación de propósito general, y muy probablemente se mantendrían en esa posición dominante en el futuro cercano. Los avances tecnológicos han permitido integrar progresivamente en el mismo chip más cores y aumentar los tamaños de los distintos niveles de cache. No obstante, la contención de recursos compartidos en CMPs {presente desde la aparición de estas arquitecturas{ todavía representa un reto importante que afrontar. Los cores en un CMP comparten en la mayor parte de los diseños una cache de último nivel o LLC (Last-Level Cache) y otros recursos, como el controlador de DRAM o una red de interconexión. La existencia de dichos recursos compartidos provoca en ocasiones que cuando se ejecutan dos o más aplicaciones simultáneamente en el sistema, se produzca una degradación sustancial y potencialmente desigual del rendimiento entre aplicaciones...Fac. de InformáticaTRUEunpu

Docta Complutense

Improving System Turnaround Time with Intel CAT by Identifying LLC Critical Applications

Author: Petit Martí Salvador Vicente
Pons Terol Julio
Pons-Escat Lucía
Sahuquillo Borrás Julio
Selfa Vicent
Publication venue: Springer
Publication date: 31/08/2018
Field of study

[EN] Resource sharing is a major concern in current multicore processors. Among the shared system resources, the Last Level Cache (LLC) is one of the most critical, since destructive interference between applications accessing it implies more off-chip accesses to main memory, which incur long latencies that can severely impact the overall system performance. To help alleviate this issue, current processors implement huge LLCs, but even so, inter-application interference can harm the performance of a subset of the running applications when executing multiprogram workloads. For this reason, recent Intel processors feature Cache Allocation Technologies (CAT) to partition the cache and assign subsets of cache ways to groups of applications. This paper proposes the Critical-Aware (CA) LLC partitioning approach, which leverages CAT and improves the performance of multiprogram workloads, by identifying and protecting the applications whose performance is more damaged by LLC sharing. Experimental results show that CA improves turnaround time on average by 15%, and up to 40% compared to a baseline system without partitioning.This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2017-92139-EXP. It was also supported by the ExaNest project, with funds from the European Union Horizon 2020 project, with grant agreement No 671553.Pons-Escat, L.; Selfa, V.; Sahuquillo Borrás, J.; Petit Martí, SV.; Pons Terol, J. (2018). Improving System Turnaround Time with Intel CAT by Identifying LLC Critical Applications. Springer. 603-615. https://doi.org/10.1007/978-3-319-96983-1_43603615Sodani, A., et al.: Knights landing: second-generation intel xeon phi product. IEEE Micro 36(2), 34–46 (2016)Qureshi, M.K., Patt, Y.N.: Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches. In: Proceedings of MICRO, pp. 423–432 (2006)Manikantan, R., Rajan, K., Govindarajan, R.: Probabilistic shared cache management (PriSM). In: Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pp. 428–439 (2012)El-Sayed, N., Mukkara, A., Tsai, P.A., Kasture, H., Ma, X., Sanchez, D.: Kpart: a hybrid cache partitioning-sharing technique for commodity multicores. In: Proceedings of HPCA (2018)Selfa, V., Sahuquillo, J., Eeckhout, L., Petit, S., Gómez, M.E.: Application clustering policies to address system fairness with intel’s cache allocation technology. In: Proceedings of PACT, pp. 194–205 (2017)Feliu, J., Sahuquillo, J., Petit, S., Duato, J.: Addressing fairness in SMT multicores with a progress-aware scheduler. In: Proceedings of IPDPS, pp. 187–196 (2015)Van Craeynest, K., Akram, S., Heirman, W., Jaleel, A., Eeckhout, L.: Fairness-aware scheduling on single-ISA heterogeneous multi-cores. In: Proceedings of PACT, pp. 177–188 (2013)Wu, C., Li, J., Xu, D., Yew, P.C., Li, J., Wang, Z.: FPS: a fair-progress process scheduling policy on shared-memory multiprocessors. J. Trans. Parallel Distrib. Syst. 26(2), 444–454 (2015)Eyerman, S., Eeckhout, L.: System-level performance metrics for multiprogram workloads. IEEE Micro 28(3), 42–53 (2008)Henning, J.L.: SPEC CPU2006 benchmark descriptions. Comput. Archit. News 34(4), 1–17 (2006)Miller, J.: Short report: reaction time analysis with outlier exclusion: bias varies with sample size. J. Exp. Psychol. 43(4), 907–912 (1991)Leys, C., Ley, C., Klein, O., Bernard, P., Licata, L.: Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 49(4), 764–766 (2013)Gleixner, T., Molnar, I.: Performance counters for Linux (2009)Van Craeynest, K., Akram, S., Heirman, W., Jaleel, A., Eeckhout, L.: Fairness-aware scheduling on single-ISA heterogeneous multi-cores. In: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. PACT 2013, Piscataway, NJ, USA, pp. 177–188. IEEE Press (2013)Lo, D., Cheng, L., Govindaraju, R., Ranganathan, P., Kozyrakis, C.: Heracles: improving resource efficiency at scale. In: Proceedings of ISCA, pp. 450–462 (2015)Zhu, H., Erez, M.: Dirigent: enforcing QoS for latency-critical tasks on shared multicore systems. In: Proceedings of ASPLOS, pp. 33–47 (2016)Funaro, L., Ben-Yehuda, O.A., Schuster, A.: Ginseng: market-driven LLC allocation. In: Proceedings of USENIX, pp. 295–308 (2016)Subramanian, L., Seshadri, V., Ghosh, A., Khan, S., Mutlu, O.: The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory. In: Proceedings of MICRO, pp. 62–75 (2015)Sanchez, D., Kozyrakis, C.: Vantage: scalable and efficient fine-grain cache partitioning. In: Proceedings of ISCA, pp. 57–68 (2011)Sahuquillo, J., Pont, A.: The filter cache: a run-time cache management approach1. In: 25th EUROMICRO 1999 Conference (1999

RiuNet

A Hardware Approach to Fairly Balance the Inter-Thread Interference in Shared Caches

Author: Gómez Requena María Engracia
Petit Martí Salvador Vicente
Sahuquillo Borrás Julio
Selfa-Oliver Vicent
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

[EN] Shared caches have become the common design choice in the vast majority of modern multi-core and many-core processors, since cache sharing improves throughput for a given silicon area. Sharing the cache, however, has a downside: the requests from multiple applications compete among them for cache resources, so the execution time of each application increases over isolated execution. The degree in which the performance of each application is affected by the interference becomes unpredictable yielding the system to unfairness situations. This paper proposes Fair-Progress Cache Partitioning (FPCP), a low-overhead hardware-based cache partitioning approach that addresses system fairness. FPCP reduces the interference by allocating to each application a cache partition and adjusting the partition sizes at runtime. To adjust partitions, our approach estimates during multicore execution the time each application would have taken in isolation, which is challenging. The proposed approach has two main differences over existing approaches. First, FPCP distributes cache ways incrementally, which makes the proposal less prone to estimation errors. Second, the proposed algorithm is much less costly than the state-of-the-art ASM-Cache approach. Experimental results show that, compared to ASM-Cache, FPCP reduces unfairness by 48 percent in four-application workloads and by 28 percent in eight-application workloads, without harming the performance.This work was supported in part by the Spanish Ministerio de Economia y Competitividad (MINECO) and Plan E funds, under grants TIN2014-62246-EXP and TIN2015-66972-C5-1-R.Selfa-Oliver, V.; Sahuquillo Borrás, J.; Petit Martí, SV.; Gómez Requena, ME. (2017). A Hardware Approach to Fairly Balance the Inter-Thread Interference in Shared Caches. IEEE Transactions on Parallel and Distributed Systems. 28(11):3021-3032. https://doi.org/10.1109/TPDS.2017.2713778S30213032281

Crossref

RiuNet

PRISM: an intelligent adaptation of prefetch and SMT levels

Author: Ortega Carrasco Cristóbal
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2016
Field of study

Current microprocessors include hardware to optimize some specifics workloads. In general, these hardware knobs are set on a default configuration on the booting process of the machine. This default behavior cannot be beneficial for all types of workloads and they are not controlled by anyone but the end user, who needs to know what configuration is the best one for the workload running. Some of these knobs are: (1) the Simultaneous MultiThreading level, which specifies the number of threads that can run simultaneously on a physical CPU, and (2) the data prefetch engine, that manages the prefetches on memory. Parallel programming models are here to stay, and one programming model that succeed in allowing programmers to easily parallelize applications is Open Multi Processing (OMP). Also, the architecture of microprocessors is getting more complex that end users cannot afford to optimize their workloads for all the architectural details. These architectural knobs can help to increase performance but it is needed an automatic and adaptive system managing them. In this work we propose an independent library for OpenMP runtimes to increase performance up to 220% (14.7% on average) while reducing dynamic power consumption up to 13% (2% on average) on a real POWER8 processor

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC