125 research outputs found

    Impact of partitioning cache schemes on the cache hierarchy of SMT processors

    Full text link
    Power consumption is becoming an increasingly important component of processor design. As the technology node shrinks, both static and dynamic power become more relevant. This is particularly critical for the cache hierarchy. Previous implementations mainly focus on reducing only one kind of power in the cache, either static or dynamic. However, a more robust approach that will remain relevant as technology continues to shrink must address both aspects of power. Recent processors, e.g., Intel Core or IBM POWER8, implement simultaneous multithreading (SMT) cores to hide high memory latencies. In these systems, the dynamic energy in the L1 cache is stressed even further, since this cache level is shared by several threads running on the same core. This paper proposes and evaluates the use of phase-adaptive caches in all structures of a 3-level cache hierarchy of SMT cores. Compared to the use of conventional caches, our work yields significant dynamic and leakage energy savings with minimal performance impact.
    This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01.
    Kenyon, S.; López, S.; Sahuquillo Borrás, J. (2015). Impact of partitioning cache schemes on the cache hierarchy of SMT processors. IEEE. https://doi.org/10.1109/HPCC-CSS-ICESS.2015.127
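    The phase-adaptive idea can be sketched in a few lines: at each phase boundary, the cache enables or disables ways based on the miss rate observed over the previous interval, trading capacity (and leakage) for performance. The C sketch below illustrates the general mechanism only; the thresholds, interval definition, and names are illustrative assumptions, not the controller evaluated in the paper.

```c
/* Minimal sketch of phase-adaptive way resizing: at the end of each
 * execution interval (phase boundary), ways are powered on or off
 * based on the observed miss rate. Thresholds and names are
 * illustrative, not taken from the paper. */
#include <stdio.h>

#define MAX_WAYS 8
#define MIN_WAYS 2

static int active_ways = MAX_WAYS;

/* Called once per interval with the interval's access/miss counters. */
void adapt_ways(unsigned long accesses, unsigned long misses)
{
    double miss_rate = accesses ? (double)misses / accesses : 0.0;

    if (miss_rate > 0.05 && active_ways < MAX_WAYS)
        active_ways++;      /* phase needs capacity: re-enable a way */
    else if (miss_rate < 0.01 && active_ways > MIN_WAYS)
        active_ways--;      /* phase fits: power a way down (leakage) */

    printf("active ways: %d (miss rate %.3f)\n", active_ways, miss_rate);
}

int main(void)
{
    adapt_ways(100000, 200);    /* low-miss phase: shrink */
    adapt_ways(100000, 8000);   /* high-miss phase: grow  */
    return 0;
}
```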

    Improving GPU cache hierarchy performance with a fetch and replacement cache

    Get PDF
    In the last few years, GPGPU computing has become one of the most popular computing paradigms in high-performance computers due to its excellent performance-to-power ratio. The memory requirements of GPGPU applications widely differ from those of their CPU counterparts. The number of memory accesses is several orders of magnitude higher in GPU applications than in CPU applications, and they present disparate access patterns. Because of this, large and highly associative Last-Level Caches (LLCs) bring much lower performance gains in GPUs than in CPUs. This paper presents a novel approach to manage LLC misses that efficiently improves LLC hit ratio, memory-level parallelism, and miss latencies in GPU systems. The proposed approach leverages a small additional Fetch and Replacement Cache (FRC) that stores control and coherence information of incoming blocks until they are fetched from main memory. Then, fetched blocks are swapped with the victim blocks to be replaced in the LLC. After that, the eviction of victim blocks is performed from the FRC. This management approach improves performance for three main reasons: (i) the lifetime of blocks being replaced is increased, (ii) the main memory path is unclogged on long bursts of LLC misses, and (iii) the average L2 miss latency is reduced. Experimental results show that our proposal increases performance (OPC) by over 25% in most of the studied applications, reaching improvements of up to 150% in some applications.
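    The FRC management flow described above can be summarized in a short sketch: a miss reserves an FRC entry, the victim block keeps servicing hits in the LLC until the fetch returns, and the victim's eviction is carried out from the FRC. The following C sketch is a simplified software model under assumed structure and field names, not the hardware design itself.

```c
/* Sketch of the Fetch and Replacement Cache (FRC) miss flow. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { unsigned tag; bool valid; bool dirty; } Block;

#define FRC_ENTRIES 16

typedef struct { Block incoming; bool busy; } FrcEntry;

static FrcEntry frc[FRC_ENTRIES];

/* On an LLC miss: reserve an FRC entry for the incoming block so the
 * victim keeps servicing hits in the LLC while memory is accessed. */
int frc_allocate(unsigned miss_tag)
{
    for (int i = 0; i < FRC_ENTRIES; i++) {
        if (!frc[i].busy) {
            frc[i].busy = true;
            frc[i].incoming.tag = miss_tag;
            frc[i].incoming.valid = false;  /* data not fetched yet */
            return i;
        }
    }
    return -1;  /* FRC full: the miss must stall, as with a full MSHR */
}

/* When the fetch completes: swap the fetched block with the LLC
 * victim, then perform the victim's eviction (e.g., writeback) from
 * the FRC rather than from the LLC. */
void frc_fill(int e, Block *llc_victim_slot)
{
    frc[e].incoming.valid = true;
    Block victim = *llc_victim_slot;     /* victim leaves the LLC now */
    *llc_victim_slot = frc[e].incoming;  /* replaced by fetched block */
    if (victim.valid && victim.dirty)
        printf("writeback of victim tag %u from FRC\n", victim.tag);
    frc[e].busy = false;
}

int main(void)
{
    Block llc_slot = { .tag = 7, .valid = true, .dirty = true };
    int e = frc_allocate(42);
    if (e >= 0)
        frc_fill(e, &llc_slot);  /* fetch completed: swap and evict */
    printf("LLC slot now holds tag %u\n", llc_slot.tag);
    return 0;
}
```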

    Bringing Real Processors to Labs

    Full text link
    The architecture of current processors has experienced great changes in recent years, leading to sophisticated multithreaded multicore processors. The inherent complexity of such processors makes it difficult to update processor teaching to include current commercial products, especially in lab sessions, where simplistic simulators are usually used. However, instructors must reduce this gap if they want to properly prepare students in this topic. Dealing with these complex concepts in labs not only helps reinforce theoretical concepts but also has a positive effect on student motivation. This article presents a methodology designed for the study of current microprocessor mechanisms in a gradual way without overwhelming students. The methodology is based on the use of a detailed simulation framework, used both in academia and in industry, which accurately models features of current processors. Due to the huge complexity of the simulator, it is introduced through several learning phases. Qualitative and quantitative results demonstrate that students are able to develop skills in a detailed simulator in a reasonable time period and, at the same time, learn the details of complex architectural mechanisms of commercial microprocessors.
    Contract grant sponsor: Spanish Government; Contract grant number: TIN2012-38341-C04-01.
    Gómez Requena, C.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2015). Bringing Real Processors to Labs. Computer Applications in Engineering Education. 23(5):724-732. https://doi.org/10.1002/cae.21645

    Improving System Turnaround Time with Intel CAT by Identifying LLC Critical Applications

    Full text link
    Resource sharing is a major concern in current multicore processors. Among the shared system resources, the Last Level Cache (LLC) is one of the most critical, since destructive interference between applications accessing it implies more off-chip accesses to main memory, which incur long latencies that can severely impact overall system performance. To help alleviate this issue, current processors implement huge LLCs, but even so, inter-application interference can harm the performance of a subset of the running applications when executing multiprogram workloads. For this reason, recent Intel processors feature Cache Allocation Technology (CAT) to partition the cache and assign subsets of cache ways to groups of applications. This paper proposes the Critical-Aware (CA) LLC partitioning approach, which leverages CAT and improves the performance of multiprogram workloads by identifying and protecting the applications whose performance is most damaged by LLC sharing. Experimental results show that CA improves turnaround time on average by 15%, and by up to 40%, compared to a baseline system without partitioning.
    This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2017-92139-EXP. It was also supported by the ExaNest project, with funds from the European Union's Horizon 2020 programme under grant agreement No 671553.
    Pons-Escat, L.; Selfa, V.; Sahuquillo Borrás, J.; Petit Martí, SV.; Pons Terol, J. (2018). Improving System Turnaround Time with Intel CAT by Identifying LLC Critical Applications. Springer. 603-615. https://doi.org/10.1007/978-3-319-96983-1_43
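    As a practical point of reference, CAT way partitioning is exposed on Linux through the resctrl filesystem, which an approach like CA could be driven through. The sketch below shows the basic steps (creating a partition, writing a way bitmask, assigning a PID). It assumes resctrl is mounted at /sys/fs/resctrl; the group name, mask value, and PID are illustrative, and valid masks are platform-dependent.

```c
/* Minimal sketch of protecting an LLC-critical application with Intel
 * CAT through the Linux resctrl interface (kernel 4.10+). */
#include <stdio.h>
#include <sys/stat.h>

static int write_file(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", text);
    return fclose(f);
}

int main(void)
{
    /* Create a partition (class of service) for the critical app. */
    mkdir("/sys/fs/resctrl/critical", 0755);

    /* Reserve LLC ways for it: the capacity bitmask selects ways on
     * cache id 0; 0xff0 is an example mask, not a recommendation. */
    write_file("/sys/fs/resctrl/critical/schemata", "L3:0=ff0");

    /* Move the critical application (hypothetical PID) into the group. */
    write_file("/sys/fs/resctrl/critical/tasks", "12345");

    return 0;
}
```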

    SYNPA: SMT Performance Analysis and Allocation of Threads to Cores in ARM Processors

    Full text link
    Simultaneous multithreading (SMT) processors improve throughput over single-threaded processors thanks to sharing internal core resources among instructions from distinct threads. However, resource sharing introduces inter-thread interference within the core, which has a negative impact on individual application performance and can significantly increase the turnaround time of multi-program workloads. The severity of the interference effects depends on the competing co-runners sharing the core. Thus, it can be mitigated by applying a thread-to-core allocation policy that smartly selects applications to be run on the same core to minimize their interference. This paper presents SYNPA, a simple approach that dynamically allocates threads to cores in an SMT processor based on their run-time dynamic behavior. The approach uses a regression model to select synergistic pairs to mitigate intra-core interference. The main novelty of SYNPA is that it uses just three variables, collected at the dispatch stage from the performance counters available in current ARM processors. Experimental results show that SYNPA outperforms the default Linux scheduler by around 36%, on average, in terms of turnaround time in 8-application workloads combining frontend-bound and backend-bound benchmarks.
    Comment: 11 pages, 9 figures
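    The pairing step can be illustrated with a toy version: a linear model predicts the interference of each candidate pair from per-thread counter features, and pairs are then formed greedily from the best scores. The features and coefficients below are placeholders, not the three dispatch-stage variables or the fitted model from the paper.

```c
/* Sketch of a SYNPA-style pairing step with a placeholder regression
 * model and greedy matching. */
#include <stdio.h>

#define NTHREADS 4

typedef struct { double f[3]; } Counters;  /* per-thread features */

/* Hypothetical fitted model: predicted interference of pairing a, b. */
double predict_interference(const Counters *a, const Counters *b)
{
    const double w[3] = {0.5, 0.3, 0.2};   /* placeholder coefficients */
    double s = 0.0;
    for (int i = 0; i < 3; i++)
        s += w[i] * (a->f[i] + b->f[i]);
    return s;
}

int main(void)
{
    Counters t[NTHREADS] = {
        {{0.8, 0.1, 0.2}}, {{0.2, 0.7, 0.1}},
        {{0.9, 0.2, 0.3}}, {{0.1, 0.6, 0.2}},
    };
    int paired[NTHREADS] = {0};

    /* Greedy matching: repeatedly take the lowest-interference pair. */
    for (int made = 0; made < NTHREADS / 2; made++) {
        int bi = -1, bj = -1;
        double best = 1e9;
        for (int i = 0; i < NTHREADS; i++)
            for (int j = i + 1; j < NTHREADS; j++)
                if (!paired[i] && !paired[j]) {
                    double p = predict_interference(&t[i], &t[j]);
                    if (p < best) { best = p; bi = i; bj = j; }
                }
        paired[bi] = paired[bj] = 1;
        printf("core %d <- threads %d and %d (score %.2f)\n",
               made, bi, bj, best);
    }
    return 0;
}
```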

    Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors

    Get PDF
    Resource sharing is a critical issue in simultaneous multithreading (SMT) processors, as threads running simultaneously on an SMT core compete for shared resources. Symbiotic job scheduling, which co-schedules applications with complementary resource demands, is an effective solution to maximize hardware utilization and improve overall system performance. However, symbiotic job scheduling typically distributes threads evenly among cores, i.e., all cores get assigned the same number of threads, which we find to lead to sub-optimal performance. In this paper, we show that asymmetric schedules (i.e., schedules that assign a different number of threads to each SMT core) can significantly improve performance compared to symmetric schedules. To leverage this finding, we propose thread isolation, a technique that turns symmetric schedules into asymmetric ones, yielding higher overall system performance. Thread isolation identifies SMT-adverse applications and schedules them in isolation on a dedicated core to mitigate their sharp performance degradation under SMT. Our experimental results on an IBM POWER8 processor show that thread isolation improves system throughput by up to 5.5 percent compared to a state-of-the-art symmetric symbiotic job scheduler.
    Josué Feliu has been partially supported through a postdoctoral fellowship by the Generalitat Valenciana (APOSTD/2017/052). Additional support has been provided by the Ministerio de Ciencia, Innovación y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, as well as by the Universitat Politècnica de València through the "Ayudas a Primeros Proyectos de Investigación" (PAID-06-18) under grant SP20180140. Lieven Eeckhout's research program is supported through FWO grants no. G.0434.16N and G.0144.17N, and the European Research Council (ERC) Advanced Grant agreement no. 741097.
    Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Eeckhout, L. (2020). Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors. IEEE Transactions on Parallel and Distributed Systems. 31(2):359-373. https://doi.org/10.1109/TPDS.2019.2934955
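    The core of thread isolation can be sketched as a simple post-pass over a symmetric schedule: applications whose measured SMT slowdown exceeds a threshold get a dedicated core, as long as the remaining applications still fit on the SMT contexts of the remaining cores. All values below are illustrative, not the classifier used in the paper.

```c
/* Sketch of turning a symmetric schedule into an asymmetric one by
 * isolating SMT-adverse applications. */
#include <stdio.h>

#define NAPPS 6
#define NCORES 4
#define SMT_WAYS 2

int main(void)
{
    /* Measured slowdown of each app when co-running under SMT
     * (1.0 = no degradation); values are made up for the example. */
    double slowdown[NAPPS] = {1.1, 2.3, 1.2, 1.3, 2.6, 1.15};
    const double SMT_ADVERSE = 2.0;

    int isolated = 0;
    int assign[NAPPS];  /* 1 = dedicated core, 0 = shared SMT core */

    for (int i = 0; i < NAPPS; i++) {
        int sharing = NAPPS - (isolated + 1);     /* apps sharing cores */
        int cores_left = NCORES - (isolated + 1);
        /* Isolate only if the remaining apps still fit on the SMT
         * contexts of the remaining cores. */
        if (slowdown[i] > SMT_ADVERSE && sharing <= cores_left * SMT_WAYS) {
            assign[i] = 1;
            isolated++;
        } else {
            assign[i] = 0;
        }
    }
    for (int i = 0; i < NAPPS; i++)
        printf("app %d -> %s core\n", i, assign[i] ? "dedicated" : "shared");
    return 0;
}
```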

    Prácticas de Diseño de Sistemas de Memoria

    Get PDF
    The knowledge about the memory system taught in university Computer Science degrees lends itself to learning by levels, from the basic knowledge of a bit cell up to the design of a memory map. Consequently, the lab exercises designed to reinforce this knowledge can also follow a level-based organization and sequencing. For this reason, a set of lab exercises on the memory system has been designed, within the context of the Computer Science degrees of the Universidad Politécnica de Valencia, to be carried out through a level-based approach: starting with the design of a memory chip, continuing with the construction of a memory module from chips, and finishing with the design of a memory map using several memory modules with different characteristics. For these exercises, the Xilinx digital design and simulation tool was chosen. This tool offers great versatility and flexibility in all phases of the development of a digital electronic system, from its design to its validation by simulation and subsequent implementation. Moreover, students also use this tool in other courses of the degree, so they are already familiar with it. Furthermore, Xilinx is a tool that students will be able to use in their near professional future, due to its power and wide acceptance. This article briefly describes the memory-system design lab exercises that have been developed with Xilinx in the aforementioned teaching context.

    Prácticas Experimentales de Memorias Cache

    Get PDF
    Knowledge of cache memories is considered basic and essential in the education of any university graduate (whether technical or higher degree) in Computer Science. Its study can be approached from different points of view. On the one hand, from a theoretical point of view, describing their basic operation: how a stored block is located, which block must be replaced, etc. On the other hand, from a practical point of view, verifying the studied behavior, normally by means of a simulator. This article falls within the practical point of view and proposes the use of experimental lab exercises for the study of cache memories as a complement to simulators. The idea is to use a simple program that measures memory-system access times in order to experimentally observe the effect of the cache system on the performance of a personal computer. By carrying out the exercises, students deduce questions such as: how many cache levels does the computer have? How many ways does each level have? What is its block size? etc. These exercises were carried out for the first time during the current academic year at the Escuela Técnica Superior de Informática Aplicada and the Facultad de Informática of the Universidad Politécnica de Valencia, with a good degree of acceptance by both instructors and students.
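    A minimal version of the kind of measurement program the article refers to is sketched below: it times strided accesses over arrays of growing size, and jumps in the per-access time expose the capacity of each cache level (varying the stride instead exposes the block size). Sizes, stride, and repetition counts are arbitrary choices for illustration.

```c
/* Time strided memory accesses over arrays of growing size; steps in
 * the ns/access curve reveal the cache-level capacities. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t stride = 64;  /* a typical cache line size in bytes */

    for (size_t kb = 4; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024;
        volatile char *a = malloc(n);
        if (!a) return 1;

        clock_t t0 = clock();
        for (int rep = 0; rep < 100; rep++)
            for (size_t i = 0; i < n; i += stride)
                a[i]++;                  /* touch one byte per line */
        double s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        double accesses = 100.0 * (n / stride);
        printf("%6zu KB: %.2f ns/access\n", kb, 1e9 * s / accesses);
        free((void *)a);
    }
    return 0;
}
```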

    Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

    Full text link
    "© 2019 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."[EN] To support the massive amount of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size considerably increases each GPU generation. This paper shows that counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size does not scale neither in performance nor energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that in most cases the way LLC misses are managed are precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of the incoming blocks until they are fetched from main memory. Then, the fetched blocks are swapped with the victim blocks (i.e., selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance due to three main reasons: i) the lifetime of blocks being replaced is enlarged, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio, memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and with much less area requirements. Experimental results show that the proposed FRC cache scales in performance with the number of GPU compute units and the LLC size, since, depending on the FRC size, performance improves ranging from 30 to 67 percent for a modern baseline GPU card, and from 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average from 49 to 57 percent for the larger GPU. These benefits come with a small area increase (by 7.3 percent) over the LLC baseline.This work has been supported by the Spanish Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grants T-PARCCA (RTI2018-098156-B-C51), and TIN2016-76635-C2-1-R (AEI/ERDF, EU), by the Universitat Politecnica de Valencia under Grant SP20190169, and by the gaZ: T58_17R research group (Aragon Gov. and European ESF).Candel-Margaix, F.; Valero Bresó, A.; Petit Martí, SV.; Sahuquillo Borrás, J. (2019). Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance. IEEE Transactions on Computers. 68(10):1442-1454. https://doi.org/10.1109/TC.2019.2907591S14421454681

    The Tag Filter Architecture: An energy-efficient cache and directory design

    Full text link
    Power consumption in current high-performance chip multiprocessors (CMPs) has become a major design concern that is aggravated by the current trend of increasing the core count. A significant fraction of the total power budget is consumed by on-chip caches, which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. On the other hand, coherence protocols must also implement efficient directory caches that scale in terms of power consumption. Most state-of-the-art techniques that reduce the energy consumption of directories do so at the cost of performance, which may become unacceptable for high-performance CMPs. In this paper, we propose an energy-efficient architectural design that can be effectively applied to any kind of cache memory. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, so that just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the X least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter out the ways where the least significant bits of the tag do not match the bits in the X-bit array. Experimental results show that, on average, the TF Architecture reduces dynamic power consumption across the studied applications by up to 74.9%, 85.9%, and 84.5% when applied to L1 caches, L2 caches, and directory caches, respectively.
    This work has been jointly supported by MINECO and the European Commission (FEDER funds) under project TIN2015-66972-C5-1-R/3-R and by Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia, under project Jóvenes Líderes en Investigación 18956/JLI/13.
    Valls, J.; Ros Bardisa, A.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2017). The Tag Filter Architecture: An energy-efficient cache and directory design. Journal of Parallel and Distributed Computing. 100:193-202. https://doi.org/10.1016/j.jpdc.2016.04.016
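    The TF lookup can be illustrated with a small sketch: a narrow array of low tag bits is checked first, and only the ways whose low bits match probe the full tag (and data) arrays. The width X = 4 and field names below are illustrative; the paper evaluates several widths.

```c
/* Sketch of a Tag Filter lookup over one cache set. */
#include <stdbool.h>
#include <stdio.h>

#define WAYS 8
#define X_BITS 4
#define LOW_MASK ((1u << X_BITS) - 1u)

typedef struct {
    unsigned full_tag[WAYS];
    unsigned low_tag[WAYS];  /* X-bit filter array, read before the tags */
    bool valid[WAYS];
} Set;

/* Returns the hit way, or -1; counts how many ways were really probed. */
int tf_lookup(const Set *set, unsigned tag, int *probed)
{
    *probed = 0;
    for (int w = 0; w < WAYS; w++) {
        /* Filter: skip the way unless the low bits match (cheap check). */
        if (!set->valid[w] || set->low_tag[w] != (tag & LOW_MASK))
            continue;
        (*probed)++;         /* only these ways spend tag/data energy */
        if (set->full_tag[w] == tag)
            return w;
    }
    return -1;
}

int main(void)
{
    Set s = {0};
    s.valid[3] = true;
    s.full_tag[3] = 0xABCD;
    s.low_tag[3] = 0xABCD & LOW_MASK;

    int probed;
    int way = tf_lookup(&s, 0xABCD, &probed);
    printf("hit way %d, ways probed: %d of %d\n", way, probed, WAYS);
    return 0;
}
```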
    • …