3 research outputs found

    Combining Recency of Information with Selective Random and a Victim Cache in Last-Level Caches

    Memory latency has become an important performance bottleneck in current microprocessors. The problem worsens as the number of cores sharing the same memory controller increases. A common way to mitigate it is to implement cache hierarchies with large Last-Level Cache (LLC) organizations. LLCs are implemented with a high number of ways (e.g., 16) to reduce conflict misses. Caches have typically implemented the LRU algorithm to exploit temporal locality, but its performance drifts away from the optimal as the number of ways increases. In addition, implementing a strict LRU algorithm is costly in terms of area and power. This article focuses on a family of low-cost replacement strategies whose implementation scales with the number of ways while maintaining performance. The proposed strategies track the access order for just a few blocks, which cannot be replaced. The victim is randomly selected among those blocks exhibiting poor locality. Although the random policy generally helps improve performance, in some applications the scheme falls short of the LRU policy, leading to performance degradation. This drawback can be overcome by adding a small victim cache to the large LLC. Experimental results show that, using the best version of the family without a victim cache, the MPKI reduction falls between 10% and 11% compared to a set of the most representative state-of-the-art algorithms, whereas the reduction grows up to 22% with respect to LRU. The proposal with a victim cache improves speedup by 4% on average compared to LRU. In addition, it reduces dynamic energy, on average, by up to 8%. Finally, compared to the studied algorithms, hardware complexity is largely reduced by the baseline algorithm of the family.

    This work was supported by the Spanish MICINN, Consolider Programme, and Plan E funds, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01.

    Valero Bresó, A.; Sahuquillo Borrás, J.; Petit Martí, SV.; López Rodríguez, PJ.; Duato Marín, JF. (2012). Combining Recency of Information with Selective Random and a Victim Cache in Last-Level Caches. ACM Transactions on Architecture and Code Optimization. 9(3):1-20. doi:10.1145/2355585.2355589
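The replacement idea in this abstract (protect a few recently used blocks, pick the victim at random among the rest, and catch evictions in a small victim cache) can be illustrated with a minimal software sketch. This is not the authors' hardware design: the class names `CacheSet` and `VictimCache`, the `protected` parameter, and the FIFO victim cache are all assumptions made for illustration.

```python
import random
from collections import deque


class CacheSet:
    """One cache set; hypothetical model, not the paper's implementation."""

    def __init__(self, ways=16, protected=2, rng=None):
        assert 0 <= protected < ways, "at least one block must stay evictable"
        self.ways = ways
        self.protected = protected      # how many recent blocks are tracked
        self.blocks = []                # resident tags, order is irrelevant
        self.recent = []                # tracked tags, most recent last
        self.rng = rng or random.Random(0)

    def _touch(self, tag):
        # Keep access order only for the few most recently used tags.
        if tag in self.recent:
            self.recent.remove(tag)
        self.recent.append(tag)
        if len(self.recent) > self.protected:
            self.recent.pop(0)

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing left."""
        hit = tag in self.blocks
        evicted = None
        if not hit:
            if len(self.blocks) == self.ways:
                # Tracked blocks cannot be replaced; choose randomly among the rest.
                candidates = [b for b in self.blocks if b not in self.recent]
                evicted = self.rng.choice(candidates)
                self.blocks.remove(evicted)
            self.blocks.append(tag)
        self._touch(tag)
        return hit, evicted


class VictimCache:
    """Small FIFO buffer of evicted tags (assumed organization)."""

    def __init__(self, entries=4):
        self.entries = deque(maxlen=entries)

    def insert(self, tag):
        self.entries.append(tag)

    def extract(self, tag):
        """Remove and report the tag if present, as on a victim-cache hit."""
        if tag in self.entries:
            self.entries.remove(tag)
            return True
        return False
```

In this sketch a miss in `CacheSet` would be followed by a call to `VictimCache.extract`, and every evicted tag would be passed to `VictimCache.insert`; how the real proposal sizes and probes the victim cache is not stated in the abstract.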

    FOS: a low-power cache organization for multicores

    The cache hierarchy of current multicore processors typically consists of one or two levels of private caches per core and a large shared last-level cache. This approach wastes area and energy due to oversized private cache space, data replication across the inclusive cache levels, and the use of highly set-associative caches. In this paper, we claim that, although this is the commonly adopted approach, it presents important design issues that can be addressed by a more energy-efficient organization. This work proposes Flat On-chip Storage (FOS), a novel cache organization that targets energy and area in low-power processors and resolves these issues. To this end, FOS combines the L2 and L3 cache levels into a single level, organized as a flat space and composed of a pool of small private cache slices. These slices are initially powered off to save energy, and they are powered on and assigned to cores only when system performance is expected to improve. Providing fast and uniform access from the private L1 caches to the FOS cache slices requires overcoming multiple architectural challenges, which entails the design of a custom optical network-on-chip. Experimental results show that FOS achieves significant savings in both static and dynamic energy over conventional cache organizations with the same storage capacity. FOS static energy savings are as much as 60% over an electrically connected shared cache; these savings grow up to 75% compared to optically connected baselines. Moreover, despite deactivating part of the cache space, FOS achieves performance similar to that of conventional approaches.

    Puche-Lara, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Gómez Requena, ME. (2019). FOS: a low-power cache organization for multicores. The Journal of Supercomputing (Online). 75(10):6542-6573.
    https://doi.org/10.1007/s11227-019-02858-x
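The FOS abstract describes a pool of cache slices that start powered off and are powered on and assigned to a core only when performance is expected to improve. A minimal sketch of that bookkeeping follows. The `SlicePool` class, the `expected_gain` estimate, and the `threshold` parameter are hypothetical; the abstract does not say how the expected improvement is computed.

```python
class SlicePool:
    """Pool of cache slices; None as owner means the slice is powered off."""

    def __init__(self, num_slices):
        # All slices start powered off to save energy.
        self.owner = [None] * num_slices

    def powered_on(self):
        """Number of slices currently powered on."""
        return sum(o is not None for o in self.owner)

    def assign(self, core, expected_gain, threshold=0.0):
        """Power on one free slice for `core` if the estimated gain is worth it.

        `expected_gain` stands in for whatever performance predictor the real
        design uses (an assumption here). Returns the slice index, or None if
        no slice was assigned.
        """
        if expected_gain <= threshold:
            return None
        for index, current in enumerate(self.owner):
            if current is None:
                self.owner[index] = core
                return index
        return None  # pool exhausted

    def release(self, index):
        """Power the slice off again and return it to the pool."""
        self.owner[index] = None
```

For example, `SlicePool(4).assign(core=0, expected_gain=0.1)` powers on the first slice for core 0, while a call with `expected_gain=0.0` leaves every slice off.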

    An efficient cache flat storage organization for multithreaded workloads for low power processors

    The cache hierarchy of current multicores typically consists of three levels, ranging from the faster and smaller L1 level to the slower and larger L3 level. This approach has proven effective in high-performance processors, since it reduces the average memory access time. However, when implemented in devices where energy efficiency is critical, such as low-power or embedded processors, conventional cache hierarchies raise some concerns. These concerns, which waste area and energy, are multiple cache lookups, block replication, block migration, and private cache space overprovisioning. To deal with these issues, in this work we propose FOS-Mt, a new cache organization aimed at saving energy in current multicores running multithreaded applications. FOS-Mt's cache hierarchy consists of only two levels: the L1 cache level located in the core pipeline, and a single flattened second level that forms an aggregated cache space accessible by all the execution cores. This level is sliced into multiple small buffers, which are dynamically assigned to any of the running threads when they are expected to improve system performance. Buffers that are not allocated to any core are powered off to save energy. Experimental results show that FOS-Mt significantly reduces both static and dynamic energy consumption compared with other conventional cache organizations, such as NUCA or shared caches, with the same storage capacity. Compared to the widely known cache decay approach, FOS-Mt improves the energy-delay product by 19.3% on average. Moreover, although FOS-Mt is an energy-aware architecture, performance is scarcely affected: it remains similar to that achieved by conventional and cache decay approaches.

    This work has been supported by the Spanish Ministerio de Economia y Competitividad under grant RTI2018-098156-B-C51, and by the Generalitat Valenciana, Spain under grant AICO/2019/317.

    Puche, J.; Petit Martí, SV.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2020). An efficient cache flat storage organization for multithreaded workloads for low power processors. Future Generation Computer Systems. 110:1037-1054. https://doi.org/10.1016/j.future.2019.11.024
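The FOS-Mt abstract reports its gain as an improvement in the energy delay product (EDP), which is conventionally energy multiplied by execution time. The sketch below shows how such a relative improvement is typically computed. The function names are hypothetical, and the abstract does not state exactly how its 19.3% average was derived.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP: energy consumed multiplied by execution time (lower is better)."""
    return energy_joules * delay_seconds


def relative_improvement(baseline_edp, proposed_edp):
    """Fraction by which the proposed EDP is lower than the baseline EDP."""
    return (baseline_edp - proposed_edp) / baseline_edp
```

Because EDP multiplies both quantities, a design can improve it by saving energy even when execution time stays about the same, which matches the abstract's claim that performance is scarcely affected.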