Search CORE

103 research outputs found

The Tag Filter Architecture: An energy-efficient cache and directory design

Author: Gómez Requena María Engracia
Ros Bardisa Alberto
Sahuquillo Borrás Julio
Valls Joan
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017
Field of study

[EN] Power consumption in current high-performance chip multiprocessors (CMPs) has become a major design concern that aggravates with the current trend of increasing the core count. A significant fraction of the total power budget is consumed by on-chip caches which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance the system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. On the other hand, coherence protocols also must implement efficient directory caches that scale in terms of power consumption. Most of the state-of-the-art techniques that reduce the energy consumption of directories are at the cost of performance, which may become unacceptable for high-performance CMPs. In this paper, we propose an energy-efficient architectural design that can be effectively applied to any kind of cache memory. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, and just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the XX least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter the ways where the least significant bits of the tag do not match with the bits in the X-bit array. Experimental results show that, on average, the TF Architecture reduces the dynamic power consumption across the studied applications up to 74.9%74.9%, 85.9%85.9%, and 84.5%84.5% when applied to L1 caches, L2 caches, and directory caches, respectively.This work has been jointly supported by MINECO and European Commission (FEDER funds) under the project TIN2015-66972-C5-1-R/3-R and by Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia under the project Jóvenes Líderes en Investigación 18956/JLI/13.Valls, J.; Ros Bardisa, A.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2017). The Tag Filter Architecture: An energy-efficient cache and directory design. Journal of Parallel and Distributed Computing. 100:193-202. https://doi.org/10.1016/j.jpdc.2016.04.016S19320210

Crossref

RiuNet

A Survey of Techniques for Architecting TLBs

Author: Mittal Sparsh
Publication venue: 'Wiley'
Publication date: 01/01/2016
Field of study

“Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of TLB is important for improving performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers

Research Archive of Indian Institute of Technology Hyderabad

Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

Author: Bartolini Davide-Basilio
Bera Rahul
Bostanci F. Nisa
Kanellopoulos Konstantinos
Kumar Rakesh
Mutlu Onur
Nam Hong Chul
Sadrosadati Mohammad
Publication venue
Publication date: 13/10/2023
Field of study

Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU.Comment: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202

arXiv.org e-Print Archive

CROSS-LAYER CUSTOMIZATION PLATFORM FOR LOW-POWER AND REAL-TIME EMBEDDED APPLICATIONS

Author: Zhou Xiangrong
Publication venue
Publication date: 01/07/2008
Field of study

Modern embedded applications have become increasingly complex and diverse in their functionalities and requirements. Data processing, communication and multimedia signal processing, real-time control and various other functionalities can often need to be implemented on the same System-on-Chip(SOC) platform. The significant power constraints and real-time guarantee requirements of these applications have become significant obstacles for the traditional embedded system design methodologies. The general-purpose computing microarchitectures of these platforms are designed to achieve good performance on average, which is far from optimal for any particular application. The system must always assume worst-case scenarios, which results in significant power inefficiencies and resource under-utilization. This dissertation introduces a cross-layer application-customizable embedded platform, which dynamically exploits application information and fine-tunes system components at system software and hardware layers. This is achieved with the close cooperation and seamless integration of the compiler, the operating system, and the hardware architecture. The compiler is responsible for extracting application regularities through static and profile-based analysis. The relevant application knowledge is propagated and utilized at run-time across the system layers through the judiciously introduced reconfigurability at both OS and hardware layers. The introduced framework comprehensively covers the fundamental subsystems of memory management and multi-tasking execution control

Digital Repository at the University of Maryland

Improving Energy and Area Scalability of the Cache Hierarchy in CMPs

Author: Valls Mompó Joan Josep
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 07/04/2017
Field of study

As the core counts increase in each chip multiprocessor generation, CMPs should improve scalability in performance, area, and energy consumption to meet the demands of larger core counts. Directory-based protocols constitute the most scalable alternative. A conventional directory, however, suffers from an inefficient use of storage and energy. First, the large, non-scalable, sharer vectors consume unnecessary area and leakage, especially considering that most of the blocks tracked in a directory are cached by a single core. Second, although increasing directory size and associativity could boost system performance by reducing the coverage misses, it would come at the expense of area and energy consumption. This thesis focuses and exploits the important differences of behavior between private and shared blocks from the directory point of view. These differences claim for a separate management of both types of blocks at the directory. First, we propose the PS-Directory, a two-level directory cache that keeps the reduced number of frequently accessed shared entries in a small and fast first-level cache, namely Shared Directory Cache, and uses a larger and slower second-level Private Directory Cache to track the large amount of private blocks. Experimental results show that, compared to a conventional directory, the PS-Directory improves performance while also reducing silicon area and energy consumption. In this thesis we also show that the shared/private ratio of entries in the directory varies across applications and across different execution phases within the applications, which encourages us to propose Dynamic Way Partitioning (DWP) Directory. DWP-Directory reduces the number of ways with storage for shared blocks and it allows this storage to be powered off or on at run-time according to the dynamic requirements of the applications following a repartitioning algorithm. Results show similar performance as a traditional directory with high associativity, and similar area requirements as recent state-of-the-art schemes. In addition, DWP-Directory achieves notable static and dynamic power consumption savings. This dissertation also deals with the scalability issues in terms of power found in processor caches. A significant fraction of the total power budget is consumed by on-chip caches which are usually deployed with a high associativity degree (even L1 caches are being implemented with eight ways) to enhance the system performance. On a cache access, each way in the corresponding set is accessed in parallel, which is costly in terms of energy. This thesis presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting the performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways holding the kind of block looked up. Results show significant dynamic power consumption savings. Finally, we propose an energy-efficient architectural design that can be effectively applied to any kind of set-associative cache memory, not only to processor caches. The proposed approach, called the Tag Filter (TF) Architecture, filters the ways accessed in the target cache set, and just a few ways are searched in the tag and data arrays. This allows the approach to reduce the dynamic energy consumption of caches without hurting their access time. For this purpose, the proposed architecture holds the X least significant bits of each tag in a small auxiliary X-bit-wide array. These bits are used to filter the ways where the least significant bits of the tag do not match with the bits in the X-bit array. Experimental results show that this filtering mechanism achieves energy consumption in set-associative caches similar to direct mapped ones. Experimental results show that the proposals presented in this thesis offer a good tradeoff among these three major design axes.Conforme se incrementa el número de núcleos en las nuevas generaciones de multiprocesadores en chip, los CMPs deben de escalar en prestaciones, área y consumo energético para cumplir con las demandas de un número núcleos mayor. Los protocolos basados en directorio constituyen la alternativa más escalable. Un directorio convencional, no obstante, sufre de una utilización ineficiente de almacenamiento y energía. En primer lugar, los grandes y poco escalables vectores de compartidores consumen una cantidad de energía de fuga y de área innecesaria, especialmente si se tiene en consideración que la mayoría de los bloques en un directorio solo se encuentran en la cache de un único núcleo. En segundo lugar, aunque incrementar el tamaño y la asociatividad del directorio aumentaría las prestaciones del sistema, esto supondría un incremento notable en el consumo energético. Esta tesis estudia las diferencias significativas entre el comportamiento de bloques privados y compartidos en el directorio, lo que nos lleva hacia una gestión separada para cada uno de los tipos de bloque. Proponemos el PS-Directory, una cache de directorio de dos niveles que mantiene el reducido número de las entradas compartidas, que son los que se acceden con más frecuencia, en una estructura pequeña de primer nivel (concretamente, la Shared Directory Cache) y que utiliza una estructura más grande y lenta en el segundo nivel (Private Directory Cache) para poder mantener la información de los bloques privados. Los resultados experimentales muestran que, comparado con un directorio convencional, el PS-Directory consigue mejorar las prestaciones a la vez que reduce el área de silicio y el consumo energético. Ya que el ratio compartido/privado de las entradas en el directorio varia entre aplicaciones y entre las diferentes fases de ejecución dentro de las aplicaciones, proponemos el Dynamic Way Partitioning (DWP) Directory. El DWP-Directory reduce el número de vías que almacenan entradas compartidas y permite que éstas se enciendan o apaguen en tiempo de ejecución según los requisitos dinámicos de las aplicaciones según un algoritmo de reparticionado. Los resultados muestran unas prestaciones similares a un directorio tradicional de alta asociatividad y un área similar a otros esquemas recientes del estado del arte. Adicionalmente, el DWP-Directory obtiene importantes reducciones de consumo estático y dinámico. Esta disertación también se enfrenta a los problemas de escalabilidad que se pueden encontrar en las memorias cache. En un acceso a la cache, se accede a cada vía del conjunto en paralelo, siendo así un acción costosa en energía. Esta tesis presenta la arquitectura PS-Cache, un diseño energéticamente eficiente que reduce el número de vías accedidas sin perjudicar las prestaciones. La PS-Cache utiliza la información del estado privado-compartido del bloque referenciado para reducir la energía, ya que tan solo accedemos a un subconjunto de las vías que mantienen los bloques del tipo solicitado. Los resultados muestran unos importantes ahorros de energía dinámica. Finalmente, proponemos otro diseño de arquitectura energéticamente eficiente que se puede aplicar a cualquier tipo de memoria cache asociativa por conjuntos. La propuesta, la Tag Filter (TF) Architecture, filtra las vías accedidas en el conjunto de la cache, de manera que solo se mira un número reducido de vías tanto en el array de etiquetas como en el de datos. Esto permite que nuestra propuesta reduzca el consumo de energía dinámico de las caches sin perjudicar su tiempo de acceso. Los resultados experimentales muestran que este mecanismo de filtrado es capaz de obtener un consumo energético en caches asociativas por conjunto similar de las caches de mapeado directo. Los resultados experimentales muestran que las propuestas presentadas en esta tesis consiguen un buen compromiso entre estos tres importantes pilares de diseño.Conforme s'incrementen el nombre de nuclis en les noves generacions de multiprocessadors en xip, els CMPs han d'escalar en prestacions, àrea i consum energètic per complir en les demandes d'un nombre de nuclis major. El protocols basats en directori són l'alternativa més escalable. Un directori convencional, no obstant, pateix una utilització ineficient d'emmagatzematge i energia. En primer lloc, els grans i poc escalables vectors de compartidors consumeixen una quantitat d'energia estàtica i d'àrea innecessària, especialment si es considera que la majoria dels blocs en un directori només es troben en la cache d'un sol nucli. En segon lloc, tot i que incrementar la grandària i l'associativitat del directori augmentaria les prestacions del sistema, això suposaria un increment notable en el consum d'energia. Aquesta tesis estudia les diferències significatives entre el comportament de blocs privats i compartits dins del directori, la qual cosa ens guia cap a una gestió separada per a cada un dels tipus de bloc. Proposem el PS-Directory, una cache de directori de dos nivells que manté el reduït nombre de les entrades de blocs compartits, que són els que s'accedeixen amb més freqüència, en una estructura menuda de primer nivell (concretament, la Shared Directory Cache) i que empra una estructura més gran i lenta en el segon nivell (Private Directory Cache) per poder mantenir la informació dels blocs privats. Els resultats experimentals mostren que, comparat amb un directori convencional, el PS-Directory aconsegueix millorar les prestacions a la vegada que redueix l'àrea de silici i el consum energètic. Ja que la ràtio compartit/privat de les entrades en el directori varia entre aplicacions i entre les diferents fases d'execució dins de les aplicacions, proposem el Dynamic Way Partitioning (DWP) Directory. DWP-Directory redueix el nombre de vies que emmagatzemen entrades compartides i permeten que aquest s'encengui o apagui en temps d'execució segons els requeriments dinàmics de les aplicacions seguint un algoritme de reparticionat. Els resultats mostren unes prestacions similars a un directori tradicional d'alta associativitat i una àrea similar a altres esquemes recents de l'estat de l'art. Adicionalment, el DWP-Directory obté importants reduccions de consum estàtic i dinàmic. Aquesta dissertació també s'enfronta als problemes d'escalabilitat que es poden tro- bar en les memòries cache. Les caches on-chip consumeixen una part significativa del consum total del sistema. Aquestes caches implementen un alt nivell d'associativitat. En un accés a la cache, s'accedeix a cada via del conjunt en paral·lel, essent així una acció costosa en energia. Aquesta tesis presenta l'arquitectura PS-Cache, un disseny energèticament eficient que redueix el nombre de vies accedides sense perjudicar les prestacions. La PS-Cache utilitza la informació de l'estat privat-compartit del bloc referenciat per a reduir energia, ja que només accedim al subconjunt de vies que mantenen blocs del tipus sol·licitat. Els resultats mostren uns importants estalvis d'energia dinàmica. Finalment, proposem un altre disseny d'arquitectura energèticament eficient que es pot aplicar a qualsevol tipus de memòria cache associativa per conjunts. La proposta, la Tag Filter (TF) Architecture, filtra les vies accedides en el conjunt de la cache, de manera que només un reduït nombre de vies es miren tant en el array d'etiquetes com en el de dades. Això permet que la nostra proposta redueixi el consum dinàmic energètic de les caches sense perjudicar el seu temps d'accés. Els resultats experimentals mostren que aquest mecanisme de filtre és capaç d'obtenir un consum energètic en caches associatives per conjunt similar al de les caches de mapejada directa. Els resultats experimentals mostren que les propostes presentades en aquesta tesis conseguixen un bon compromís entre aquestros tres importants pilars de diseny.Valls Mompó, JJ. (2017). Improving Energy and Area Scalability of the Cache Hierarchy in CMPs [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79551TESI

Crossref

RiuNet

Recommended from our members

Improving virtual memory performance in virtualized environments

Author: Marathe Yashwant
Publication venue
Publication date: 25/02/2021
Field of study

Virtual Memory is a major system performance bottleneck in virtualized environments. In addition to expensive address translations, frequent virtual machine context switches are common in virtualized environments, resulting in increased TLB miss rates, subsequent expensive page walks and data cache contention due to incoming page table entries evicting useful data. Orthogonally, translation coherence, which is currently an expensive operation implemented in software, can consume up to 50% of the runtime of an application executing on the guest. To improve the performance of virtual memory in virtualized environments, two solutions have been proposed in this thesis - namely, (1) Context Switch Aware Large TLB (CSALT), an architecture which addresses the problem of increased TLB miss rates and their adverse impact on data caches. CSALT copes with the increased demand of context switches by storing a large number TLB entries. It mitigates data cache contention by employing a novel TLB-aware cache partitioning scheme. On 8-core systems that switch between two virtual machine contexts executing multi-threaded workloads, CSALT achieves an average performance improvement of 85% over a baseline with conventional L1-L2 TLBs and 25% over a baseline which has a large L3 TLB (2) Translation Coherence using Addressable TLBs (TCAT), a hardware translation coherence scheme which eliminates almost all of the overheads associated with address translation coherence. TCAT overlays translation coherence atop cache coherence to accurately identify slave cores. It then leverages the addressable Part-Of-Memory TLB (POM-TLB) to eliminate expensive Inter Processor Interrupts (IPI) and achieve precise invalidations on the slave core. On 8-core systems with one virtual machine context executing multi-threaded workloads, TCAT achieves an average performance improvement of 13% over the kvmtlb baselineElectrical and Computer Engineerin

Texas ScholarWorks

PS-Cache: an energy-efficient cache design for chip multiprocessors

Author: Gómez Requena María Engracia
Ros Bardisa Alberto
Sahuquillo Borrás Julio
Valls Joan J
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1288-5Power consumption has become a major design concern in current high-performance chip multiprocessors, and this problem exacerbates with the number of core counts. A significant fraction of the total power budget is often consumed by on-chip caches, thus important research has focused on reducing energy consumption in these structures. To enhance performance, on-chip caches are being deployed with a high associativity degree. Consequently, accessing concurrently all the ways in the cache set is costly in terms of energy. This paper presents the PS-Cache architecture, an energy-efficient cache design that reduces the number of accessed ways without hurting the performance. The PS-Cache takes advantage of the private-shared knowledge of the referenced block to reduce energy by accessing only those ways holding the kind of block looked up. Experimental results show that, on average, the PS-Cache architecture can reduce the dynamic energy consumption of L1 and L2 caches by 22 and 40%, respectively.This work has been jointly supported by the MINECO and European Commission (FEDER funds) under the project TIN2012-38341-C04-01 and the Fundaci’on Seneca-Agencia de Ciencia y Tecnología de la Región de Murcia under the project Jóvenes Líderes en Investigación 18956/JLI/13.Valls, JJ.; Ros Bardisa, A.; Sahuquillo Borrás, J.; Gómez Requena, ME. (2015). PS-Cache: an energy-efficient cache design for chip multiprocessors. Journal of Supercomputing. 71(1):67-86. https://doi.org/10.1007/s11227-014-1288-5S6786711Balasubramonian R, Jouppi NP, Muralimanohar N (2011) Multi-core cache hierarchies. In: Synthesis lectures on computer architecture. Morgan & Claypool Publishers, San RafaelHennessy JL, Patterson DA (2011) Computer architecture, fifth edition: a quantitative approach, 5th edn. Morgan Kaufmann Publishers Inc., San FranciscoSinharoy B, Kalla R, Starke WJ, Le HQ, Cargnoni R, Van Norstrand JA, Ronchetti BJ, Stuecheli J, Leenstra J, Guthrie GL, Nguyen DQ, Blaner B, Marino CF, Retter E, Williams P (2011) IBM POWER7 multicore server processor. IBM J Res Dev 5(3):1:1-1:29 doi: 10.1147/JRD.2011.2127330Kaxiras S, Hu Z, Martonosi M (2011) 28th International symposium on computer architecture (ISCA), pp 240–251Flautner K, Kim NS, Martin S, Blaauw D, Kaxiras TM, Hu Z, Martonosi M (2002) 29th International symposium on computer architecture (ISCA), pp 148–157Ghosh M, Özer E, Ford S, Biles S, Lee HHS (2009) International symposium on low power electronics and design (ISLPED), pp 165–170Calder B, Grunwald D (1996) 2nd international symposium on high-performance computer architecture (HPCA) (1996), pp 244–253Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2009) 36th international symposium on computer architecture (ISCA), pp 184–195Cuesta B, Ros A, Gómez ME, Robles A, Duato J (2011) 38th international symposium on computer architecture (ISCA), pp 93–103Pugsley SH, Spjut JB, Nellans DW, Balasubramonian R (2010) 19th international conference on parallel architectures and compilation techniques (PACT), pp 465–476Hossain H, Dwarkadas S, Huang MC (2011) 20th international conference on parallel architectures and compilation techniques (PACT), pp 45–55Kim D, Kim JAJ, Huh J (2010) 19th international conference on parallel architectures and compilation techniques (PACT), pp 111–122Ros A, Kaxiras S (2012) 21st international conference on parallel architectures and compilation techniques (PACT), pp 241–252Sundararajan KT, Porpodas V, Jones TM, Topham NP, Franke B (2012) 18th international symposium on high-performance computer architecture (HPCA), pp 311–322Agarwal N, Peh LS, Jha NK (2009) 15th international symposium on high-performance computer architecture (HPCA), pp 67–78Cantin JF, Smith JE, Lipasti MH, Moshovos A, Falsafi B (2006) Coarse-grain coherence tracking: región scout and region coherence arrays. IEEE Micro 26(1):70–95Ferdman M, Lotfi-Kamran P, Balet K, Falsafi B (2011) 17th international symposium on high-performance computer architecture (HPCA), pp 169–180Zebchuk J, Srinivasan V, Qureshi MK, Moshovos A (2009) 42nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 423–434Powell M, Hyun Yang S, Falsafi B, Roy K, Vijaykumar TN (2000) International symposium on low power electronics and design (ISLPED), pp 90–95Albonesi DH (1999) 32nd IEEE/ACM international symposium on microarchitecture (MICRO), pp 248–259Zhang C, Vahid F, Yang J, Najjar W (2005) A way-halting cache for low-energy high-performance systems. ACM Transactions on Architecture and Code Optimization. 2(1):34–54Ghosh M, Özer E, Biles S, Lee HHS (2006) 19th international conference on architecture of computing systems (ARCS), pp 283–297Lee J, Hong S, Kim S (2011) 17th international symposium on low power electronics and design (ISLPED), pp 85–90Kedzierski K, Cazorla FJ, Gioiosa R, Buyuktosunoglu A, Valero M (2010) 2nd international forum on next-generation multicore/manycore technologies, pp 1–12Alouani I, Niar S, Kurdahi F, Abid M (2012) 23rd IEEE international symposium on rapid system prototyping (RSP), pp 44–48Meng J, Skadron K (2009) International conference on computer design (ICCD), pp 282–288Li Y, Abousamra A, Melhem R, Jones AK (2010) 19th international conference on parallel architectures and compilation techniques (PACT), pp 501–512Li Y, Melhem RG, Jones AK (2012) 21st international conference on parallel architectures and compilation techniques (PACT), pp 231–240Alisafaee M (2012) 45th IEEE/ACM international symposium on microarchitecture (MICRO), pp 341–350Jiang G, Fen D, Tong L, Xiang L, Wang C, Chen T (2009) 8th international symposium on advanced parallel processing technologies. Springer, Berlin, pp 123–133Sundararajan K, Jones T, Topham N (2013) IEEE 31st international conference on computer design (ICCD), pp 294–301Valls JJ, Ros A, Sahuquillo J, Gómez ME, Duato J (2012) 21st international conference on parallel architectures and compilation techniques (PACT), pp 451–452Ros A, Cuesta B, Gómez ME, Robles A, Duato J (2013) 42nd international conference on parallel processing (ICPP), pp 562–571Jacob B, Ng S, Wang D (2007) Memory systems: cache, DRAM, disk, 4th edn. Morgan Kaufmann Publishers Inc., San FranciscoPatterson DA, Hennessy JL (2008) Computer organization and design: the hardware/software interface. The Morgan Kaufmann Series in Computer Architecture and Design, 4th edn. Morgan Kaufmann Publishers Inc., San FranciscoMagnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50–58Martin MM, Sorin DJ, Beckmann BM, Marty MR, Xu M, Alameldeen AR, Moore KE, Hill MD, Wood DA (2005) Multifacet’s general execution-driven multiprocessor simulator GEMS toolset. Comput Archit News 33(4):92–99Agarwal N, Krishna T, Peh LS, Jha NK (2009) IEEE international symposium on performance analysis of systems and software (ISPASS), pp 33–42Muralimanohar N, Balasubramonian R, Jouppi NP (2009) Cacti 6.0. Tech. Rep. HPL-2009-85, HP LabsWoo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) 22nd international symposium on computer architecture (ISCA), pp 24–36Li ML, Sasanka R, Adve SV, Chen YK, Debes E (2005) International symposium on workload characterization, pp 34–45Bienia C, Kumar S, Singh JP, Li K (2008) 17th international conference on parallel architectures and compilation techniques (PACT), pp 72–8

RiuNet

An efficient design space exploration framework to optimize power-efficient heterogeneous many-core multi-threading embedded processor architectures

Author: Datta Kushal
NC DOCKS at The University of North Carolina at Charlotte
Publication venue
Publication date: 01/01/2011
Field of study

By the middle of this decade, uniprocessor architecture performance had hit a roadblock due to a combination of factors, such as excessive power dissipation due to high operating frequencies, growing memory access latencies, diminishing returns on deeper instruction pipelines, and a saturation of available instruction level parallelism in applications. An attractive and viable alternative embraced by all the processor vendors was multi-core architectures where throughput is improved by using micro-architectural features such as multiple processor cores, interconnects and low latency shared caches integrated on a single chip. The individual cores are often simpler than uniprocessor counterparts, use hardware multi-threading to exploit thread-level parallelism and latency hiding and typically achieve better performance-power figures. The overwhelming success of the multi-core microprocessors in both high performance and embedded computing platforms motivated chip architects to dramatically scale the multi-core processors to many-cores which will include hundreds of cores on-chip to further improve throughput. With such complex large scale architectures however, several key design issues need to be addressed. First, a wide range of micro- architectural parameters such as L1 caches, load/store queues, shared cache structures and interconnection topologies and non-linear interactions between them define a vast non-linear multi-variate micro-architectural design space of many-core processors; the traditional method of using extensive in-loop simulation to explore the design space is simply not practical. Second, to accurately evaluate the performance (measured in terms of cycles per instruction (CPI)) of a candidate design, the contention at the shared cache must be accounted in addition to cycle-by-cycle behavior of the large number of cores which superlinearly increases the number of simulation cycles per iteration of the design exploration. Third, single thread performance does not scale linearly with number of hardware threads per core and number of cores due to memory wall effect. This means that at every step of the design process designers must ensure that single thread performance is not unacceptably slowed down while increasing overall throughput. While all these factors affect design decisions in both high performance and embedded many-core processors, the design of embedded processors required for complex embedded applications such as networking, smart power grids, battlefield decision-making, consumer electronics and biomedical devices to name a few, is fundamentally different from its high performance counterpart because of the need to consider (i) low power and (ii) real-time operations. This implies the design objective for embedded many-core processors cannot be to simply maximize performance, but improve it in such a way that overall power dissipation is minimized and all real-time constraints are met. This necessitates additional power estimation models right at the design stage to accurately measure the cost and reliability of all the candidate designs during the exploration phase. In this dissertation, a statistical machine learning (SML) based design exploration framework is presented which employs an execution-driven cycle- accurate simulator to accurately measure power and performance of embedded many-core processors. The embedded many-core processor domain is Network Processors (NePs) used to processed network IP packets. Future generation NePs required to operate at terabits per second network speeds captures all the aspects of a complex embedded application consisting of shared data structures, large volume of compute-intensive and data-intensive real-time bound tasks and a high level of task (packet) level parallelism. Statistical machine learning (SML) is used to efficiently model performance and power of candidate designs in terms of wide ranges of micro-architectural parameters. The method inherently minimizes number of in-loop simulations in the exploration framework and also efficiently captures the non-linear interactions between the micro-architectural design parameters. To ensure scalability, the design space is partitioned into (i) core-level micro-architectural parameters to optimize single core architectures subject to the real-time constraints and (ii) shared memory level micro- architectural parameters to explore the shared interconnection network and shared cache memory architectures and achieves overall optimality. The cost function of our exploration algorithm is the total power dissipation which is minimized, subject to the constraints of real-time throughput (as determined from the terabit optical network router line-speed) required in IP packet processing embedded application

The University of North Carolina at Greensboro